Closed JWDebler closed 3 years ago
Well, one workaround seems to be to set the --batch_size crazy high, like 1000000. You then still have to translate batch0.fast5 to the corresponding barcode.
Hi @JWDebler -- check out fast5_subset, which will take a summary file or a list of read_ids and extract all of those from your input set. You could use that to subset your fast5s if you get the read_ids for each barcode into their own file. I realize this isn't a perfect solution, and we've considered adding a script to directly separate out by barcode, but we haven't gotten around to it yet.
Hi @fbrennen, that is what I have tried. But with default parameters I just get hundreds of batchxx.fast5 files, and I then have to work out from filename_mapping.txt which ones belong to which barcode. If I set the --batch_size parameter really high, at least all reads with the same barcode end up in the same batch file, but I still have to look up which file corresponds to which barcode. Ideally this would be integrated directly into guppy, and guppy's fast5_out option would then separate your fast5s by barcode, just like it does the fastqs. But in the meantime, yes, an additional script (or flag for fast5_subset) that looks up the read ID in the sequencing_summary and then creates a folder structure the way guppy does during demultiplexing would be awesome.
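The lookup described above (which batch file holds which barcode) could be sketched roughly as follows. This is only an illustration under assumptions: it assumes filename_mapping.txt is tab-separated with a header line and the columns read_id and filename in that order, and that the sequencing summary has header columns named read_id and barcode_arrangement. Check both files before relying on it. The function name batch_to_barcode is made up for this example.

```bash
# Sketch: list which barcodes occur in which batch file.
# ASSUMPTIONS: the mapping file is "read_id<TAB>filename" with a header line;
# the summary has header columns "read_id" and "barcode_arrangement".
batch_to_barcode() {
    local summary=$1 mapping=$2
    awk -F '\t' '
        # First file (summary): remember read_id -> barcode
        FNR == NR {
            if (FNR == 1) {
                for (i = 1; i <= NF; i++) {
                    if ($i == "read_id") id_col = i
                    if ($i == "barcode_arrangement") bc_col = i
                }
                if (!id_col || !bc_col) {
                    print "required columns not found" > "/dev/stderr"
                    exit 1
                }
                next
            }
            barcode[$id_col] = $bc_col
            next
        }
        # Second file (mapping): skip its header, print filename and barcode
        FNR > 1 && ($1 in barcode) { print $2 "\t" barcode[$1] }
    ' "$summary" "$mapping" | sort -u
}
```

Called as `batch_to_barcode sequencing_summary.txt filename_mapping.txt`, it prints one line per distinct batch file and barcode pair, so a batch file containing mixed barcodes shows up more than once.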
Here is what sort of works, but could be turned into a nice bash script:
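# Column 21 is assumed to hold the barcode; check the header of your own summary file
cut -f 21 folder_with_fastq_files/sequencing_summary.txt | grep 'barcode[0-9]' | sort -u > barcodes.txt
while read -r barcode
do
    # Keep the header line, then append only the rows for this barcode
    head -n 1 folder_with_fastq_files/sequencing_summary.txt > "$barcode.txt"
    grep -w "$barcode" folder_with_fastq_files/sequencing_summary.txt >> "$barcode.txt"
    # Subset with this barcode's own list only (not every *.txt in the folder)
    fast5_subset -i allfast5 -s "demultiplexed_fast5/$barcode" -l "$barcode.txt" -t 14
done < barcodes.txt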
Now you end up with a folder structure that looks like what you get after guppy demultiplexing :-)
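For anyone whose summary file has the barcode in a different column, a variant of the first step could look up the columns by header name instead of cutting a fixed field. The sketch below assumes a tab-separated summary with header columns named read_id and barcode_arrangement, and writes plain read_id lists (one per barcode) that can be passed to fast5_subset with -l. The function name split_read_ids_by_barcode is hypothetical.

```bash
# Sketch: write one plain read_id list per barcode from a sequencing summary.
# ASSUMPTION: tab-separated file with header columns "read_id" and
# "barcode_arrangement"; adjust the names if your summary differs.
split_read_ids_by_barcode() {
    local summary=$1 outdir=$2
    mkdir -p "$outdir"
    awk -F '\t' -v outdir="$outdir" '
        NR == 1 {
            # Locate the two columns by name rather than by position
            for (i = 1; i <= NF; i++) {
                if ($i == "read_id") id_col = i
                if ($i == "barcode_arrangement") bc_col = i
            }
            if (!id_col || !bc_col) {
                print "required columns not found" > "/dev/stderr"
                exit 1
            }
            next
        }
        # Keep barcodeNN rows only; unclassified reads are skipped
        $bc_col ~ /^barcode[0-9]+$/ {
            print $id_col > (outdir "/" $bc_col ".txt")
        }
    ' "$summary"
}
```

For example, `split_read_ids_by_barcode folder_with_fastq_files/sequencing_summary.txt read_lists` would leave files such as read_lists/barcode01.txt, each containing one read_id per line.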
Hi @JWDebler -- that looks like a good intermediate solution. fast5_subset also has an argument --filename_base which you can use to change that batch_ to barcodeXX_:
fast5_subset -i <folder> -l barcodeXX_list.txt -f barcodeXX
The reason guppy doesn't do this automatically for fast5 files is because those kinds of hdf operations are very slow, and will dramatically reduce the speed of basecalling unless we make some largish modifications. We will try and get an additional script to do this for you though.
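The --filename_base suggestion could be folded into a loop over per-barcode list files, roughly as sketched here. It uses only the fast5_subset flags that appear in this thread (-i, -s, -l, -f). The function name subset_by_barcode and the FAST5_SUBSET override variable are made up for this example; the override exists only so the loop can be dry-run with echo.

```bash
# Sketch: run fast5_subset once per barcode list, naming output files
# after the barcode instead of "batch".
# ASSUMPTION: list files are named <barcode>.txt (e.g. barcode01.txt).
# FAST5_SUBSET is a hypothetical override for dry runs, e.g. FAST5_SUBSET=echo.
subset_by_barcode() {
    local fast5_dir=$1 list_dir=$2 out_dir=$3
    local tool=${FAST5_SUBSET:-fast5_subset}
    local list barcode
    for list in "$list_dir"/barcode*.txt; do
        [ -e "$list" ] || continue          # no matching lists: do nothing
        barcode=$(basename "$list" .txt)
        "$tool" -i "$fast5_dir" -s "$out_dir/$barcode" -l "$list" -f "$barcode"
    done
}
```

Running `FAST5_SUBSET=echo subset_by_barcode allfast5 read_lists demultiplexed_fast5` first prints the arguments that would be passed, which is a cheap way to check paths before the slow HDF5 copying starts.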
Hi @fbrennen
Is the summary file the sequencing_summary.txt that comes directly from the MinION run, or the one output by the guppy basecaller?
Hi @aspitaleri, yes, it is the sequencing_summary.txt
Thanks. Is it the file written after basecalling with guppy, or the file produced by the MinION during the run?
Hi @aspitaleri -- both guppy and MinKNOW are capable of producing summary files, but there should be only one produced from any particular action:
MinKNOW writes a sequencing_summary.txt file during your run.
guppy writes a sequencing_summary.txt file as it runs.
Is it possible to directly subset fast5s by barcode? I saw the 'filename_mapping.txt' file which one could compare to the 'sequencing_summary.txt' to eventually tease out the files belonging to the same barcode. But couldn't that be implemented directly as an option into fast5_subset? Or is this maybe already possible? Cheers