Question: subset by barcode

JWDebler commented 3 years ago

Is it possible to directly subset fast5s by barcode? I saw the 'filename_mapping.txt' file which one could compare to the 'sequencing_summary.txt' to eventually tease out the files belonging to the same barcode. But couldn't that be implemented directly as an option into fast5_subset? Or is this maybe already possible? Cheers

JWDebler commented 3 years ago

Well, one workaround seems to be to set the --batch_size crazy high, like 1000000. You then still have to translate batch0.fast5 to the corresponding barcode.

fbrennen commented 3 years ago

Hi @JWDebler -- check out fast5_subset, which will take a summary file or a list of read_ids and extract all of those from your input set. You could use that to subset your fast5s if you get the read_ids for each barcode into their own file. I realize this isn't a perfect solution, and we've considered adding a script to directly separate out by barcode, but we haven't gotten around to it yet.

JWDebler commented 3 years ago

Hi @fbrennen, that is what I have tried. But with default parameters I just get hundreds of batchxx.fast5 files, and I then have to work out from filename_mapping.txt which ones belong to which barcode. If I set the --batch_size parameter really high, at least all reads with the same barcode end up in the same batch file, but I still have to look up which file corresponds to which barcode. Ideally this would be integrated directly into guppy, so that guppy's fast5_out option separates your fast5s by barcode, just like it does the fastqs. But in the meantime, yes, an additional script (or flag for fast5_subset) that looks up the read ID in the sequencing_summary and then creates a folder structure the way guppy does during demultiplexing would be awesome.

JWDebler commented 3 years ago

Here is what sort of works, but could be turned into a nice bash script:

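# List the distinct barcodes (column 21 of this summary file holds the barcode call)
cut -f 21 folder_with_fastq_files/sequencing_summary.txt | grep 'barcode[0-9]' | sort -u > barcodes.txt

while read -r barcode
do
    # Per-barcode summary: the header line plus every read assigned to this barcode
    head -n 1 folder_with_fastq_files/sequencing_summary.txt > "$barcode.txt"
    grep "$barcode" folder_with_fastq_files/sequencing_summary.txt >> "$barcode.txt"

    # Extract only this barcode's reads into its own folder
    # (looping over *.txt here would also pick up barcodes.txt and the other barcodes' lists)
    fast5_subset -i allfast5 -s "demultiplexed_fast5/$barcode" -l "$barcode.txt" -t 14
done < barcodes.txt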

Now you end up with a folder structure that looks like what you get after guppy demultiplexing :-)
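
The cut -f 21 step above depends on the column order of one particular summary file, which may differ between software versions. Below is a minimal sketch of a variant that finds the columns by header name and writes one plain read-ID list per barcode. It assumes a tab-separated summary whose header contains columns named read_id and barcode_arrangement (check the header of your own file; these names are an assumption here). The function name and the output directory are hypothetical.

```shell
# Split a sequencing summary into one read-ID list per barcode.
# Usage: split_read_ids_by_barcode <summary.txt> <output_dir>
# Assumption: the header row names the columns "read_id" and "barcode_arrangement".
split_read_ids_by_barcode() {
    mkdir -p "$2" || return 1
    awk -F'\t' -v outdir="$2" '
        NR == 1 {
            # Locate both columns by name rather than by position.
            for (i = 1; i <= NF; i++) {
                if ($i == "read_id") id = i
                if ($i == "barcode_arrangement") bc = i
            }
            if (!id || !bc) {
                print "read_id or barcode_arrangement column not found" > "/dev/stderr"
                exit 1
            }
            next
        }
        # Skip unclassified reads; write each read ID to <outdir>/<barcode>.txt.
        $bc ~ /^barcode[0-9]+$/ { print $id > (outdir "/" $bc ".txt") }
    ' "$1"
}
```

For example, split_read_ids_by_barcode folder_with_fastq_files/sequencing_summary.txt read_id_lists would leave files such as read_id_lists/barcode01.txt, each of which can be passed to fast5_subset with -l.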

fbrennen commented 3 years ago

Hi @JWDebler -- that looks like a good intermediate solution. fast5_subset also has an argument --filename_base which you can use to change that batch_ to barcodeXX_: fast5_subset -i <folder> -l barcodeXX_list.txt -f barcodeXX
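
Applied once per barcode, the --filename_base suggestion above might look like the following sketch. It uses only the fast5_subset flags already mentioned in this thread (-i, -s, -l, -f). The function name, the DRY_RUN switch, and the assumption that the read-ID lists are named barcodeXX.txt in a single directory are all hypothetical.

```shell
# Run fast5_subset once per read-ID list, naming the output files after the barcode.
# Usage: subset_by_barcode <lists_dir> <input_fast5_dir> <output_dir>
# Set DRY_RUN=1 to print the commands instead of running them.
subset_by_barcode() {
    lists_dir="$1"; input_dir="$2"; output_dir="$3"
    # barcode[0-9]* skips a barcodes.txt index file sitting in the same directory.
    for list in "$lists_dir"/barcode[0-9]*.txt; do
        [ -e "$list" ] || continue           # glob matched nothing
        barcode=$(basename "$list" .txt)     # e.g. barcode01
        # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set and non-empty.
        ${DRY_RUN:+echo} fast5_subset -i "$input_dir" -s "$output_dir/$barcode" \
            -l "$list" -f "$barcode"
    done
}
```

Running DRY_RUN=1 subset_by_barcode read_id_lists allfast5 demultiplexed_fast5 first is a cheap way to check the generated commands before committing to a long extraction.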

The reason guppy doesn't do this automatically for fast5 files is that those kinds of HDF operations are very slow and would dramatically reduce the speed of basecalling unless we made some largish modifications. We will try to get an additional script to do this for you, though.

aspitaleri commented 3 years ago

Hi @fbrennen

> Hi @JWDebler -- check out fast5_subset, which will take a summary file or a list of read_ids and extract all of those from your input set. You could use that to subset your fast5s if you get the read_ids for each barcode into their own file. I realize this isn't a perfect solution, and we've considered adding a script to directly separate out by barcode, but we haven't gotten around to it yet.

Is the summary file the sequencing_summary.txt that comes directly from the MinION run, or the one output by the guppy basecaller?

JWDebler commented 3 years ago

Hi @aspitaleri, yes, it is the sequencing_summary.txt

aspitaleri commented 3 years ago

Thanks. Is it the file written after basecalling with guppy, or the file produced by the MinION during the run?

fbrennen commented 3 years ago

Hi @aspitaleri -- both guppy and MinKNOW are capable of producing summary files, but there should be only one produced from any particular action:

If you have enabled live basecalling, then MinKNOW will produce a sequencing_summary.txt file during your run.
If you are basecalling with guppy on its own, or if you are using MinKNOW's post-run basecalling, then guppy will produce a sequencing_summary.txt file as it runs.

fbrennen commented 3 years ago

Hi both,

We've put together a dedicated demultiplexing tool, fast5_demux -- give it a try and see what you think. Instructions are on the main page of the ont_fast5_api repository.

nanoporetech / ont_fast5_api

Question: subset by barcode #45