nanoporetech / ont_fast5_api

Oxford Nanopore Technologies fast5 API software

Can I subsample Nanopore fast5/fastq with this tool? #36

Closed: jolespin closed this issue 4 years ago

jolespin commented 4 years ago

I need to subsample some Nanopore reads, but I have multiple multi-read fast5 files.

Can I merge these types of files with your tool?

I was thinking about loading all of the files using get_fast5_file and then writing the files together.

How can I write fast5 files?

Is there a way to keep the length distribution consistent with the original read set?
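For reference, here is roughly what I had in mind for the reading side -- just a minimal sketch based on the get_fast5_file example in the README, with a placeholder directory name:

```python
from pathlib import Path

from ont_fast5_api.fast5_interface import get_fast5_file

# Placeholder: point this at the directory holding the multi-read fast5 files
fast5_dir = Path("fast5_pass")

for fast5_path in sorted(fast5_dir.glob("*.fast5")):
    # get_fast5_file handles both single- and multi-read fast5 files
    with get_fast5_file(str(fast5_path), mode="r") as f5:
        for read in f5.get_reads():
            raw_data = read.get_raw_data()
            print(read.read_id, len(raw_data))
```

It's the "writing the files back out together" part that I'm unsure about.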

fbrennen commented 4 years ago

Hi @jolespin -- check out fast5_subset:

https://github.com/nanoporetech/ont_fast5_api#fast5_subset

That should do exactly what you're after.

jolespin commented 4 years ago

Awesome! Yeah, I'm running nanopolish call-methylation right now and it is taking forever on some of the PromethION runs. Thanks!

Are there any plans to add a merge_multifast5-type script?

jolespin commented 4 years ago

Also, what is the summary file that can be used for this argument? `-l, --read_id_list <(file) either a sequencing_summary.txt file or a file containing a list of read_ids>`
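In case it helps to be concrete, this is roughly what I was picturing for turning a basecaller summary into a plain read_id list -- just a sketch, and I'm assuming the summary is tab-separated with a read_id column:

```python
import pandas as pd

# Assumption: the tab-separated summary produced by the basecaller
summary = pd.read_csv("sequencing_summary.txt", sep="\t")

# One read_id per line, i.e. the plain-list form --read_id_list also accepts
summary["read_id"].to_csv("read_ids.txt", index=False, header=False)
```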

fbrennen commented 4 years ago

Hi @jolespin -- I'd recommend coming into the Nanopore Community site, checking out our documentation, and asking general questions there -- it's a friendly bunch of people and our CS reps can answer most of your questions.

With regards to a merge_multifast5, I'm not quite sure what you're after -- if you want to merge fast5 files together, then that is essentially a "subset" of all the reads in your files with a very large output batch size (i.e. fast5_subset -i <your fast5 directory> -s <somewhere> -l <a list of all the read_ids in your files> -n <a large number>).

I would not recommend doing that, though: some versions of hdf5 have issues with particularly large files, and merging will reduce your ability to operate on files in parallel (which matters because hdf5 has a global lock for file operations, so you can only do one operation at a time within a single process).
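If it helps, a minimal sketch of one way to build that read_id list from a directory of fast5 files, using the same get_fast5_file interface as in the README (paths here are placeholders):

```python
from pathlib import Path

from ont_fast5_api.fast5_interface import get_fast5_file

fast5_dir = Path("fast5_pass")           # placeholder: your fast5 directory
read_id_list = Path("all_read_ids.txt")  # file to pass to -l/--read_id_list

with read_id_list.open("w") as out:
    for fast5_path in sorted(fast5_dir.glob("*.fast5")):
        with get_fast5_file(str(fast5_path), mode="r") as f5:
            for read in f5.get_reads():
                out.write(read.read_id + "\n")
```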

jolespin commented 4 years ago

Thanks, I will look into this. It looks like a paid account is required to post in the Community. From your response, it seems like the best option is the following (sketched below):

1. Identify a threshold for the total number of bases.
2. Sort the reads by quality.
3. Sum the number of bases across reads until the threshold is reached.
4. Export the fast5 files for that subset.
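Something like this rough sketch is what I have in mind, assuming the Guppy sequencing_summary.txt column names (sequence_length_template and mean_qscore_template), which may differ between basecaller versions:

```python
import pandas as pd

# Hypothetical target: ~20x of the ~2.87 Gb Rnor_6.0 genome; adjust to taste
TARGET_BASES = 20 * 2_870_184_193

# Assumption: Guppy-style column names in the sequencing summary
summary = pd.read_csv("sequencing_summary.txt", sep="\t")

# Sort reads best-quality first, then keep reads until the base threshold is reached
summary = summary.sort_values("mean_qscore_template", ascending=False)
cumulative_bases = summary["sequence_length_template"].cumsum()
subset = summary[cumulative_bases <= TARGET_BASES]

# Write the read IDs for fast5_subset's -l/--read_id_list argument
subset["read_id"].to_csv("subset_read_ids.txt", index=False, header=False)
```

Then fast5_subset with -l subset_read_ids.txt should pull that subset out of the multi-read fast5 files.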

Here's the distribution of my total bases:

[image: distribution of total bases]

My reference genome is Rattus_norvegicus.Rnor_6.0, which is made up of 955 scaffolds totaling 2,870,184,193 bases. I'm not sure if this information helps.

fbrennen commented 4 years ago

Hi @jolespin -- apologies; we're looking into better Community access that will allow you to post questions if you're not a paying customer (though if you work with one, I believe they can give you access). What you've posted above seems generally reasonable to me, though I'm not much of a bioinformatician.

If you think you've got enough to work with for now then I'll go ahead and close this.

jolespin commented 4 years ago

Yeah, I need to look into that. There are a few people listed from my institute, but I don't think they are currently with us; I'll try contacting them separately. Yes, closing the issue is fine -- you were very helpful. Thanks again.