mbhall88 / rasusa

Randomly subsample sequencing reads or alignments
https://doi.org/10.21105/joss.03941
MIT License

Multi-threading approach #55

Open Teklu67 opened 2 years ago

Teklu67 commented 2 years ago

Hi, this is a very useful program, but it is taking a long time to subsample from a large fastq file. I am running it on a server and would like to run it with multi-threading, but I am a novice at programming and not sure how to do that. Any help, please? Thanks,

mbhall88 commented 2 years ago

Hi @Teklu67. When you say "a long time", how long are we talking? And how large is your file?

Teklu67 commented 2 years ago

Thanks so much for the quick response. It finished sampling 30x from a fq of 690 Gb (60x coverage) in 2 days. Because I have the resources to run several threads, I thought it would finish much faster if there were an option for multi-threading. Thanks!

mbhall88 commented 2 years ago

Wow, that's a very big fastq file! Is it compressed (e.g., gzip)?

How did you install rasusa?

Teklu67 commented 2 years ago

Yes, it is for tetraploid wheat and is compressed in .gz format. I installed it through conda.

mbhall88 commented 2 years ago

Is your data Illumina?

Sorry, there's not really much I can offer in the way of speeding rasusa up.

At some point I will look into whether multi-threading the IO is possible (i.e. batching reads).

I'll leave this open and add it to my list of things to investigate in the coming months. Sorry I can't do it sooner, but I have a lot of other research projects I am trying to juggle.

However, if you (or anyone else) would like to have a go at it, I would be very happy to receive a pull request.

Teklu67 commented 2 years ago

It is ONT data. That is OK, thank you for your time.

mbhall88 commented 2 years ago

In the meantime, I would suggest trying to split the file into subsets and then randomly subsampling each subset.
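
As a rough illustration of the split-and-subsample idea, here is a minimal Python sketch. It does not call rasusa; the function names (`read_fastq`, `split_round_robin`, `subsample_to_bases`, `subsample_in_shards`) are hypothetical, and it assumes plain 4-line FASTQ records that fit in memory. For a file of this size you would instead stream the reads into shard files on disk and subsample each shard in its own process, giving each shard a proportionally smaller target.

```python
import random
from typing import Iterable, Iterator, List, Tuple

# A FASTQ record as its four lines: header, sequence, separator, qualities.
Record = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records (assumes no wrapped sequence lines)."""
    it = iter(lines)
    # zip over the same iterator four times groups lines into records.
    for header, seq, plus, qual in zip(it, it, it, it):
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               plus.rstrip("\n"), qual.rstrip("\n"))


def split_round_robin(records: Iterable[Record], n_shards: int) -> List[List[Record]]:
    """Deal records into n_shards lists, one record at a time."""
    shards: List[List[Record]] = [[] for _ in range(n_shards)]
    for i, rec in enumerate(records):
        shards[i % n_shards].append(rec)
    return shards


def subsample_to_bases(records: List[Record], target_bases: int, seed: int) -> List[Record]:
    """Shuffle records, then keep reads until target_bases is reached."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    kept: List[Record] = []
    total = 0
    for rec in shuffled:
        if total >= target_bases:
            break
        kept.append(rec)
        total += len(rec[1])
    return kept


def subsample_in_shards(records: Iterable[Record], target_bases: int,
                        n_shards: int, seed: int = 0) -> List[Record]:
    """Split into shards and subsample each to an equal share of the target."""
    shards = split_round_robin(records, n_shards)
    per_shard = -(-target_bases // n_shards)  # ceiling division
    out: List[Record] = []
    for i, shard in enumerate(shards):
        # Each shard is independent, so this loop could run in parallel.
        out.extend(subsample_to_bases(shard, per_shard, seed + i))
    return out
```

Because each shard is processed independently, the per-shard step is the part that can be spread across cores or jobs. Dealing reads out round-robin keeps the shards similar in size and content. The result is not identical to a single random subsample of the whole file, since every shard contributes roughly the same number of bases.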