onecodex / needletail

Fast FASTX parsing and k-mer methods in Rust
MIT License

parallel parsing a lot of fasta files? #62

Closed jianshu93 closed 1 year ago

jianshu93 commented 1 year ago

Hello needletail team,

I have more than 1 million fasta files to parse, each about 3-5 MB, about 3 TB in total. I am wondering how I can read this huge number of files in parallel using all the CPU cores I have. It seems there is no such tool available right now. Any suggestions?

Thanks,

Jianshu

Keats commented 1 year ago

Use https://github.com/rayon-rs/rayon to split the work; that's what we do.

jianshu93 commented 1 year ago

Is that the finch crate?

Jianshu

jianshu93 commented 1 year ago

Hello Keats,

Can you please give an example of how you would do it? My thinking is that I create an iterator over the file paths of the fasta files (dirwalk or something) and use into_par_iter() combined with the following parse command:

let mut reader = needletail::parse_fastx_file(&pathb).expect("expecting valid filename");

to allow parallel parsing.

Thanks,

Jianshu

Keats commented 1 year ago

My thinking is that I create an iterator over the file paths of the fasta files (dirwalk or something) and use into_par_iter() combined with the following parse command:

Yep, exactly that. It's pretty much the default example (https://github.com/onecodex/needletail/blob/master/src/lib.rs#L15-L39) except you would wrap something like this around it:

use rayon::prelude::*;

let results: Vec<_> = files.par_iter().map(|f| {
    // Here you can put the snippet from the example.
    // I'm returning a vec in this example; if you don't need to return anything,
    // you can use for_each instead of map.
}).collect();

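Putting it together, a rough (untested) sketch could look like the following. It assumes rayon and needletail as dependencies and a hypothetical flat directory fasta_dir holding the files (use something like the walkdir crate if they are nested), and it just counts bases per file as in the linked example; replace the closure body with whatever per-file work you actually need.

use needletail::parse_fastx_file;
use rayon::prelude::*;
use std::fs;
use std::path::PathBuf;

fn main() {
    // Collect the paths up front; fs::read_dir only covers a flat directory.
    let files: Vec<PathBuf> = fs::read_dir("fasta_dir")
        .expect("readable directory")
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|p| p.extension().map_or(false, |ext| ext == "fasta" || ext == "fa"))
        .collect();

    // rayon's work-stealing pool opens one needletail reader per file;
    // each closure returns the base count for that file.
    let base_counts: Vec<u64> = files
        .par_iter()
        .map(|path| {
            let mut reader = parse_fastx_file(path).expect("valid fasta file");
            let mut n_bases = 0u64;
            while let Some(record) = reader.next() {
                let seqrec = record.expect("invalid record");
                n_bases += seqrec.num_bases() as u64;
            }
            n_bases
        })
        .collect();

    println!("total bases: {}", base_counts.iter().sum::<u64>());
}

rayon defaults to one worker thread per logical core, so this should keep all the CPUs busy without any extra configuration.
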
jianshu93 commented 1 year ago

Thanks!

This is very helpful!

Jianshu