Kirk3gaard opened this issue 4 years ago
This definitely is something that I need to figure out. Originally, all of the scripts were quickly written in Python, which isn't as conducive to multithreading as C/C++, but I will look and see if there is an easy solution.
Hi,
I stumbled across the same issue, and maybe you could also just build a lightweight solution. Since the costly part is extracting the reads that map to the precomputed list of IDs, I slightly modified the script to simply store the IDs in a .txt file (or two files in the case of paired-end reads), which I then use as input for seqtk subseq or seqkit grep to perform the read extraction.
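Roughly like this, with placeholder file names (these are just the basic invocations of those tools, not exactly what my script writes):
# read_ids.txt holds one read ID per line, as written out by the modified script
seqtk subseq classified_1.fastq.gz read_ids.txt > subset_1.fastq
# or the seqkit equivalent:
seqkit grep -f read_ids.txt classified_1.fastq.gz -o subset_1.fastq.gz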
Since I use this in a workflow, doing it in two separate steps is fine for me. But you could also just call one of these tools in a subprocess and make it a (perhaps optional) requirement of the user's environment.
Just an idea...
Best, Sandro
Hi @andreott, could you possibly share your solution with me? Kind regards, Morten
There's a better tool (IMO), seqkit, that will perform a similar function. Create a file containing the taxon IDs you want, e.g. taxons.txt:
464095
12058
12059
138948
You can extract the matching reads using:
seqkit grep -r -f taxons.txt --threads 4 classified_1.fastq.gz > taxons.fastq.gz
More details: https://bioinf.shenwei.me/seqkit/usage/#grep
Getting the taxon list can be automated with TaxonKit (same author as seqkit): https://bioinf.shenwei.me/taxonkit/download/
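For example, to pull a couple of clades plus all of their descendant taxids into taxons.txt (this assumes the NCBI taxdump files have already been set up for taxonkit, e.g. in ~/.taxonkit, and the IDs are just the ones from the example above):
# list the given taxids and all of their descendants; grep . drops blank lines so seqkit gets one taxid per line
taxonkit list --ids 12058,138948 --indent "" | grep . > taxons.txt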
I'm a fan of seqkit, but unfortunately it isn't sufficient if you want to also include sequences from the parent clade levels, which is something this tool's features should cover (please fix it!)
Hi
Thanks for a great tool. It's great to be able to process the output files of Kraken without knowing all of the taxIDs and how they are linked together.
Have you considered making extract_kraken_reads support multi-threading to run faster on computers with multiple CPUs? It seems that it uses only one CPU.
I naively just split my FASTQ files to run them through with GNU parallel, but apparently loading the database once for each parallel process quickly maxed out the 1 TB of RAM in the computer I was using.
Best regards Rasmus
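For what it's worth, the two-step workaround suggested above should also parallelise cheaply, since each job only has to read the small ID list rather than anything database-sized. A rough sketch with hypothetical file names:
# read_ids.txt was written once up front; the FASTQ was split into chunk_*.fastq.gz beforehand
parallel 'seqtk subseq {} read_ids.txt > subset_{#}.fastq' ::: chunk_*.fastq.gz
cat subset_*.fastq > taxon_reads.fastq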