Kirk3gaard opened this issue 4 years ago
This definitely is something that I need to figure out. Originally, all of the scripts were quickly written in Python, which isn't as conducive to multithreading as C/C++, but I will look and see if there is an easy solution.
Hi,
I stumbled across the same issue, and maybe you could also just build a lightweight solution. Since the costly part is extracting the reads that map to the precomputed list of IDs, I slightly modified the script to simply store the IDs in a .txt file (or two files in the case of paired-end reads), which I then use as input for seqtk subseq or seqkit grep to perform the read extraction.
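Roughly like this, with placeholder file names (these are just the basic invocations of those tools, not exactly what my script writes):
# read_ids.txt holds one read ID per line, as written out by the modified script
seqtk subseq classified_1.fastq.gz read_ids.txt > subset_1.fastq
# or the seqkit equivalent:
seqkit grep -f read_ids.txt classified_1.fastq.gz -o subset_1.fastq.gz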
Since I use this in a workflow, doing it in two separate steps is fine for me. But you could also just call one of these tools in a subprocess and make it a (perhaps optional) requirement of the user's environment.
Just an idea...
Best, Sandro
Hi @andreott, could you possibly share your solution with me? Kind regards, Morten
There's a better tool (IMO), seqkit, that will perform a similar function. Create a file containing the taxon IDs you want, e.g. taxons.txt:
464095
12058
12059
138948
You can extract the matching reads using:
seqkit grep -r -f taxons.txt --threads 4 classified_1.fastq.gz > taxons.fastq.gz
More details: https://bioinf.shenwei.me/seqkit/usage/#grep
Getting the taxon list can be automated with TaxonKit (same author as seqkit): https://bioinf.shenwei.me/taxonkit/download/
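For example, to pull a couple of clades plus all of their descendant taxids into taxons.txt (this assumes the NCBI taxdump files have already been set up for taxonkit, e.g. in ~/.taxonkit, and the IDs are just the ones from the example above):
# list the given taxids and all of their descendants; grep . drops blank lines so seqkit gets one taxid per line
taxonkit list --ids 12058,138948 --indent "" | grep . > taxons.txt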
I'm a fan of seqkit, but unfortunately it isn't sufficient if you want to also include sequences from the parent clade levels, which is something this tool's features should cover (please fix it!)
Hi
Thanks for a great tool. It's great to be able to process the output files of Kraken without knowing all of the taxIDs and how they are linked together.
Have you considered making extract_kraken_reads support multi-threading to run faster on computers with multiple CPUs? It seems that it uses only one CPU.
I naively just split my FASTQ files to run them through with GNU parallel, but apparently loading the database once for each parallel process quickly maxed out the 1 TB of RAM in the computer I was using.
Best regards Rasmus
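For what it's worth, the two-step workaround suggested above should also parallelise cheaply, since each job only has to read the small ID list rather than anything database-sized. A rough sketch with hypothetical file names:
# read_ids.txt was written once up front; the FASTQ was split into chunk_*.fastq.gz beforehand
parallel 'seqtk subseq {} read_ids.txt > subset_{#}.fastq' ::: chunk_*.fastq.gz
cat subset_*.fastq > taxon_reads.fastq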