khyox / recentrifuge

Recentrifuge: robust comparative analysis and contamination removal for metagenomics
http://www.recentrifuge.org

rextract: zipped fastq files #30

Closed: fancyge closed this issue 3 years ago

fancyge commented 3 years ago

Hi, is it possible to add an option for taking zipped fastq files and outputting zipped files as well? This is useful when the data is too large. Thanks!!

khyox commented 3 years ago

Hi @fancyge, thanks for the enhancement suggestion. For processing Kraken output you already have the option of using compressed files (see Kraken: use of compressed files). Which output files are you thinking of? Thanks.

fancyge commented 3 years ago

Hi, thanks a lot for the quick response! I already have the centrifuge classification files and want to run rextract to filter the reads. I noticed that rextract can only take a plain fastq file and fails on a "fastq.gz" file. I prefer "fastq.gz" to save storage space.

Thank you.

khyox commented 3 years ago

OK, I see, it is about rextract. Issue #28 is also an enhancement suggestion, so I will probably address both of them soon.

About the -i option, you have to prepend that flag to each taxid you want to include. In your case:

rextract -f my_classification_results.txt multiple -i 9606 -i 452 -1 R1.fq -2 R2.fq

fancyge commented 3 years ago

Thanks! Hope to see it soon! Quite a useful tool! By the way, do you have an idea what score (-y) I should feed to rextract for filtering centrifuge results? I have 150 bp fastq files from Illumina sequencing.

fancyge commented 3 years ago

I just added some commands to make it possible to take and output zipped fastq files. It might not be ideal, but it can be used for the moment. Thanks.

rextract.py.zip
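
For readers without the attachment, here is a minimal sketch of one way to read and write both plain and gzipped FASTQ in Python. It is not the attached patch and not rextract's actual code: `open_fastq` and `copy_fastq` are hypothetical helper names, and detecting compression from the `.gz` suffix is an assumption.

```python
import gzip
from pathlib import Path


def open_fastq(path, mode="rt"):
    """Open a FASTQ file in text mode, using gzip if the name ends in .gz.

    Hypothetical helper: compression is inferred from the suffix only.
    """
    if Path(path).suffix == ".gz":
        return gzip.open(path, mode)
    return open(path, mode)


def copy_fastq(src, dst):
    """Copy 4-line FASTQ records from src to dst; return the record count.

    Either file may be plain or gzipped, independently of the other.
    """
    count = 0
    with open_fastq(src, "rt") as fin, open_fastq(dst, "wt") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:  # empty string means end of file
                break
            fout.writelines(record)
            count += 1
    return count
```

A real filter would decide per record whether to write it; the point here is only that the same loop works for `R1.fq` and `R1.fq.gz` once the file is opened through one helper.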

khyox commented 3 years ago

> By the way, do you have an idea what score (-y) I should feed to rextract for filtering centrifuge results? I have 150 bp fastq files from Illumina sequencing.

My recommendation is to avoid setting minscore (also the -y flag) too low, so that sequences with low scores are filtered out. Also, if you have control samples, you may want to lower ctrlminscore (also the -z flag) so that more sequences are kept in the controls and thus more sequences are removed by the robust control removal algorithm. So, --minscore 35 and --ctrlminscore 25 could be good values to start with.
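
To illustrate the effect of the two thresholds, here is a toy sketch in Python. It is not Recentrifuge code: `filter_by_score` is a hypothetical function, the reads and scores are made up, treating the threshold as inclusive is an assumption, and how the score is derived from the Centrifuge output is not shown.

```python
def filter_by_score(reads, minscore):
    """Keep (read_id, score) pairs whose score is at least minscore.

    Hypothetical helper; the inclusive comparison is an assumption.
    """
    return [(read_id, score) for read_id, score in reads if score >= minscore]


MINSCORE = 35      # suggested starting value for regular samples
CTRLMINSCORE = 25  # suggested starting value for control samples

# Made-up reads, used here for both a sample and a control
reads = [("r1", 20.0), ("r2", 30.0), ("r3", 40.0)]

sample_kept = filter_by_score(reads, MINSCORE)
control_kept = filter_by_score(reads, CTRLMINSCORE)
```

With the lower control threshold, `r2` survives in the control but not in the sample, which is why lowering ctrlminscore leaves more sequences available for the control removal step.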

khyox commented 3 years ago

> I just added some commands to make it possible to take and output zipped fastq files. It might not be ideal, but it can be used for the moment. Thanks.
>
> rextract.py.zip

Thanks! If you open a PR, I'd be happy to check it and include it in the master branch.