How to filter contaminated reads from the original FASTQ data using the pipeline output?

rohitfarmer commented 6 months ago

Hi there. I implemented the pull request https://github.com/wasade/exhaustive/pull/3#issuecomment-1987242528 and ran the code on my dataset. It ran successfully, albeit very slowly. I couldn't run it as a batch submission, but I managed to run it on an interactive node with 16 CPU cores.

In the output, I have a .masked.fna file for each read, a .contaminated.fna file for each read, and a folder "40007-D_RNA_S37_R1_001-bt2" containing files such as "40007-D_RNA_S37_R1_001-bt2.1.bt2, 40007-D_RNA_S37_R1_001-bt2.3.bt2, 40007-D_RNA_S37_R1_001-bt2.rev.1.bt2, 40007-D_RNA_S37_R1_001-bt2.2.bt2, 40007-D_RNA_S37_R1_001-bt2.4.bt2, 40007-D_RNA_S37_R1_001-bt2.rev.2.bt2".

As I understand it, the masked files contain the reads with Ns, the contaminated files contain the reads that were masked with Ns, and the *-bt2 folder contains the bowtie2 index.

Now, for the subsequent analysis using CZID, I need FASTQ files. Using the elements from the output, how can I remove contaminated reads from my original FASTQ dataset? Any suggestions would be helpful. Thanks!

wasade commented 6 months ago

Hi @rohitfarmer, I apologize for the delayed replies here and on the PRs. Could you describe further what your inputs were?

To filter FASTQs, you could map the sample data against the masked database with bowtie2, and use samtools to post-process the .sam output to obtain FASTQ for the reads that align. The parameters we typically use for short-read metagenomic data are derived from SHOGUN and can be found here
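
A minimal sketch of that workflow, written as a small Python wrapper that assembles the bowtie2 and samtools command lines. Everything specific here is an assumption for illustration: the file names, the index prefix, single-end input (`-U`), and bowtie2's default alignment settings rather than the SHOGUN-derived parameters mentioned above. It also assumes the masked FASTA needs its own index built with `bowtie2-build`. The `keep_aligned` switch is hypothetical: it selects reads that align (`-F 0x904`, which also drops secondary and supplementary records) or, if set to `False`, reads that do not align (`-f 4`).

```python
import shlex
import subprocess


def build_commands(masked_fna, fastq, index_prefix, sam_path,
                   threads=16, keep_aligned=True):
    """Return (index_cmd, map_cmd, fastq_cmd) as argument lists.

    Nothing is executed here; all paths are hypothetical placeholders.
    """
    # Build a bowtie2 index from the masked FASTA.
    index_cmd = ["bowtie2-build", "--threads", str(threads),
                 str(masked_fna), str(index_prefix)]
    # Map single-end reads against that index and write SAM output.
    # Assumption: bowtie2 defaults; substitute your preferred parameters.
    map_cmd = ["bowtie2", "-p", str(threads), "-x", str(index_prefix),
               "-U", str(fastq), "-S", str(sam_path)]
    # -F 0x904 excludes unmapped, secondary and supplementary records,
    # leaving one record per aligned read; -f 4 keeps only unmapped reads.
    flag = ["-F", "0x904"] if keep_aligned else ["-f", "4"]
    fastq_cmd = ["samtools", "fastq", *flag, str(sam_path)]
    return index_cmd, map_cmd, fastq_cmd


def run_filter(masked_fna, fastq, index_prefix, sam_path, out_fastq,
               threads=16, keep_aligned=True):
    """Run the three steps; requires bowtie2 and samtools on PATH."""
    index_cmd, map_cmd, fastq_cmd = build_commands(
        masked_fna, fastq, index_prefix, sam_path, threads, keep_aligned)
    subprocess.run(index_cmd, check=True)
    subprocess.run(map_cmd, check=True)
    # samtools fastq writes to stdout when no output option is given.
    with open(out_fastq, "w") as handle:
        subprocess.run(fastq_cmd, check=True, stdout=handle)


if __name__ == "__main__":
    # Print the commands for inspection instead of running them.
    for cmd in build_commands("sample.masked.fna", "sample.fastq",
                              "masked-bt2/masked", "sample.sam"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check them against your own paths and parameters before calling `run_filter`. Paired-end data would need `-1`/`-2` in place of `-U`, and the corresponding samtools output options.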

rohitfarmer commented 6 months ago

Thank you, @wasade. We are working on it the way you suggested.

wasade commented 6 months ago

Thanks!
