Closed by rohitfarmer 6 months ago
Hi @rohitfarmer, I apologize for the delays in replies here and the PRs. Could you describe further what your inputs were?
To filter the FASTQs, you could map the sample data against the masked database with bowtie2, then use samtools to post-process the .sam output and obtain FASTQ for the reads that align. The parameters we typically use for short-read metagenomic data are derived from SHOGUN and can be found here
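As a rough illustration of that map-then-extract step, here is a minimal Python sketch that builds and runs the two commands. The index prefix, file names, and thread count are hypothetical. The bowtie2 options are generic placeholders, not the SHOGUN-derived parameters referred to above, so substitute those before real use. It also assumes single-end input (`-U`); paired-end data would need `-1`/`-2` instead.

```python
import shlex
import subprocess
from pathlib import Path


def build_filter_commands(index_prefix, fastq_in, sam_out, threads=16):
    """Return (bowtie2_cmd, samtools_cmd) as argument lists.

    All names passed in are illustrative; the alignment options are
    placeholders, not the SHOGUN-derived settings.
    """
    bowtie2_cmd = [
        "bowtie2",
        "-p", str(threads),       # number of alignment threads
        "-x", str(index_prefix),  # prefix of the bowtie2 index (masked database)
        "-U", str(fastq_in),      # unpaired reads (assumption: single-end input)
        "--no-unal",              # do not write unaligned reads to the SAM
        "-S", str(sam_out),       # SAM output path
    ]
    # -F 0x904 excludes unmapped, secondary and supplementary records,
    # so each aligned read is emitted once.
    samtools_cmd = ["samtools", "fastq", "-F", "0x904", str(sam_out)]
    return bowtie2_cmd, samtools_cmd


def run_filter(index_prefix, fastq_in, fastq_out, threads=16):
    """Align reads, then write the aligned ones to fastq_out."""
    sam_out = Path(fastq_out).with_suffix(".sam")
    bowtie2_cmd, samtools_cmd = build_filter_commands(
        index_prefix, fastq_in, sam_out, threads
    )
    subprocess.run(bowtie2_cmd, check=True)
    # samtools fastq writes to stdout when no output file is given.
    with open(fastq_out, "w") as handle:
        subprocess.run(samtools_cmd, check=True, stdout=handle)


if __name__ == "__main__":
    # Print the commands for inspection instead of running them.
    b_cmd, s_cmd = build_filter_commands(
        "masked-db", "sample_R1.fastq.gz", "sample_R1.sam"
    )
    print(shlex.join(b_cmd))
    print(shlex.join(s_cmd))
```

Calling `run_filter("masked-db", "sample_R1.fastq.gz", "sample_R1.aligned.fastq")` would run both steps, provided bowtie2 and samtools are on the `PATH`.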
Thank you, @wasade. We are working on it the way you suggested.
Thanks!
Hi there, so I implemented the pull request https://github.com/wasade/exhaustive/pull/3#issuecomment-1987242528 and ran the code on my dataset. It was successful, albeit very slow. I couldn't run it as a batch submission, but I managed to run it on an interactive node with 16 CPU cores.
In the output, I have a .masked.fna file for each read, a .contaminated.fna file for each read, and a folder "40007-D_RNA_S37_R1_001-bt2" containing files such as "40007-D_RNA_S37_R1_001-bt2.1.bt2, 40007-D_RNA_S37_R1_001-bt2.3.bt2, 40007-D_RNA_S37_R1_001-bt2.rev.1.bt2, 40007-D_RNA_S37_R1_001-bt2.2.bt2, 40007-D_RNA_S37_R1_001-bt2.4.bt2, 40007-D_RNA_S37_R1_001-bt2.rev.2.bt2".
As I understand it, the masked files contain reads with NNs, the contaminated files contain the reads that were masked with NNs, and the *-bt2 folder contains the bowtie2 index.
Now, for the subsequent analysis using CZID, I need FASTQ files. Using the elements from the output, how can I remove contaminated reads from my original FASTQ dataset? Any suggestions would be helpful. Thanks!
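One possible approach is sketched below. It rests on an assumption not confirmed anywhere in this thread: that the header IDs in a .contaminated.fna file match the read IDs in the original FASTQ. If they do, the contaminated IDs can be collected and the matching records dropped. The function names and file paths are hypothetical, and the parser assumes plain four-line FASTQ records.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed file in text mode."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def fasta_ids(fasta_path):
    """Collect the first whitespace-delimited token of every FASTA header."""
    ids = set()
    with open_text(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def drop_reads(fastq_in, fastq_out, ids_to_drop):
    """Copy fastq_in to fastq_out, skipping records whose ID is in ids_to_drop.

    Assumes four lines per record (no wrapped sequence lines).
    Returns (kept, dropped) record counts.
    """
    kept = dropped = 0
    with open_text(fastq_in) as src, open(fastq_out, "w") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0].strip():
                break  # end of file or trailing blank line
            # Header looks like "@READ_ID optional description"
            read_id = record[0][1:].split()[0]
            if read_id in ids_to_drop:
                dropped += 1
            else:
                dst.writelines(record)
                kept += 1
    return kept, dropped
```

For example, `drop_reads("sample_R1.fastq.gz", "sample_R1.clean.fastq", fasta_ids("sample_R1.contaminated.fna"))` would write the retained reads and return the counts. For paired-end data, the union of contaminated IDs from both mates would probably need to be applied to each file so the pairs stay in sync.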