ncbi / sra-human-scrubber

An SRA tool that takes as input a local fastq file from a clinical infection sample, identifies and removes any significant human reads, and outputs the edited (cleaned) fastq file, which can safely be used for SRA submission.

Paired-end Reads #6

Closed: bioinfoMMS closed this issue 2 years ago

bioinfoMMS commented 3 years ago

How does the scrubber tool handle paired-end reads? It looks like it takes only a single fastq file as input. If it is given an interleaved fastq file containing forward and reverse reads, does it use the paired information when classifying the reads as human? Or is it better to stitch the reads together with a certain number of Ns before giving it to the scrubber tool? Thanks in advance!

multikengineer commented 2 years ago

Sorry for the tardy reply.

If it is given an interleaved fastq file containing forward and reverse reads, does it use the paired information when classifying the reads as human?

Yes, it should handle a single interleaved file of paired reads without problem and will remove both pairs if one is found to be human.

is it better to stitch the reads together with a certain number of Ns before giving it to the scrubber tool?

No.
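
For readers whose paired-end data is split across two files, the following is a minimal sketch of how a single interleaved fastq file could be produced before scrubbing. It is not part of the scrubber; the function names `read_fastq` and `interleave` are hypothetical, and it assumes plain-text, four-line fastq records with R1 and R2 listed in the same order.

```python
from itertools import islice, zip_longest


def read_fastq(handle):
    """Yield one fastq record at a time as a tuple of four newline-terminated lines."""
    while True:
        record = tuple(line.rstrip("\r\n") + "\n" for line in islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated fastq record")
        yield record


def interleave(r1_handle, r2_handle, out_handle):
    """Write R1 and R2 records alternately and return the number of pairs written.

    Assumption: both inputs list their reads in the same order, so the n-th
    record of R1 is the mate of the n-th record of R2.
    """
    pairs = 0
    for rec1, rec2 in zip_longest(read_fastq(r1_handle), read_fastq(r2_handle)):
        if rec1 is None or rec2 is None:
            raise ValueError("R1 and R2 contain different numbers of reads")
        out_handle.writelines(rec1)
        out_handle.writelines(rec2)
        pairs += 1
    return pairs
```

It could be called with three open text-mode file handles (two inputs and one output); the resulting file is then what gets passed to the scrubber.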

multikengineer commented 2 years ago

@bioinfoMMS Have you encountered any problems? Just checking.

bioinfoMMS commented 2 years ago

@multikengineer Thanks for the response and check-in! No problems so far; the tool seems to be working just fine on the interleaved fastq file.

multikengineer commented 2 years ago

Thank you @bioinfoMMS, and again apologies for my previous tardy response.

mbhall88 commented 1 year ago

Yes, it should handle a single interleaved file of paired reads without problem and will remove both pairs if one is found to be human.

According to #23 it seems it doesn't remove both pairs?

mikelchtermans commented 1 year ago

Yes, it should handle a single interleaved file of paired reads without problem and will remove both pairs if one is found to be human.

According to #23 it seems it doesn't remove both pairs?

Hi, #23 was written by a colleague; this issue was indeed present in version 2.0.0. Our workaround was to pipe the output to the tool fastqtk with the command 'fastqtk drop-se', which in turn contained a bug, for which I have just created a pull request: https://github.com/ndaniel/fastqtk/pull/6. We have not tested 2.1.0, since the issue was not marked as resolved.
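
As an illustration of what such a post-filter does, here is a minimal Python sketch that drops unpaired reads from an interleaved fastq file. It is not the fastqtk implementation; the names `pair_key` and `drop_orphans` are hypothetical, and it assumes that mates are adjacent in the file and that their read names are identical apart from an optional /1 or /2 suffix.

```python
from itertools import islice


def read_fastq(handle):
    """Yield one fastq record at a time as a tuple of four newline-terminated lines."""
    while True:
        record = tuple(line.rstrip("\r\n") + "\n" for line in islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated fastq record")
        yield record


def pair_key(header):
    """Return the read name without the leading '@', any comment, or a /1 or /2 suffix."""
    fields = header[1:].split()
    name = fields[0] if fields else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def drop_orphans(in_handle, out_handle):
    """Copy only complete pairs to out_handle; return (pairs_kept, reads_dropped)."""
    kept_pairs = 0
    dropped = 0
    pending = None  # the most recent record still waiting for its mate
    for record in read_fastq(in_handle):
        if pending is not None and pair_key(pending[0]) == pair_key(record[0]):
            out_handle.writelines(pending)
            out_handle.writelines(record)
            kept_pairs += 1
            pending = None
        else:
            if pending is not None:
                dropped += 1  # the pending read never found its mate
            pending = record
    if pending is not None:
        dropped += 1
    return kept_pairs, dropped
```

Running the scrubbed output through a filter of this kind leaves only reads whose mate also survived, which is the behaviour described earlier in the thread.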

mbhall88 commented 1 year ago

In that case, I would suggest using seqfu to deinterleave the output: https://telatin.github.io/seqfu2/tools/deinterleave.html