metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License

single_end reads are left alone after trimming #20

Closed SilasK closed 6 years ago

SilasK commented 6 years ago

https://github.com/pnnl/atlas/blob/ea2bbf94e420f6aa125815d29f3f0ab97ec6c790/atlas/rules/assemble.snakefile#L59

When processing paired-end reads, some may be left as singletons. They are stored in the file {sample}/sequence_quality_control/{sample}_00_se.fastq.gz but are not taken up by the following rules.

SilasK commented 6 years ago

I would recommend implementing read joining after bbduk2.sh, then concatenating joined reads, non-joined reads, and quality-controlled orphans. The remainder of the protocol would then just be single-end processing.

I don't think merging read pairs is always a good solution. For example, the BBMerge user guide says:

When not to use: If you run BBMerge, and under, say, 15% of the reads merge, even at very loose stringency, it’s probably a waste of time to merge – you’ll just make the workflow more complicated, and possibly get a lot of false-positives.

So I think merging read pairs should be optional, e.g. controlled by a flag in the config file.
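Such a flag might look like this in the config file (the key name is hypothetical, not part of the current atlas config schema):

```yaml
# hypothetical option; off by default, since merging is not always beneficial
merge_pairs: false
```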

I fear there are always read pairs which don't merge and, depending on the proportion, it would be a waste to lose the linkage information. Is there no other possibility to keep the un-merged paired-end reads, the merged reads and the singletons in one interleaved file?
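A sketch of the idea (filenames are hypothetical, and the tiny records here are just placeholders): since gzip streams can simply be concatenated, merged pairs, unmerged pairs (interleaved) and orphans could live in a single file, as long as downstream tools tolerate the mix.

```shell
# Create three toy read sets: merged pairs, unmerged interleaved pairs, orphans.
printf '@merged1\nACGTACGT\n+\nIIIIIIII\n' | gzip > merged.fastq.gz
printf '@pair1 1:N:0:X\nACGT\n+\nIIII\n@pair1 2:N:0:X\nTGCA\n+\nIIII\n' | gzip > unmerged_il.fastq.gz
printf '@orphan1\nACGT\n+\nIIII\n' | gzip > orphans.fastq.gz

# gzip members concatenate into one valid gzip file.
cat merged.fastq.gz unmerged_il.fastq.gz orphans.fastq.gz > all_reads.fastq.gz

gunzip -c all_reads.fastq.gz | wc -l   # 16 lines = 4 FASTQ records
```

The open question in this thread is whether tools like bbmap accept such a mixed file, since interleaved mode usually assumes every record has a mate.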

SilasK commented 6 years ago

At which stage should we do paired-end reads merging?

BBmerge can do trimming, error correction, read extension and merging of paired-end reads.

bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt extend2=20 iterations=10 k=31 ecct qtrim2=r trimq=12 strict

BBMerge will then attempt to merge each read pair. If unsuccessful, it will quality-trim the right end of each read to Q12 and try again (qtrim2=r trimq=12). If still unsuccessful, it will try to extend the reads by up to 20 bp on the right end only, and try merging again, up to 10 times (extend2=20 iterations=10). This allows up to 200 bp extension for each read, so that 2x250 reads can still merge even with an insert size approaching 900 bp, near the limit of Illumina bridge amplification. I recommend this over extending first and then merging. (from here)

There may be advantages in doing merging and error correction in one go. But maybe it's safer to do the trimming first and then error correction + merging, using the command recommended for optimal accuracy:

bbmerge-auto.sh in=reads.fq out=merged.fq adapter1=something adapter2=something rem k=62 extend2=50 ecct

SilasK commented 6 years ago

I ran the pipeline with this modification on paired-end reads (#21). Here are the first lines of the output fastq, which are interleaved:

@HISEQ:171:HK53CBCXX:1:1101:4707:2337 1:N:0:TAATGCGCTATAGCCT
GGTAGCGGTTATCCTTAACCAGCTCCCGCAGCCGTTTCTCCGCTTTCGACCAGCTCAAAACAAGGTCGTGTTCGGTGGAAATGCCGTCACGGTTAATGGAGATGCCCTTGCCGCTGTACCACACATACCCCTCCGTGCCGTCTGAAAAACCAACGAAAAAGCCGCCCGTGCCGTATTCCCTTTTCAGAAAATCAATGTTCTTCTTTTTATCTTCCTGCTTC
+
DDDDDHHHIIHIIIIIIIIIIIIIIIIIIIIGHIIIIIIIHIIIIIIIIIHHIIIGIHIIIIIIIHHIIIHIIIIIIIHHIIIIIIIIIIIIHIIIIIICGDGEHHIIIIIIIHIIIHIIIIIIIIIIIIIIIHIIIIIHHHIHHIIIIIIIIIIIIIHGCHHIIHHIHIIHHIIIIIHHIGHHHIIIIHHH@HHIHHHIHIIHHIIIIHHIIGIIEHIHH
@HISEQ:171:HK53CBCXX:1:1101:4707:2337 2:N:0:TAATGCGCTATAGCCT
GAAGCAGGAAGATAAAAAGAAGAACATTGATTTTCTGAAAAGGGAATACGGCACGGGCGGCTTTTTCGTTGGTTTTTCAGACGGCACGGAGGGGTATGTGTGGTACAGCGGCAAGGGCATCTCCATTAACCGTGACGGCATTTCCACCGAACACGACCTTGTTTTGAGCTGGTCGAAAGCGGAGAAACGGCTGCGGGAGCTGGTTAAGGATAACCGCTACC
+
DDDDDHIIHFIHIHIHEHIHIIHIIIIIIIIIGIIIIIIIIIIHIIIIIIIIIIIIGGIIIIIGHIIHHGHIIHHHHIHHIIHHIIIIIIIHHIHHIHIHIFDHHHHIHIIIIHIHGIIIIIIIIGHHIIIIIGIIHHIHDEHHGHIHHHIGHHHIIHIHIHFHIIIGIIIHGEHHEHHIHHIHI?GHIGHHGIIHHIHHIIHFEHFEHCHG-FFD,CHHI
@HISEQ:171:HK53CBCXX:1:1101:4900:2251 1:N:0:TAATGCGCTATAGCCT
TCACATAAAACCAGAGACACTGAAACTTATAGAGGAGAAAGTGGGGAAAAGCCTTGAAGATATGGGCACAGGGGAAAAATTCCTGAACAGAACAGCAATGGCTTGTGTTGTAAAATCTAGAGAATTGACAAATGGGACCTCATAAAATTGCAAAGCTTCTGTAAGGCAAAAGACACTGATAATAAGACAAAATGCCACCAACAGATTTGGAAAGGATATGTACCTATCCTAAATCAGATAGGGGACAAAT

and the last lines, which are single-end reads:

@HISEQ:171:HK53CBCXX:2:2216:1405:101103 2:N:0:TAATGCGCTATAGCCT
TTTGTTGAACATGGGAATGTTAAGCAGTGCTGGGTATGGTGAGATAAGTAATTTGTTGGTTACACCCATAAGATTTGTGCCACCATTGCATTAGCATTAGCATACCTTGAGGGTGGTACACAAGTAGATAAAGAGGTTTTGGATGGTTTTCTGTTTAGGTTTTTATTTTTCTA
+
<DD<<<<DCCHEG1GEC@@EGFF1C?@@D1<<CGC11<1<1<1<<<<<FF<<<FEFGFEEHC11<?0@FH?@1<<11<1@CFHCH<1<C11<<111<<<1F<<11D1G1<FE0<C<CE1D@FHD1111<GEC<<11<<<C1/<11CDD11111<<<11<1CGC<0<<F10<<@
@HISEQ:171:HK53CBCXX:2:2216:4923:101142 1:N:0:TAATGCGCTATAGCCT
GCTGTTACCTGGATCTTGTTTGTCACGGAGGCAAACAAGGGTGGGATTAAGAGTCACACGTTACAGCCAGGGGAGTGATGATCACTTCCTGCTTTGCTATTTCAATCTGACAAGAACTACAGTAAATCGGTTTTTATGACTAGAGACCCCTTCCTCGGCCATCCTGTTTAACTATTTATTTAGCCCTTGTACCTCTCTACAATTAGAAAGCCAAGTAATCACACAAACACAGATACACCACAGACACTCA
+
DDDCDGHIHHHFHIIIIIIGIHIIHHHHIHHHGIIIIIEHIIGGIIIIIIIIGHHHHIIIHIIIIHIIIHIIIIHHHIIIIIIIHIIHHIIIHIIGIHIHIHHGIGIGEHIIIIGHHHIIIHIIIIIIGIHIIGHIHFFHHIIIICHHHIIHHIIIHFHHGIIIIIHIIHIIIIIIIIHIII@H?GHHHHHHHHIGHIGGEHHHFEHIIHHIIIHHFHHHHIHHEHHAGHCG?7GHHIH?GHHIHHIIHI

They were processed as paired-end reads in the decontamination step; see F33_decontamination.log.

So I didn't get the error in bbmap you described. @brwnj, do you think it's a possibility to have all reads in one file: paired (interleaved) plus single-end reads?
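One could sanity-check such a mixed file by pairing records on the instrument coordinates in the header (the field before the space), as in the excerpts above. This is a hypothetical helper, not part of the pipeline; the toy headers mimic the output shown:

```shell
# Two mated records plus one orphan, in the header style shown above.
printf '@HISEQ:1:1101:4707:2337 1:N:0:X\nACGT\n+\nIIII\n@HISEQ:1:1101:4707:2337 2:N:0:X\nTGCA\n+\nIIII\n@HISEQ:2:2216:1405:101103 2:N:0:X\nGGCC\n+\nIIII\n' > mixed.fastq

# Take every header line, group by read name, count names seen twice.
awk 'NR % 4 == 1 {print $1}' mixed.fastq \
  | sort | uniq -c \
  | awk '$1 == 2 {pairs++} END {print pairs+0, "pairs"}'   # prints: 1 pairs
```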

SilasK commented 6 years ago

From a mail: I've been thinking about this a bit and think these are the changes that would encompass the main issues:

SilasK commented 6 years ago

Trimming: Adapter-trimming reads is fine and is recommended prior to running BBDuk, particularly if kmers will be used. Quality-trimming is usually not recommended, unless the library has major quality issues at the right end of reads resulting in a low merge rate, in which case only weak trimming (such as qtrim=r trimq=8) should be used.

From the BBMerge user guide.

It seems that no, or only light, quality trimming should be done before read merging.

SilasK commented 6 years ago

Solved in #27