Closed SilasK closed 6 years ago
I would recommend implementing read joining after bbduk2.sh then concatenate joined reads, non-joined reads, and quality controlled orphans. The remainder of the protocol would then just be single-end processing.
I'm don't think merging read pairs is always a good solution. e.g. on the BBmerge User guide they wrote:
When not to use: If you run BBMerge, and under, say, 15% of the reads merge, even at very loose stringency, it’s probably a waste of time to merge – you’ll just make the workflow more complicated, and possibly get a lot of false-positives.
So, I think merging read-pairs should be optional. e.g. by a flag in the config file.
I fear there are always read pairs which don't merge and, depending on the proportion, it would be a waste to loose the linkage information. Is there no other possibility to keep the un-merged paired-end, the merged and the singletons in one interleaved file?
At which stage should we do paired-end reads merging?
BBmerge can do trimming, error correction, read-extension and merging of paired end reads.
bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt extend2=20 iterations=10 k=31 ecct qtrim2=r trimq=12 strict
BBMerge will then attempt to merge each read pair. If unsuccessful, it will quality-trim the right end of each read to Q12, and try again (qtrim2=r trimq=12). If still unsuccessful, it will try to extend the reads by up to 20bp on the right end only, and try merging again, up to 10 times (extend2=20 iterations=10). This allows up to 200bp extension for each read, so that 2x250 reads can still merge even with an insert size approaching 900bp, near the limit of Illumina bridge-amplification. I recommend this over extending first then merging. (from here) There may be advantages in doing merging and error correction in one go. But may be it's saver to do the trimming first and then error-correction + merging using a recommended for optimal accuracy:
bbmerge-auto.sh in=reads.fq out=merged.fq adapter1=something adapter2=something rem k=62 extend2=50 ecct
I run the pipeline with this modification on paired end reads #21 : Here is the first lines of the output fastq, which are interleaved:
@HISEQ:171:HK53CBCXX:1:1101:4707:2337 1:N:0:TAATGCGCTATAGCCT
GGTAGCGGTTATCCTTAACCAGCTCCCGCAGCCGTTTCTCCGCTTTCGACCAGCTCAAAACAAGGTCGTGTTCGGTGGAAATGCCGTCACGGTTAATGGAGATGCCCTTGCCGCTGTACCACACATACCCCTCCGTGCCGTCTGAAAAACCAACGAAAAAGCCGCCCGTGCCGTATTCCCTTTTCAGAAAATCAATGTTCTTCTTTTTATCTTCCTGCTTC
+
DDDDDHHHIIHIIIIIIIIIIIIIIIIIIIIGHIIIIIIIHIIIIIIIIIHHIIIGIHIIIIIIIHHIIIHIIIIIIIHHIIIIIIIIIIIIHIIIIIICGDGEHHIIIIIIIHIIIHIIIIIIIIIIIIIIIHIIIIIHHHIHHIIIIIIIIIIIIIHGCHHIIHHIHIIHHIIIIIHHIGHHHIIIIHHH@HHIHHHIHIIHHIIIIHHIIGIIEHIHH
@HISEQ:171:HK53CBCXX:1:1101:4707:2337 2:N:0:TAATGCGCTATAGCCT
GAAGCAGGAAGATAAAAAGAAGAACATTGATTTTCTGAAAAGGGAATACGGCACGGGCGGCTTTTTCGTTGGTTTTTCAGACGGCACGGAGGGGTATGTGTGGTACAGCGGCAAGGGCATCTCCATTAACCGTGACGGCATTTCCACCGAACACGACCTTGTTTTGAGCTGGTCGAAAGCGGAGAAACGGCTGCGGGAGCTGGTTAAGGATAACCGCTACC
+
DDDDDHIIHFIHIHIHEHIHIIHIIIIIIIIIGIIIIIIIIIIHIIIIIIIIIIIIGGIIIIIGHIIHHGHIIHHHHIHHIIHHIIIIIIIHHIHHIHIHIFDHHHHIHIIIIHIHGIIIIIIIIGHHIIIIIGIIHHIHDEHHGHIHHHIGHHHIIHIHIHFHIIIGIIIHGEHHEHHIHHIHI?GHIGHHGIIHHIHHIIHFEHFEHCHG-FFD,CHHI
@HISEQ:171:HK53CBCXX:1:1101:4900:2251 1:N:0:TAATGCGCTATAGCCT
TCACATAAAACCAGAGACACTGAAACTTATAGAGGAGAAAGTGGGGAAAAGCCTTGAAGATATGGGCACAGGGGAAAAATTCCTGAACAGAACAGCAATGGCTTGTGTTGTAAAATCTAGAGAATTGACAAATGGGACCTCATAAAATTGCAAAGCTTCTGTAAGGCAAAAGACACTGATAATAAGACAAAATGCCACCAACAGATTTGGAAAGGATATGTACCTATCCTAAATCAGATAGGGGACAAAT
and the last lines, which are single ends:
@HISEQ:171:HK53CBCXX:2:2216:1405:101103 2:N:0:TAATGCGCTATAGCCT
TTTGTTGAACATGGGAATGTTAAGCAGTGCTGGGTATGGTGAGATAAGTAATTTGTTGGTTACACCCATAAGATTTGTGCCACCATTGCATTAGCATTAGCATACCTTGAGGGTGGTACACAAGTAGATAAAGAGGTTTTGGATGGTTTTCTGTTTAGGTTTTTATTTTTCTA
+
<DD<<<<DCCHEG1GEC@@EGFF1C?@@D1<<CGC11<1<1<1<<<<<FF<<<FEFGFEEHC11<?0@FH?@1<<11<1@CFHCH<1<C11<<111<<<1F<<11D1G1<FE0<C<CE1D@FHD1111<GEC<<11<<<C1/<11CDD11111<<<11<1CGC<0<<F10<<@
@HISEQ:171:HK53CBCXX:2:2216:4923:101142 1:N:0:TAATGCGCTATAGCCT
GCTGTTACCTGGATCTTGTTTGTCACGGAGGCAAACAAGGGTGGGATTAAGAGTCACACGTTACAGCCAGGGGAGTGATGATCACTTCCTGCTTTGCTATTTCAATCTGACAAGAACTACAGTAAATCGGTTTTTATGACTAGAGACCCCTTCCTCGGCCATCCTGTTTAACTATTTATTTAGCCCTTGTACCTCTCTACAATTAGAAAGCCAAGTAATCACACAAACACAGATACACCACAGACACTCA
+
DDDCDGHIHHHFHIIIIIIGIHIIHHHHIHHHGIIIIIEHIIGGIIIIIIIIGHHHHIIIHIIIIHIIIHIIIIHHHIIIIIIIHIIHHIIIHIIGIHIHIHHGIGIGEHIIIIGHHHIIIHIIIIIIGIHIIGHIHFFHHIIIICHHHIIHHIIIHFHHGIIIIIHIIHIIIIIIIIHIII@H?GHHHHHHHHIGHIGGEHHHFEHIIHHIIIHHFHHHHIHHEHHAGHCG?7GHHIH?GHHIHHIIHI
they were processed as paired end in the decontamination step, F33_decontamination.log:
So I didn't get the error in bbmap you described. @brwnj, do you think it's a possibility to have all reads in one file paired (interleaved) and single end reads
from: mail I've been thinking about this a bit and think these are the changes that would encompass the main issues:
Trimming: Adapter-trimming reads is fine and is recommended prior to running BBDuk, particularly if kmers will be used. Quality-trimming is usually not recommended, unless the library has major quality issues at the right end of reads resulting in a low merge rate, in which case only weak trimming (such as qtrim=r trimq=8) should be used.
from user guide.
It seems no or only light quality trimming is required before reads merging.
Solved in #27
https://github.com/pnnl/atlas/blob/ea2bbf94e420f6aa125815d29f3f0ab97ec6c790/atlas/rules/assemble.snakefile#L59
when processing paired end reads, some may be left as singletons, they are stored in the file
{sample}/sequence_quality_control/{sample}_00_se.fastq.gz
, but not taken up by the following rules.