Closed peterjc closed 5 years ago
This would/could also replace the current hard coded trimming done in thapbi_pict classify-reads
based on the expected position of the ITS1 based on the amplicon primers.
Currently,
$ time thapbi_pict prepare-reads -o /tmp/cut tests/reads/DNAMIX_S95_L001_R1_001.fastq.gz tests/reads/DNAMIX_S95_L001_R2_001.fastq.gz -v
Preparing 1 paired FASTQ files
DEBUG: Temp folder of DNAMIX_S95_L001 is /tmp/tmp_6cf6tku
Calling command: trimmomatic PE -phred33 tests/reads/DNAMIX_S95_L001_R1_001.fastq.gz tests/reads/DNAMIX_S95_L001_R2_001.fastq.gz /tmp/tmp_6cf6tku/trimmomatic_R1.fastq /dev/null /tmp/tmp_6cf6tku/trimmomatic_R2.fastq /dev/null ILLUMINACLIP:/mnt/shared/scratch/synology/pc40583/apps/conda/bin/../share/trimmomatic/adapters/TruSeq3-PE.fa:2:30:10
Calling command: pear -f /tmp/tmp_6cf6tku/trimmomatic_R1.fastq -r /tmp/tmp_6cf6tku/trimmomatic_R2.fastq -o /tmp/tmp_6cf6tku/pear
921 unique sequences
Prepared DNAMIX_S95_L001 with 921 unique reads
real 0m6.930s
user 0m6.294s
sys 0m0.418s
With hmmscan
used to filter and trim to the ITS1 region, rather than hard coded left/right cut:
$ time thapbi_pict prepare-reads -o /tmp/hmm tests/reads/DNAMIX_S95_L001_R1_001.fastq.gz tests/reads/DNAMIX_S95_L001_R2_001.fastq.gz -v
Preparing 1 paired FASTQ files
DEBUG: Temp folder of DNAMIX_S95_L001 is /tmp/tmpxqv77ugk
Calling command: trimmomatic PE -phred33 tests/reads/DNAMIX_S95_L001_R1_001.fastq.gz tests/reads/DNAMIX_S95_L001_R2_001.fastq.gz /tmp/tmpxqv77ugk/trimmomatic_R1.fastq /dev/null /tmp/tmpxqv77ugk/trimmomatic_R2.fastq /dev/null ILLUMINACLIP:/mnt/shared/scratch/synology/pc40583/apps/conda/bin/../share/trimmomatic/adapters/TruSeq3-PE.fa:2:30:10
Calling command: pear -f /tmp/tmpxqv77ugk/trimmomatic_R1.fastq -r /tmp/tmpxqv77ugk/trimmomatic_R2.fastq -o /tmp/tmpxqv77ugk/pear
Merged paired FASTQ reads into 1251 unique sequences
Cropped down to 892 unique ITS1 sequences
Prepared DNAMIX_S95_L001 with 892 unique reads
real 0m23.578s
user 0m13.025s
sys 0m1.863s
These are representative times from three repeats - it is slower as expected, but not yet a problem.
The output is very similar, here looking at the most abundant entries:
$ grep "^>" /tmp/cut/DNAMIX_S95_L001.prepared.fasta | awk -F"_" '{print $2$1}' | sort -n | tail
13>a935bc95fa4748d56d76bdd91c534df5
37>a26540b661da0669ed602b193bb52b49
78>e8aa0b53937b895bcbd83b1ce87aad64
264>9577aa1a6c2da294ebbfa5c84cff65aa
294>2b8e7a64e493e5c191e92f21596b4256
296>6803c088eb1e53bb2e44bffecbaa0fdd
299>5122dde24762f8e3d6a54e3f79077254
385>0368431c779b6d3b2555f68f90d047c5
819>2ba87367bdbb87cc37521bed773ffa37
967>9e8f051c64c2b9cc3b6fcb27559418ca
$ grep "^>" /tmp/hmm/DNAMIX_S95_L001.prepared.fasta | awk -F"_" '{print $2$1}' | sort -n | tail
13>a935bc95fa4748d56d76bdd91c534df5
37>a26540b661da0669ed602b193bb52b49
83>e8aa0b53937b895bcbd83b1ce87aad64
267>9577aa1a6c2da294ebbfa5c84cff65aa
298>6803c088eb1e53bb2e44bffecbaa0fdd
304>2b8e7a64e493e5c191e92f21596b4256
304>5122dde24762f8e3d6a54e3f79077254
395>0368431c779b6d3b2555f68f90d047c5
829>2ba87367bdbb87cc37521bed773ffa37
988>9e8f051c64c2b9cc3b6fcb27559418ca
Looking at the most abundant entry in both cases, and doing a grep (easy since the FASTA files currently do not use line wrapping) there is a single match for the HMM trimmed output, but for the fix cutting:
$ grep "CCACACCTAAAAAACTTTTCACGTGAACCGTATCAACCCTTTTAGTTGGGGGTCTTGCTTTTTTTGCGAGCCCTATCATGGCGAATGTTTGGACTTCGGTCTGGGCTAGTAGCTTTTTGTTTTAAACCCATTCTACAATACTGATTATACT" /tmp/cut/DNAMIX_S95_L001.prepared.fasta -B 1
>3cf14606c7fd0a5cdd0588566fe69ddd_5
ACCACACCTAAAAAACTTTTCACGTGAACCGTATCAACCCTTTTAGTTGGGGGTCTTGCTTTTTTTGCGAGCCCTATCATGGCGAATGTTTGGACTTCGGTCTGGGCTAGTAGCTTTTTGTTTTAAACCCATTCTACAATACTGATTATACT
--
>1d2301ab217de092a8f855cfec185c1f_1
CCACACCTAAAAAACTTTTCACGTGAACCGTATCAACCCTTTTAGTTGGGGGTCTTGCTTTTTTTGCGAGCCCTATCATGGCGAATGTTTGGACTTCGGTCTGGGCTAGTAGCTTTTTGTTTTAAACCCATTCTACAATACTGATTATACTGTGGGGACGAAAGTCCTTGCTCTCTTATACACCATCCTCCG
--
>aee62a912e607666a8cf759ed1d26df6_1
CCACACCTAAAAAACTTTTCACGTGAACCGTATCAACCCTTTTAGTTGGGGGTCTTGCTTTTTTTGCGAGCCCTATCATGGCGAATGTTTGGACTTCGGTCTGGGCTAGTAGCTTTTTGTTTTAAACCCATTCTACAATACTGATTATACTG
--
>9e8f051c64c2b9cc3b6fcb27559418ca_967
CCACACCTAAAAAACTTTTCACGTGAACCGTATCAACCCTTTTAGTTGGGGGTCTTGCTTTTTTTGCGAGCCCTATCATGGCGAATGTTTGGACTTCGGTCTGGGCTAGTAGCTTTTTGTTTTAAACCCATTCTACAATACTGATTATACT
i.e. With a leading A
5 times, with tailing GTGGGGACGAAAGTCCTTGCTCTCTTATACACCATCCTCCG
once, and with a tailing G
once, and the exact same sequence 967 times.
Note 5 + 1 + 1 + 967 = 974, but with the HMM we reported a higher abundance of 988.
This suggests in some cases the hard trim would result in some good sequences being over trimmed (and thus failing the HMM check).
Branch updated to sort the output by decreasing abundance, then alphabetically by sequence. This makes the comparison a lot easier.
Discussed with @widdowquinn, this does seem to be an improvement - although it would be interesting to see how searching for and trimming using the expected primer sequence would do in comparison -logged as #45.
In discussion with @widdowquinn about when any synthetic controls in the reads should be dropped, I think we agreed this ought to be as part of
thapbi_pict prepare-reads
(see also #42). i.e. When we give the cleaned up FASTA files tothapbi_pict classify-reads
they should control ITS1 like regions only.In light of that, it makes sense to me to move the use of
hmmscan
and a HMM of ITS1 at the start ofthapbi_pict classify-reads
(currently used in both theidentify
andswarm
methods) to the end ofthapbi_pict prepare-reads
.Would need to apply deduplication after trimming the reads to the HMM matched region (although from a performance point of view, before and after might be faster - note would need to combined the abundance numbers like VSEARCH can do, see also #40)