Apply ITS1 filter at end of prepare-reads

peterjc commented 5 years ago

In discussion with @widdowquinn about when any synthetic controls in the reads should be dropped, I think we agreed this ought to be as part of thapbi_pict prepare-reads (see also #42). i.e. When we give the cleaned up FASTA files to thapbi_pict classify-reads they should control ITS1 like regions only.

In light of that, it makes sense to me to move the use of hmmscan and a HMM of ITS1 at the start of thapbi_pict classify-reads (currently used in both the identify and swarm methods) to the end of thapbi_pict prepare-reads.

Would need to apply deduplication after trimming the reads to the HMM matched region (although from a performance point of view, before and after might be faster - note would need to combined the abundance numbers like VSEARCH can do, see also #40)

peterjc commented 5 years ago

This would/could also replace the current hard coded trimming done in thapbi_pict classify-reads based on the expected position of the ITS1 based on the amplicon primers.

peterjc commented 5 years ago

Currently,

$ time thapbi_pict prepare-reads -o /tmp/cut tests/reads/DNAMIX_S95_L001_R1_001.fastq.gz tests/reads/DNAMIX_S95_L001_R2_001.fastq.gz -v
Preparing 1 paired FASTQ files
DEBUG: Temp folder of DNAMIX_S95_L001 is /tmp/tmp_6cf6tku
Calling command: trimmomatic PE -phred33 tests/reads/DNAMIX_S95_L001_R1_001.fastq.gz tests/reads/DNAMIX_S95_L001_R2_001.fastq.gz /tmp/tmp_6cf6tku/trimmomatic_R1.fastq /dev/null /tmp/tmp_6cf6tku/trimmomatic_R2.fastq /dev/null ILLUMINACLIP:/mnt/shared/scratch/synology/pc40583/apps/conda/bin/../share/trimmomatic/adapters/TruSeq3-PE.fa:2:30:10
Calling command: pear -f /tmp/tmp_6cf6tku/trimmomatic_R1.fastq -r /tmp/tmp_6cf6tku/trimmomatic_R2.fastq -o /tmp/tmp_6cf6tku/pear
921 unique sequences
Prepared DNAMIX_S95_L001 with 921 unique reads

real    0m6.930s
user    0m6.294s
sys 0m0.418s

With hmmscan used to filter and trim to the ITS1 region, rather than hard coded left/right cut:

$ time thapbi_pict prepare-reads -o /tmp/hmm tests/reads/DNAMIX_S95_L001_R1_001.fastq.gz tests/reads/DNAMIX_S95_L001_R2_001.fastq.gz -v
Preparing 1 paired FASTQ files
DEBUG: Temp folder of DNAMIX_S95_L001 is /tmp/tmpxqv77ugk
Calling command: trimmomatic PE -phred33 tests/reads/DNAMIX_S95_L001_R1_001.fastq.gz tests/reads/DNAMIX_S95_L001_R2_001.fastq.gz /tmp/tmpxqv77ugk/trimmomatic_R1.fastq /dev/null /tmp/tmpxqv77ugk/trimmomatic_R2.fastq /dev/null ILLUMINACLIP:/mnt/shared/scratch/synology/pc40583/apps/conda/bin/../share/trimmomatic/adapters/TruSeq3-PE.fa:2:30:10
Calling command: pear -f /tmp/tmpxqv77ugk/trimmomatic_R1.fastq -r /tmp/tmpxqv77ugk/trimmomatic_R2.fastq -o /tmp/tmpxqv77ugk/pear
Merged paired FASTQ reads into 1251 unique sequences
Cropped down to 892 unique ITS1 sequences
Prepared DNAMIX_S95_L001 with 892 unique reads

real    0m23.578s
user    0m13.025s
sys 0m1.863s

These are representative times from three repeats - it is slower as expected, but not yet a problem.

The output is very similar, here looking at the most abundant entries:

$ grep "^>"  /tmp/cut/DNAMIX_S95_L001.prepared.fasta | awk -F"_" '{print $2$1}' | sort -n | tail
13>a935bc95fa4748d56d76bdd91c534df5
37>a26540b661da0669ed602b193bb52b49
78>e8aa0b53937b895bcbd83b1ce87aad64
264>9577aa1a6c2da294ebbfa5c84cff65aa
294>2b8e7a64e493e5c191e92f21596b4256
296>6803c088eb1e53bb2e44bffecbaa0fdd
299>5122dde24762f8e3d6a54e3f79077254
385>0368431c779b6d3b2555f68f90d047c5
819>2ba87367bdbb87cc37521bed773ffa37
967>9e8f051c64c2b9cc3b6fcb27559418ca

$ grep "^>"  /tmp/hmm/DNAMIX_S95_L001.prepared.fasta | awk -F"_" '{print $2$1}' | sort -n | tail
13>a935bc95fa4748d56d76bdd91c534df5
37>a26540b661da0669ed602b193bb52b49
83>e8aa0b53937b895bcbd83b1ce87aad64
267>9577aa1a6c2da294ebbfa5c84cff65aa
298>6803c088eb1e53bb2e44bffecbaa0fdd
304>2b8e7a64e493e5c191e92f21596b4256
304>5122dde24762f8e3d6a54e3f79077254
395>0368431c779b6d3b2555f68f90d047c5
829>2ba87367bdbb87cc37521bed773ffa37
988>9e8f051c64c2b9cc3b6fcb27559418ca

Looking at the most abundant entry in both cases, and doing a grep (easy since the FASTA files currently do not use line wrapping) there is a single match for the HMM trimmed output, but for the fix cutting:

$ grep "CCACACCTAAAAAACTTTTCACGTGAACCGTATCAACCCTTTTAGTTGGGGGTCTTGCTTTTTTTGCGAGCCCTATCATGGCGAATGTTTGGACTTCGGTCTGGGCTAGTAGCTTTTTGTTTTAAACCCATTCTACAATACTGATTATACT" /tmp/cut/DNAMIX_S95_L001.prepared.fasta -B 1
>3cf14606c7fd0a5cdd0588566fe69ddd_5
ACCACACCTAAAAAACTTTTCACGTGAACCGTATCAACCCTTTTAGTTGGGGGTCTTGCTTTTTTTGCGAGCCCTATCATGGCGAATGTTTGGACTTCGGTCTGGGCTAGTAGCTTTTTGTTTTAAACCCATTCTACAATACTGATTATACT
--
>1d2301ab217de092a8f855cfec185c1f_1
CCACACCTAAAAAACTTTTCACGTGAACCGTATCAACCCTTTTAGTTGGGGGTCTTGCTTTTTTTGCGAGCCCTATCATGGCGAATGTTTGGACTTCGGTCTGGGCTAGTAGCTTTTTGTTTTAAACCCATTCTACAATACTGATTATACTGTGGGGACGAAAGTCCTTGCTCTCTTATACACCATCCTCCG
--
>aee62a912e607666a8cf759ed1d26df6_1
CCACACCTAAAAAACTTTTCACGTGAACCGTATCAACCCTTTTAGTTGGGGGTCTTGCTTTTTTTGCGAGCCCTATCATGGCGAATGTTTGGACTTCGGTCTGGGCTAGTAGCTTTTTGTTTTAAACCCATTCTACAATACTGATTATACTG
--
>9e8f051c64c2b9cc3b6fcb27559418ca_967
CCACACCTAAAAAACTTTTCACGTGAACCGTATCAACCCTTTTAGTTGGGGGTCTTGCTTTTTTTGCGAGCCCTATCATGGCGAATGTTTGGACTTCGGTCTGGGCTAGTAGCTTTTTGTTTTAAACCCATTCTACAATACTGATTATACT

i.e. With a leading A 5 times, with tailing GTGGGGACGAAAGTCCTTGCTCTCTTATACACCATCCTCCG once, and with a tailing G once, and the exact same sequence 967 times.

Note 5 + 1 + 1 + 967 = 974, but with the HMM we reported a higher abundance of 988.

This suggests in some cases the hard trim would result in some good sequences being over trimmed (and thus failing the HMM check).

peterjc commented 5 years ago

Branch updated to sort the output by decreasing abundance, then alphabetically by sequence. This makes the comparison a lot easier.

peterjc commented 5 years ago

Discussed with @widdowquinn, this does seem to be an improvement - although it would be interesting to see how searching for and trimming using the expected primer sequence would do in comparison -logged as #45.

peterjc / thapbi-pict

Apply ITS1 filter at end of prepare-reads #43