peterjc / thapbi-pict

Tree Health and Plant Biosecurity Initiative - Phytophthora ITS1 Classifier Tool
https://thapbi-pict.readthedocs.io/
MIT License
8 stars 2 forks source link

Ought to be using nhmmscan over hmmscan? #152

Closed peterjc closed 3 years ago

peterjc commented 5 years ago

HMMER v3.1 introduced specialised command line tools nhmmer/nhmmscan like hmmer/hmmscan but focused on DNA. Quoting the manual http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf

DNA sequence comparison. HMMER now includes tools that are specifically designed for DNA/DNA comparison: nhmmer and nhmmscan. The most notable improvement over using HMMER3’s tools is the ability to search long (e.g. chromosome length) target sequences.

Then later,

Modifications to the pipeline as used for DNA search.

SSV, not MSV.

In the MSV filter, one or more high-scoring ungapped segments contribute to a score that, if sufficiently high, casues the entire sequence to be passed on to the next stage (the bias filter). This strategy won’t work for long DNA sequences; it doesn’t filter the human genome much to say “there’s a hit on chromosome 1, now process the whole thing”. In the scanning-SSV (“Single ungapped Segment Viterbi”) algorithm used in nhmmer and nhmmscan, each comparison between a query and target is scanned for high-scoring ungapped alignment segments, and a window around each such segment is extracted, merging overlapping windows. Each window is then passed on to the remaining filter cascade, where it is (for the most part) treated as described above. As with the MSV filter, the default P-value threshold is 0.02, and can be controlled with the --F1 flag. The --max flag also controls the amount of the sequence database that passes the SSV filter, but instead of the threshold being set to 1.0, as described for the protein pipeline, it is set to 0.4.

The above does not seem relevant given the length of our Illumina sequenced ITS1 sequences.

There are no domains, but there are envelopes.

In HMMER’s protein-search programs, multiple matches of the model to a target sequence are treated as domains contained within a single hit for that sequence. In the DNA-search programs, each match of the model to a subsequence is treated as an independent hit - there’s no notion of a domain. This is largely a difference in reporting; both pipelines rely on essentially the same envelope detection code; envelopes lead to domains in protein search, and hits in DNA search.

We are not interested in domains - we already impose a minimum length on the HMM matches.

This difference in reporting has practical downsides for the parsing: https://github.com/biopython/biopython/issues/2139

Biased composition.

DNA sequence is littered with regions containing tandem simple repeats or other low complexity sequence. Without accounting for such composition bias, we see many cases in which one part of a hit is obviously legitimate, and serves as the anchor for a neighboring alignment segment that is clearly low-complexity garbage, one form of a problem known as homologous overextension (Gonzalez and Pearson, 2010). The null2 method used in protein search delays score modification until after the alignment is complete, but we know that this kind of overextension can be (mostly) avoided if the model’s log odds scores account for the composition bias of the target region while constructing the alignment. The DNA search pipeline therefore does just this: it modifies the scoring scheme for each target envelope as a function of that envelope’s sequence composition, then builds the alignment according to that scheme.

This may be relevant for some of the ITS1 sequences, would have to try it and see.

It would be interesting to see what difference switching from hmmscan to nhmmscan makes, but there is some ground work needed first on the parsing framework.

peterjc commented 3 years ago

Obsolete as of v0.9.3 where we dropped the last use of hmmer3, see #321