nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
https://clades.nextstrain.org
MIT License
210 stars 58 forks

DOCS: More detailed phylogenetic placement docs #742

Closed corneliusroemer closed 2 months ago

corneliusroemer commented 2 years ago

I've noticed that I'm not entirely sure myself how exactly the phylogenetic placement algorithm works, and the docs don't contain the details I'm interested in.

Open questions:

To document:

Relevant code is for example here: https://github.com/nextstrain/nextclade/blob/c9ba8c26f60dba45c366cefa43953cbc5fd785c0/packages/nextclade/src/tree/treeFindNearestNodes.cpp#L56-L68

ivan-aksamentov commented 2 years ago

Do we treat tails differently from the rest of the sequence? E.g. the first and last 100 bp? Gut feeling: No

No, I don't think so.

How are Ns in the reference sequence handled? In particular, the equation in the docs only mentions Ns in the query sequence. Is there an implicit false assumption that reference sequences never contain Ns?

I don't think we handle that. The reference sequence is expected to be a high-quality, complete sequence.

Can the (complicated) equation be simplified or at least explained in a simpler way? It seems not quite obvious what is being minimized.

The distance measures (dis-)similarity between a reference node and a query sequence in terms of mutations, and also tries to factor in missing and ambiguous data.

The formula for the distance you see in the docs is implemented here, just a few lines above what you linked:

https://github.com/nextstrain/nextclade/blob/c9ba8c26f60dba45c366cefa43953cbc5fd785c0/packages/nextclade/src/tree/treeFindNearestNodes.cpp#L48

The comments in this function should help a bit. But it just takes counts of certain events in the sequence and then sums them together in an empirical way. Decisions were made, and it happened to work well in practice.
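To make the "count events, then sum them" idea concrete, here is a minimal Python sketch of one such distance. It is an illustration under assumptions, not a transcription of the linked C++ function: the names `node_distance` and `find_nearest_node`, the dict/set data layout, and the exact weights on each count are hypothetical. It also assumes that query mutations and missing positions never overlap.

```python
def node_distance(node_muts, query_muts, query_missing):
    """Count-based distance between a tree node and a query sequence.

    node_muts, query_muts: dict of position -> nucleotide differing from the reference.
    query_missing: set of positions where the query has no usable data (e.g. N).
    All names and the data layout are hypothetical, chosen for illustration only.
    """
    shared_differences = 0  # same position, same mutated nucleotide
    shared_sites = 0        # same position, different mutated nucleotides
    undetermined_sites = 0  # node is mutated where the query has no data

    for pos, node_nuc in node_muts.items():
        if pos in query_muts:
            if query_muts[pos] == node_nuc:
                shared_differences += 1
            else:
                shared_sites += 1
        elif pos in query_missing:
            undetermined_sites += 1

    # Start from "all mutations differ", then subtract what agrees,
    # what was double-counted, and what cannot be judged.
    return (len(node_muts) + len(query_muts)
            - 2 * shared_differences - shared_sites - undetermined_sites)


def find_nearest_node(nodes, query_muts, query_missing):
    """Return the name of the node with the smallest distance to the query."""
    return min(nodes, key=lambda name: node_distance(nodes[name], query_muts, query_missing))


# Toy data: two candidate nodes and one query with a missing site at position 200.
nodes = {
    "A": {100: "T", 200: "G"},
    "B": {100: "T", 300: "C"},
}
query_muts = {100: "T", 300: "C", 400: "A"}
query_missing = {200}

for name in nodes:
    print(name, node_distance(nodes[name], query_muts, query_missing))
print("nearest:", find_nearest_node(nodes, query_muts, query_missing))
```

In this toy case node A ends up at distance 2 and node B at distance 1, so B is picked. The mutation at position 200 in node A is not counted against the query because the query has no data there, which is one way missing data can be factored in.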

None of this is absolute; it was figured out and then refined over time by Richard. As a scientist, this is your field of work, so don't hesitate to experiment, and let me know if you see any improvements there. Richard should be able to give some background.

I don't rule out the possibility of introducing multiple distance metrics, which could be chosen depending on the dataset or with a runtime flag, if that's helpful.