Ability to run a specific HMM match again?

jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data

MIT License

Ability to run a specific HMM match again? #155

Open hyanwong opened 1 year ago

hyanwong commented 1 year ago

Is there any simple sc2ts function which will take one of the output ARGs (say the long ARG) and match a specific sample into it again, at a specific time (i.e. ignoring younger samples)? This might help to look at the effect of changing the HMM params, or perhaps to find a set of equally likely matching candidates, or even to look at the posterior distribution around a breakpoint.

It would be good to give external users a simple way of doing this. @szhan was asking about it too.

jeromekelleher commented 1 year ago

It's there. You need the alignments though; otherwise you're matching the imputed sequence.

hyanwong commented 1 year ago

Ah, that's great news. I guess we just need some documentation then. Shall I change the title of the issue to reflect that?

szhan commented 1 year ago

A related comment, pretty much what Jerome is saying. I was thinking specifically about the task of finding a set of equally likely matching candidates. If we match against the nodes in an ARG rather than the original sequences, which may contain Ns and/or be masked post-alignment, then candidates with equal likelihood values may not truly be equally good matches. It would be useful to quickly check against the original aligned sequences, but I suppose that gets costly.
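A toy sketch of that concern, in plain Python and not using sc2ts: two candidates tie when compared through their imputed sequences, but the original alignment of one of them has Ns, so the tie rests on fewer observed sites. The function name `compare` and the sequences are made up for illustration, and a simple mismatch count stands in for the HMM likelihood.

```python
def compare(query, candidate, missing="N"):
    """Return (mismatches, sites_compared), skipping sites missing in either sequence."""
    if len(query) != len(candidate):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = 0
    compared = 0
    for q, c in zip(query, candidate):
        if q == missing or c == missing:
            continue  # a missing site gives no evidence either way
        compared += 1
        if q != c:
            mismatches += 1
    return mismatches, compared


# Hypothetical data: the imputed sequences are identical, but candidate B's
# original alignment has four missing sites.
query = "ACGTACGT"
imputed = {"A": "ACGTACGT", "B": "ACGTACGT"}
original = {"A": "ACGTACGT", "B": "ACNNNNGT"}

for name in imputed:
    print(name, compare(query, imputed[name]), compare(query, original[name]))
```

Both candidates have zero mismatches against the imputed sequences, but only candidate A is supported at every site in the original alignment.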

szhan commented 1 year ago

Is this a matter of feeding samples into match? https://github.com/jeromekelleher/sc2ts/blob/1443e0fa85125f7018c30eb503f003d340032b13/sc2ts/inference.py#L277

jeromekelleher commented 1 year ago

These other bits are much more complicated @szhan - I think that's a separate issue.

jeromekelleher commented 1 year ago

Shing's right, you need to feed the samples you're interested in to match.

There are some preconditions though:

You need an AlignmentStore (which is big and awkward for the full dataset)
The samples are Sample instances created by pulling metadata out of the DB

You could wrap this all up easily enough if you just had a single alignment in a FASTA. This was built to make doing things incrementally quick with a large dataset. Working with plain FASTAs was too slow, as was pulling metadata out of a CSV.
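As a starting point for the single-FASTA case, here is a minimal sketch of reading one aligned sequence from a FASTA file handle. It uses only the Python standard library and does not call sc2ts. The helper name `read_single_fasta` is hypothetical, and turning the result into a Sample instance and passing it to match is left out because that interface is not described in this thread.

```python
import io


def read_single_fasta(handle):
    """Read exactly one record from a FASTA file handle; return (name, sequence)."""
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                raise ValueError("expected a single FASTA record")
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
        else:
            if name is None:
                raise ValueError("sequence data before FASTA header")
            chunks.append(line.upper())
    if name is None:
        raise ValueError("no FASTA record found")
    return name, "".join(chunks)


# Hypothetical in-memory example; with a real file, pass open(path) instead.
example = io.StringIO(">sample1 some description\nACGTN\nNACGT\n")
name, sequence = read_single_fasta(example)
print(name, len(sequence), sequence.count("N"))
```

Counting the Ns up front is a cheap way to see how much of the alignment is actually observed before comparing match results.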