jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License

Ability to run a specific HMM match again? #155

Open hyanwong opened 1 year ago

hyanwong commented 1 year ago

Is there any simple sc2ts function which will take one of the output ARGs (say the long ARG) and match a specific sample into it again, at a specific time (i.e. ignoring younger samples)? This might help to look at the effect of changing the HMM params, or perhaps to find a set of equally likely matching candidates, or even to look at the posterior distribution around a breakpoint.

It would be good to give external users a simple way of doing this. @szhan was asking about it too.

jeromekelleher commented 1 year ago

It's there, you need the alignments though, otherwise you're matching the imputed sequence.

hyanwong commented 1 year ago

Ah, that's great news. I guess we just need some documentation then. Shall I change the title of the issue to reflect that?

szhan commented 1 year ago

A related comment, pretty much what Jerome is saying. I was thinking specifically about the task of finding a set of equally likely matching candidates. If we match against the nodes in an ARG using the imputed sequence rather than the original sequence, which may contain Ns and/or be masked post-alignment, then candidates with equal likelihood values may not truly be equally good matches. It would be useful to quickly check against the original aligned sequences, but I suppose that gets costly.
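To make the concern concrete, here is a toy sketch. It is not the sc2ts HMM and uses no sc2ts code. It ranks candidates by a plain mismatch count that skips sites where the query is missing, and the function names and example sequences are all made up for illustration. The point is only that the set of tied best candidates can change depending on whether the query is the original alignment (with Ns) or an imputed sequence.

```python
def mismatches(query, candidate, missing="N-"):
    """Count sites where query and candidate differ.

    Sites where the query holds a missing-data character are skipped,
    so they cannot distinguish between candidates.
    """
    if len(query) != len(candidate):
        raise ValueError("sequences must have the same length")
    return sum(
        1 for q, c in zip(query, candidate) if q not in missing and q != c
    )


def best_candidates(query, candidates):
    """Return (sorted names tied for the fewest mismatches, that count)."""
    scores = {name: mismatches(query, seq) for name, seq in candidates.items()}
    best = min(scores.values())
    return sorted(name for name, score in scores.items() if score == best), best


# Hypothetical candidates; A and B differ only at the last site.
candidates = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGTACGT"}

original_query = "ACGTACGN"  # last site is missing in the alignment
imputed_query = "ACGTACGT"   # same sample after the N has been imputed

tied_original = best_candidates(original_query, candidates)
tied_imputed = best_candidates(imputed_query, candidates)
```

With the original query, A and B are tied because the only site that separates them is missing. With the imputed query, the tie is broken by a base that was never observed.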

szhan commented 1 year ago

Is this a matter of feeding samples into match https://github.com/jeromekelleher/sc2ts/blob/1443e0fa85125f7018c30eb503f003d340032b13/sc2ts/inference.py#L277?

jeromekelleher commented 1 year ago

These other bits are much more complicated @szhan - I think that's a separate issue.

jeromekelleher commented 1 year ago

Shing's right, you need to feed the samples you're interested in to match.

There are some preconditions though:

You could wrap this all up easily enough if you just had a single alignment in a FASTA. This was built to make doing things incrementally quick with a large dataset: working with plain FASTAs was too slow, as was pulling metadata out of a CSV.
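As a rough sketch of the "single alignment in a FASTA" starting point, the snippet below reads exactly one record from a FASTA stream using only the standard library. The function name read_single_fasta is hypothetical and not part of sc2ts. How the resulting sequence would then be handed to match, and which preconditions apply, is not shown in this thread, so that step is deliberately left out.

```python
import io


def read_single_fasta(handle):
    """Return (name, sequence) for a FASTA stream holding exactly one record."""
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                raise ValueError("expected exactly one FASTA record")
            fields = line[1:].split()
            # The record name is the first whitespace-delimited token.
            name = fields[0] if fields else ""
        else:
            if name is None:
                raise ValueError("sequence data found before a header line")
            chunks.append(line.upper())
    if name is None:
        raise ValueError("no FASTA record found")
    return name, "".join(chunks)


# Made-up example record; a real alignment would be read from a file.
example = io.StringIO(">sample_1 example alignment\nACGT\nnnac\n")
name, sequence = read_single_fasta(example)
```

For a file on disk the same function can be called as read_single_fasta(open(path)), where path is whatever alignment file you have. Ns are kept as-is, so the original missing data is preserved rather than imputed.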