jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License
4 stars 3 forks source link

Add same-parent parsimony step for recombinants #101

Open jeromekelleher opened 1 year ago

jeromekelleher commented 1 year ago

When we have 3 (or more) way recombinants we sometimes jump to different parents when it's equally likely that we could jump back to the original sequence. The HMM is just choosing arbitrarily from the sequences that are available, so we could do some simple checks to see if it's equally likely that we choose one of the existing parents and alter the path accordingly.

hyanwong commented 1 year ago

There's a more complex example that I just linked to for node 251176 in the long ARG:

Screenshot 2023-03-09 at 12 18 36

Here we could do better by picking a different (equally likely) LHS sequence that also matched on the RHS. This more complex case is an HMM path problem, and I'm not sure it can be easily solved by post-hoc path altering? It's an important general case of the LS HMM though, both for Covid and in normal eukaryotic crossovers, so worth thinking through

jeromekelleher commented 1 year ago

I disagree - the HMM would have picked up such a sequence had it existed. There's nothing to fix here (see comment in other issue)

hyanwong commented 1 year ago

Maybe I'm not understanding here, or haven't explained carefully enough. In this case we match on the LHS to P0, which is one of potentially many identical sequences. We then switch to P1 (fine). We then switch again to get the far RHS to get a match against a 3rd parent.

It could be that out of the many first parents we chose, one has exactly the correct pattern after position 26604, right? There would be no way that the HMM would know that a-priori, because it doesn't "know" about non-adjacent parents. However, if there was such a parent, we could have 2 breakpoints but only 2 parents (because we would switch back). That would be more biologically likely, I think?

jeromekelleher commented 1 year ago

I don't think it's feasible to condition the search like that.

Switching back to the same parent if it happens to be identical is perfectly reasonable, anything else would involve searching the space of HMM results, which seems very hard.

There's no reason to suppose that a sequence like you suggest exists, anyway.

hyanwong commented 1 year ago

I don't think it's feasible to condition the search like that.

Yes, it's hard. Not to be done here unless we can think of a hack.

There's no reason to suppose that a sequence like you suggest exists, anyway.

Well, there is very good reason to suspect that we would want to switch back to the same parent. Even more so in normal crossovers or (particularly) gene conversion. So it is a general problem, but not one that I think we can easily tackle here.

hyanwong commented 1 year ago

FWIW, I expect the majority (or a very large minority) of Covid recombinants to be double switches back to the original parent. So I think it's likely to be quite an important effect.

hyanwong commented 1 year ago

Just so that I don't forget, I was chatting to Ana, and her ARG threading method can incorporate the back-switching idea, so I wonder if, in the long term, a hybrid of the LS HMM and her approach would pay dividends. I don't know how it scales, however.

hyanwong commented 1 year ago

By my calculations, there are only 20 recombination nodes out of 169 that would change into 2 parent / 2 breakpoint nodes in the long ARG. These are

140878, 176255, 177256, 244992, 249012, 273611, 305389, 357807, 429464, 456072, 491933, 573905, 583121, 594325, 638775, 670460, 705501, 731707, 757448, 758580

There are 15 nodes that are already of this sort in the long ARG:

135706, 175339, 229998, 231856, 260486, 261290, 298083, 452654, 498182, 537426, 555388, 556749, 697225, 753750, 774861

That leaves 134 that remain 3 parent/2 breakpoint:

93500, 128323, 135669, 136923, 144261, 144441, 149662, 169071, 185365, 192387, 192915, 194916, 195349, 198861, 211505, 213104, 213175, 222034, 227806, 230437, 234889, 240205, 241680, 241682, 242203, 246446, 249811, 251176, 252123, 252965, 254181, 255776, 257272, 257373, 260218, 261335, 263855, 264681, 264755, 265035, 267424, 271473, 277461, 277698, 279029, 280926, 280995, 283074, 283729, 285463, 287161, 288820, 288946, 296149, 299352, 299861, 301377, 301951, 308586, 309252, 326108, 334506, 337930, 341688, 349859, 354326, 355341, 357852, 361754, 364833, 391034, 391984, 392057, 405533, 421698, 435179, 460804, 478100, 480484, 483034, 497473, 497900, 498027, 508244, 517111, 530450, 541641, 545253, 559221, 564299, 566590, 568877, 569627, 569908, 571650, 572266, 572979, 574523, 575138, 576208, 576336, 584355, 584940, 586209, 608848, 625140, 626120, 628203, 628873, 630214, 634757, 636589, 640831, 641233, 651154, 651282, 653664, 653674, 661928, 666230, 705020, 717268, 734394, 738217, 746692, 756061, 759039, 761753, 762252, 763238, 765314, 767874, 774545, 776389

It hardly seems worth the effort changing the code!

jeromekelleher commented 1 year ago

I think it's worthwhile, it'll be handy to know these are "switchbacks" for downstream analysis.

jeromekelleher commented 1 year ago

The more general argument about exploring the HMM solution space is out of scope for now though, we just don't have the tools.

hyanwong commented 1 year ago

The more general argument about exploring the HMM solution space is out of scope for now though, we just don't have the tools.

Yes, definitely. Way out of scope. But good for tea-time chat once the preprint is out.

hyanwong commented 1 year ago

When implementing this, we can probably also jump back to the original parent if the number of mismatches against the original parent in the last segment is the same as the number of mismatches against the current path in that segment. They don't have to be identical.

We could also switch the ID of the first parent to the ID of the third parent if that's more parsimonious, but the logic gets a bit convoluted.