Open jeromekelleher opened 1 year ago
There's a more complex example that I just linked to for node 251176 in the long ARG:
Here we could do better by picking a different (equally likely) LHS sequence that also matched on the RHS. This more complex case is an HMM path problem, and I'm not sure it can be easily solved by post-hoc path altering? It's an important general case of the LS HMM though, both for Covid and in normal eukaryotic crossovers, so worth thinking through
I disagree - the HMM would have picked up such a sequence had it existed. There's nothing to fix here (see comment in other issue)
Maybe I'm not understanding here, or haven't explained carefully enough. In this case we match on the LHS to P0, which is one of potentially many identical sequences. We then switch to P1 (fine). We then switch again to get the far RHS to get a match against a 3rd parent.
It could be that out of the many first parents we chose, one has exactly the correct pattern after position 26604, right? There would be no way that the HMM would know that a-priori, because it doesn't "know" about non-adjacent parents. However, if there was such a parent, we could have 2 breakpoints but only 2 parents (because we would switch back). That would be more biologically likely, I think?
I don't think it's feasible to condition the search like that.
Switching back to the same parent if it happens to be identical is perfectly reasonable, anything else would involve searching the space of HMM results, which seems very hard.
There's no reason to suppose that a sequence like you suggest exists, anyway.
I don't think it's feasible to condition the search like that.
Yes, it's hard. Not to be done here unless we can think of a hack.
There's no reason to suppose that a sequence like you suggest exists, anyway.
Well, there is very good reason to suspect that we would want to switch back to the same parent. Even more so in normal crossovers or (particularly) gene conversion. So it is a general problem, but not one that I think we can easily tackle here.
FWIW, I expect the majority (or a very large minority) of Covid recombinants to be double switches back to the original parent. So I think it's likely to be quite an important effect.
Just so that I don't forget, I was chatting to Ana, and her ARG threading method can incorporate the back-switching idea, so I wonder if, in the long term, a hybrid of the LS HMM and her approach would pay dividends. I don't know how it scales, however.
By my calculations, there are only 20 recombination nodes out of 169 that would change into 2 parent / 2 breakpoint nodes in the long ARG. These are
140878, 176255, 177256, 244992, 249012, 273611, 305389, 357807, 429464, 456072, 491933, 573905, 583121, 594325, 638775, 670460, 705501, 731707, 757448, 758580
There are 15 nodes that are already of this sort in the long ARG:
135706, 175339, 229998, 231856, 260486, 261290, 298083, 452654, 498182, 537426, 555388, 556749, 697225, 753750, 774861
That leaves 134 that remain 3 parent/2 breakpoint:
93500, 128323, 135669, 136923, 144261, 144441, 149662, 169071, 185365, 192387, 192915, 194916, 195349, 198861, 211505, 213104, 213175, 222034, 227806, 230437, 234889, 240205, 241680, 241682, 242203, 246446, 249811, 251176, 252123, 252965, 254181, 255776, 257272, 257373, 260218, 261335, 263855, 264681, 264755, 265035, 267424, 271473, 277461, 277698, 279029, 280926, 280995, 283074, 283729, 285463, 287161, 288820, 288946, 296149, 299352, 299861, 301377, 301951, 308586, 309252, 326108, 334506, 337930, 341688, 349859, 354326, 355341, 357852, 361754, 364833, 391034, 391984, 392057, 405533, 421698, 435179, 460804, 478100, 480484, 483034, 497473, 497900, 498027, 508244, 517111, 530450, 541641, 545253, 559221, 564299, 566590, 568877, 569627, 569908, 571650, 572266, 572979, 574523, 575138, 576208, 576336, 584355, 584940, 586209, 608848, 625140, 626120, 628203, 628873, 630214, 634757, 636589, 640831, 641233, 651154, 651282, 653664, 653674, 661928, 666230, 705020, 717268, 734394, 738217, 746692, 756061, 759039, 761753, 762252, 763238, 765314, 767874, 774545, 776389
It hardly seems worth the effort changing the code!
I think it's worthwhile, it'll be handy to know these are "switchbacks" for downstream analysis.
The more general argument about exploring the HMM solution space is out of scope for now though, we just don't have the tools.
The more general argument about exploring the HMM solution space is out of scope for now though, we just don't have the tools.
Yes, definitely. Way out of scope. But good for tea-time chat once the preprint is out.
When implementing this, we can probably also jump back to the original parent if the number of mismatches against the original parent in the last segment is the same as the number of mismatches against the current path in that segment. They don't have to be identical.
We could also switch the ID of the first parent to the ID of the third parent if that's more parsimonious, but the logic gets a bit convoluted.
When we have 3 (or more) way recombinants we sometimes jump to different parents when it's equally likely that we could jump back to the original sequence. The HMM is just choosing arbitrarily from the sequences that are available, so we could do some simple checks to see if it's equally likely that we choose one of the existing parents and alter the path accordingly.