Closed corneliusroemer closed 2 years ago
The regions
column defines the genomic coordinates attributed to each clade.
strain | clades | parents | intermissions | breakpoints | regions |
---|---|---|---|---|---|
Sample1 | 21J,21K | 2 | 0 | 1 | 0:21618|21J,21762:29742|21K |
Sample2 | 20J,21K,21M | 3 | 0 | 2 | 0:5386|21L,6308:20052|21K,20582:29132|21L&21M |
I'm still working through the code, but I'd be happy to implement this in my fork
Nice proposal @ktmeaton.
I like how you solved the problem that we may not know the position of the breakpoint exactly, and have only lower and upper bounds.
If I understand your proposal correctly, the regions would be the areas that we are sure of, with the unsure region not being part of any region. In this case, the breakpoint would be somewhere between 21618 and 21762:
0:21618|21J,21762:29742|21K
I'm wondering whether there's any extra information that could make the output file more useful.
It could be interesting to include private mutations, mutations that are present with respect to reference but that are in none of the parents. That would help creating covSpectrum queries and thus finding other samples of the same recombinant.
Yes, that's how I would interpret the regions (breakpoint between 21618 and 21762).
Would reporting private mutations generate unique output that Nextclade
doesn't provide? I like to run sc2rf
in tandem with Nextclade
, and then fuse and compare their results.
Yes, it would provide more comprehensive private mutations, because Nextclade will attach to the bigger donor and hence mask some interesting mutations that may occur in the nearest neighbour.
So for example, if we have a AY.39.1 / BA.2 recombinant, Nextclade would show private mutations with respect to AY.39.1 and that would exclude all the AY.39.1 mutations.
sc2rf would report all mutations that are in addition to 21J and BA.2, including the defining AY.39.1 mutations.
Thanks for the explanation, that seems very useful to report as well!
Would it be better to have an option that switches between screen and CSV file outputs, or a flag that specifies a path to optionally write CSV output?
show_matches
where some things you want reported like intermissions are being calculatedOkay I've got the beginnings of CSV output:
art@Wernstrom sc2rf % python3 sc2rf.py data/BA1-BA2-Finland.aln.fa --csvfile temp.csv
Potential recombinants between ['Omicron / BA.1 / 21K', 'Omicron / BA.2 / 21L']:
coordinates 1111111111111222222222222222222222222222222222222222222222222222222
223445899990000123455789011112222222222222333333333333344445566666677777888889
26780133334580144581427419067892566666778899000000245568914450502557823338238885
47933828942362944389041165516480777788781899144567002905432600867370558880718881
10027416344469879705804035582670834968563225308535235944804930400079892347111230
genes 1a 1b S 3aEM 6 7b9N
ref CTCACGCTGCACCCCGCACTCCCCACACCCGTGTCTCAGAGTGCAAGAATCACTCCGCATCCCCCACGCAGATCACGGGA
Omicron / BA.1 / 21K T••GT••GA••••T••AG•CTT••G•••TT••ACTCT•••••AACGAGTCAGTGAATATATTT•TGGA•C•••TTTAAC•
Omicron / BA.2 / 21L TGT•TAT••TGTTTTAA•T•T•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC
2022-05483 T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05506 T••GT••GA••••T••AG•CT•TTGTGT••AGANNNNNACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05508 T••GT••GA••••T••AG•CT•TTGNNT••AGN•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05526 T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05535 T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05565 T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAATNTA•TTTTNNATCCTCTTTAACC 1 BP
2022-05568 T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05602 T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
made with Sc2rf - available at https://github.com/lenaschimmel/sc2rf
art@Wernstrom sc2rf % head temp.csv
sample,examples,intermissions,breakpoints,regions
2022-05483,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05506,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05508,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05526,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05535,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05565,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05568,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05602,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
obviously haven't implemented regions
yet
https://github.com/ArtPoon/sc2rf/commit/6cdec7d360ee840a225252b6ac7352b92f4f9a10
Now reporting regions and private mutations in CSV: https://github.com/ArtPoon/sc2rf/commit/6c97ca540db89678ba90e655b5920da303634174
sample,examples,intermissions,breakpoints,regions,privates
2022-05483,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"670|Omicron/BA.1/21K,15240|Omicron/BA.2/21L","C3241T,G5924A,C22792T,C27945T"
2022-05506,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"670|Omicron/BA.1/21K,15240|Omicron/BA.2/21L","C3241T,G5924A,C22792T,C27945T"
2022-05508,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"670|Omicron/BA.1/21K,15240|Omicron/BA.2/21L","C3241T,G5924A,A22628C,A22634C,C22792T,C27945T"
2022-05526,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"670|Omicron/BA.1/21K,15240|Omicron/BA.2/21L","C21T,C3241T,G5924A,C22792T,C27945T,G29759T"
2022-05535,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"670|Omicron/BA.1/21K,15240|Omicron/BA.2/21L","C21T,C3241T,G5924A,C22792T,C27945T"
2022-05565,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"670|Omicron/BA.1/21K,15240|Omicron/BA.2/21L","T2886G,C3241T,G5924A,C22792T,C27945T"
2022-05568,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"670|Omicron/BA.1/21K,15240|Omicron/BA.2/21L","C3241T,G5924A,C22792T,C27945T"
2022-05602,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"670|Omicron/BA.1/21K,15240|Omicron/BA.2/21L","C3241T,G5924A,T19857A,C22792T,C27945T,G29759C"
issued PR #25
Thank you a lot for your PR and for putting up with my uncommented, chaotic code!
I just tested it and noticed that the format of the regions is different from the one @ktmeaton and @corneliusroemer discussed above. Is this a deliberate choice? I think I would prefer their version.
I was just about to implement that change myself, but realized it might be better to discuss it before.
I would prefer their version too, but I reached the end of the day and I'm supposed to be reading someone's thesis :-) If you can change the format, that would be great! And thanks @lenaschimmel for making this great and open resource. I'm seriously impressed you've put it together so quickly, and I love the retro command-line output.
Ok, I can try to change the format, but I can't promise that I'll find time for it (even it's only about 20 minutes, I guess) before Tuesday evening.
Ah it's ok, I'll work on it - I just got the impression from your earlier comment that you were already about to implement a format change!
@ktmeaton - your example indicates that the last column should report regions that match more than one example:
0:5386|21L,6308:20052|21K,20582:29132|21L&21M
Incorporating multiple matches in a given region is going to take a bit more work because we'll need to record this information under the condition len(matching_exs) < len(examples)
, and prev_definitive_match
is only tracking a single example. In other words I'd have to modify some of the data structures or logic in show_matches
, which is more likely to break something.
Actually cases where a region matches multiple examples are not even registered as breakpoints in the current code. If the list matching_exs
transitions from [ex1]
to [ex2]
(where exn is n-th example), this is registered as a breakpoint, but what if we transition from [ex1]
to [ex2, ex3]
? Right now the only action in the latter case is:
https://github.com/lenaschimmel/sc2rf/blob/c96954b95185c509e117a995c2d626474bbfb003/sc2rf.py#L671-L673
Is this an edge case that we can ignore, or should this also be considered a breakpoint that is not marked in the screen output, but recorded in the CSV?
In my view, recombinants of more than 2 parents are very rare and this is an edge case that could be ignored for now.
Could always be extended later on?
Triple recombinants are rather unlikely for now I think.
I've almost got it, though :-)
art@Wernstrom sc2rf % python3 sc2rf.py data/BA1-BA2-Finland.aln.fa --csvfile temp.csv > /dev/null
art@Wernstrom sc2rf % cat temp.csv
sample,examples,intermissions,breakpoints,regions
2022-05483,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05506,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05508,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05526,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05535,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05565,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05568,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05602,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
art@Wernstrom sc2rf % python3 sc2rf.py data/BA1-BA2-Finland.aln.fa --csvfile temp.csv --show-private-mutations > /dev/null
art@Wernstrom sc2rf % cat temp.csv
sample,examples,intermissions,breakpoints,regions,privates
2022-05483,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C3241T,G5924A,C22792T,C27945T"
2022-05506,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C3241T,G5924A,C22792T,C27945T"
2022-05508,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C3241T,G5924A,A22628C,A22634C,C22792T,C27945T"
2022-05526,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C21T,C3241T,G5924A,C22792T,C27945T,G29759T"
2022-05535,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C21T,C3241T,G5924A,C22792T,C27945T"
2022-05565,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","T2886G,C3241T,G5924A,C22792T,C27945T"
2022-05568,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C3241T,G5924A,C22792T,C27945T"
2022-05602,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C3241T,G5924A,T19857A,C22792T,C27945T,G29759C"
just need a test case to confirm that all examples are reported when two or more match a given region
@ktmeaton @corneliusroemer - while generating test cases, I realized that not all examples have a NextstrainClade
or PangoLineage
. This means that it is not possible to write outputs with just NextstrainClade
as requested. Is it okay if the CSV uses the full names?
@ArtPoon can you explain what you mean by "not all examples have a NextstrainClade or PangoLineage", I don't quite understand. Can you give an example of a full name?
> require(jsonlite)
> vprop <- read_json("virus_properties.json", simplifyVector=TRUE)
> vprop$variants[c("name", "NextstrainClade", "PangoLineage")]
name NextstrainClade PangoLineage
1 Alpha / B.1.1.7 / 20I 20I B.1.1.7
2 Beta / B.1.351 / 20H 20H B.1.351
3 Gamma / P.1 / 20J 20J P.1
4 Delta / B.1.617.2 / 21A 21A B.1.617.2
5 Delta / 21I 21I
6 Delta / 21J 21J
7 Delta / AY.4 AY.4
8 Delta / AY.103 AY.103
9 Delta / AY.43 AY.43
10 Delta / AY.44 AY.44
11 Delta / AY.122 AY.122
12 Delta / AY.3 AY.3
13 Delta / AY.4.2 AY.4.2
14 Delta / AY.100 AY.100
15 Delta / AY.25.1 AY.25.1
16 Delta / AY.25 AY.25
17 Kappa / B.1.617.1 / 21B 21B B.1.617.1
18 Epsilon / B.1.427 / B.1.429 / 21C 21C B.1.427 / B.1.429
19 Eta / B.1.525 / 21D 21D B.1.525
20 Iota / B.1.526 / 21F 21F B.1.526
21 Lambda / C.37 / 21G 21G C.37
22 Mu / B.1.621 / 21H 21H B.1.621
23 Omicron / BA.1 / 21K 21K BA.1
24 Omicron / BA.2 / 21L 21L BA.2
25 Omicron / BA.3 BA.3
26 Omicron / B.1.1.529 / 21M 21M B.1.1.529
27 B.1.177 / 20E 20E B.1.177
28 B.1.1.519 / 20B/S:732A 20B/S:732A B.1.1.519
29 B.1.620 / 20A/S:126A 20A/S:126A B.1.620
30 B.1.160 / 20A.EU2 20A.EU2 B.1.160
31 B.1.258 / 20A/S:439K 20A/S:439K B.1.258
32 B.1.221 / 20A/S:98F 20A/S:98F B.1.221
33 B.1.367 / 20C/S:80Y 20C/S:80Y B.1.367
34 B.1.1.277 / 20B/S:626S 20B/S:626S B.1.1.277
35 B.1.1.302 / 20B/S:1122L 20B/S:1122L B.1.1.302
@corneliusroemer wrote:
In my view, recombinants of more than 2 parents are very rare and this is an edge case that could be ignored for now.
Could always be extended later on?
Triple recombinants are rather unlikely for now I think.
Yes, I believe that triple recombinants will occur very rarely in reality - even though we might encounter recombinants where one of the two parents is already a recombinant, and maybe one that has not already been assigned an X...
pango lineage, so they may appear like triple recombinants. Let's just hope that all recombinants will be assigned before they spread far enough to recombine once again...
@ArtPoon wrote:
Is this an edge case that we can ignore, or should this also be considered a breakpoint that is not marked in the screen output, but recorded in the CSV?
I think it is okay to ignore this edge case for now, and not generate the proper output for the affected samples.
The final result of analyzing a recombinant sample will most likely show two parents and a small number of breakpoints.
Situations with more than two parents are not very relevant as final results, but might become important as intermediate results. Sometimes, Sc2rf will propose three or more potential parents when it can't be very sure about which two parents are the actual ones. I think it can be useful to show this to the user, so that they can decide it manually, or using some other tool. In the mid-future, I want to add another output mode to Sc2rf, so that the visual display of 3 or more potential parents is much more useful that it currently is. Once this is working, we might want to reduce the filtering that Sc2rf currently applies to avoid three-parent-hypothesis during analysing. Pherhaps, Sc2rf might even do something like "These are all potential parents for this sample with more than 5% likelyhood" and then list 7 parents, some of which are very similar to each other.
On the one hand, saving those unclear intermediate results to a file so that some other tool can try to make a better decision. On the other hand, I'm unsure if the current file format is a good approach to capture these unclear results, or if we would need another, more complex format anyway.
But let's not worry too much about that future now.
@lenaschimmel - thanks for the detailed response! I agree that it would be useful to indicate to the user when matches are ambiguous (with multiple parents).
Regarding my PR (#25), I am going to revise the labels to use the full name
rather than just NextstrainClade
and I think that should address this issue (unless I hear otherwise from @ktmeaton or @corneliusroemer). Will that be okay with you?
Absolutely!
Ok, pushed the above change. CSV now looks like this:
sample,examples,intermissions,breakpoints,regions,privates
2022-05483,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"21:14408|Omicron/BA.1/21K,15240:29759|Omicron/BA.2/21L","C3241T,G5924A,C22792T,C27945T"
2022-05506,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"21:14408|Omicron/BA.1/21K,15240:29759|Omicron/BA.2/21L","C3241T,G5924A,C22792T,C27945T"
2022-05508,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"21:14408|Omicron/BA.1/21K,15240:29759|Omicron/BA.2/21L","C3241T,G5924A,A22628C,A22634C,C22792T,C27945T"
2022-05526,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"21:14408|Omicron/BA.1/21K,15240:29759|Omicron/BA.2/21L","C21T,C3241T,G5924A,C22792T,C27945T,G29759T"
2022-05535,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"21:14408|Omicron/BA.1/21K,15240:29759|Omicron/BA.2/21L","C21T,C3241T,G5924A,C22792T,C27945T"
2022-05565,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"21:14408|Omicron/BA.1/21K,15240:29759|Omicron/BA.2/21L","T2886G,C3241T,G5924A,C22792T,C27945T"
2022-05568,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"21:14408|Omicron/BA.1/21K,15240:29759|Omicron/BA.2/21L","C3241T,G5924A,C22792T,C27945T"
2022-05602,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"21:14408|Omicron/BA.1/21K,15240:29759|Omicron/BA.2/21L","C3241T,G5924A,T19857A,C22792T,C27945T,G29759C"
Right now the output is good for interactive human analysis, but there's a lack of csv/tsv machine readable output for sharing/further analysis.
From my experience with Nextclade, main difficulty here is the design of the specs of the file, which columns to include etc, which separators to take if you need an intra-column separator etc.
Maybe best to discuss on this issue before implementing something as one will kind of get locked in to the format.