ENH: provide output optionally as csv/tsv for automated analysis/sharing

corneliusroemer commented 2 years ago

Right now the output is good for interactive human analysis, but there's a lack of csv/tsv machine readable output for sharing/further analysis.

From my experience with Nextclade, main difficulty here is the design of the specs of the file, which columns to include etc, which separators to take if you need an intra-column separator etc.

Maybe best to discuss on this issue before implementing something as one will kind of get locked in to the format.

ktmeaton commented 2 years ago

Here's an example of tabular structure that I would find very useful (made up data)!
I've tried to use similar delimiters as Nextclade does.
The regions column defines the genomic coordinates attributed to each clade.

strain clades parents intermissions breakpoints regions

Sample1 21J,21K 2 0 1 0:21618|21J,21762:29742|21K

Sample2 20J,21K,21M 3 0 2 0:5386|21L,6308:20052|21K,20582:29132|21L&21M

strain	clades	parents	intermissions	breakpoints	regions
Sample1	21J,21K	2	0	1	0:21618\|21J,21762:29742\|21K
Sample2	20J,21K,21M	3	0	2	0:5386\|21L,6308:20052\|21K,20582:29132\|21L&21M

ArtPoon commented 2 years ago

I'm still working through the code, but I'd be happy to implement this in my fork

corneliusroemer commented 2 years ago

Nice proposal @ktmeaton.

I like how you solved the problem that we may not know the position of the breakpoint exactly, and have only lower and upper bounds.

If I understand your proposal correctly, the regions would be the areas that we are sure of, with the unsure region not being part of any region. In this case, the breakpoint would be somewhere between 21618 and 21762:

0:21618|21J,21762:29742|21K

I'm wondering whether there's any extra information that could make the output file more useful.

It could be interesting to include private mutations, mutations that are present with respect to reference but that are in none of the parents. That would help creating covSpectrum queries and thus finding other samples of the same recombinant.

ktmeaton commented 2 years ago

Yes, that's how I would interpret the regions (breakpoint between 21618 and 21762).

Would reporting private mutations generate unique output that Nextclade doesn't provide? I like to run sc2rf in tandem with Nextclade, and then fuse and compare their results.

corneliusroemer commented 2 years ago

Yes, it would provide more comprehensive private mutations, because Nextclade will attach to the bigger donor and hence mask some interesting mutations that may occur in the nearest neighbour.

So for example, if we have a AY.39.1 / BA.2 recombinant, Nextclade would show private mutations with respect to AY.39.1 and that would exclude all the AY.39.1 mutations.

sc2rf would report all mutations that are in addition to 21J and BA.2, including the defining AY.39.1 mutations.

ktmeaton commented 2 years ago

Thanks for the explanation, that seems very useful to report as well!

ArtPoon commented 2 years ago

Would it be better to have an option that switches between screen and CSV file outputs, or a flag that specifies a path to optionally write CSV output?

ArtPoon commented 2 years ago

it's looking like it would be easier to tuck CSV writing into the show_matches where some things you want reported like intermissions are being calculated
this would mean the CSV rows would be structured by tuples of example indices (combinations of variants) in the outer loop, and then iterate over samples
if it's annoying for me to be commenting here let me know, I'll just carry over to my fork

ArtPoon commented 2 years ago

Okay I've got the beginnings of CSV output:

art@Wernstrom sc2rf % python3 sc2rf.py data/BA1-BA2-Finland.aln.fa --csvfile temp.csv

Potential recombinants between ['Omicron / BA.1 / 21K', 'Omicron / BA.2 / 21L']:

coordinates                                 1111111111111222222222222222222222222222222222222222222222222222222
                                 223445899990000123455789011112222222222222333333333333344445566666677777888889
                               26780133334580144581427419067892566666778899000000245568914450502557823338238885
                               47933828942362944389041165516480777788781899144567002905432600867370558880718881
                               10027416344469879705804035582670834968563225308535235944804930400079892347111230

genes                           1a                 1b     S                                  3aEM   6   7b9N    
ref                            CTCACGCTGCACCCCGCACTCCCCACACCCGTGTCTCAGAGTGCAAGAATCACTCCGCATCCCCCACGCAGATCACGGGA

Omicron / BA.1 / 21K           T••GT••GA••••T••AG•CTT••G•••TT••ACTCT•••••AACGAGTCAGTGAATATATTT•TGGA•C•••TTTAAC•
Omicron / BA.2 / 21L           TGT•TAT••TGTTTTAA•T•T•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC

2022-05483                     T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05506                     T••GT••GA••••T••AG•CT•TTGTGT••AGANNNNNACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05508                     T••GT••GA••••T••AG•CT•TTGNNT••AGN•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05526                     T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05535                     T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05565                     T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAATNTA•TTTTNNATCCTCTTTAACC 1 BP
2022-05568                     T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05602                     T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP

made with Sc2rf - available at https://github.com/lenaschimmel/sc2rf

art@Wernstrom sc2rf % head temp.csv
sample,examples,intermissions,breakpoints,regions
2022-05483,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05506,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05508,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05526,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05535,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05565,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05568,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05602,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,

obviously haven't implemented regions yet https://github.com/ArtPoon/sc2rf/commit/6cdec7d360ee840a225252b6ac7352b92f4f9a10

ArtPoon commented 2 years ago

Now reporting regions and private mutations in CSV: https://github.com/ArtPoon/sc2rf/commit/6c97ca540db89678ba90e655b5920da303634174

sample,examples,intermissions,breakpoints,regions,privates
2022-05483,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"670|Omicron/BA.1/21K,15240|Omicron/BA.2/21L","C3241T,G5924A,C22792T,C27945T"
2022-05506,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"670|Omicron/BA.1/21K,15240|Omicron/BA.2/21L","C3241T,G5924A,C22792T,C27945T"
2022-05508,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"670|Omicron/BA.1/21K,15240|Omicron/BA.2/21L","C3241T,G5924A,A22628C,A22634C,C22792T,C27945T"
2022-05526,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"670|Omicron/BA.1/21K,15240|Omicron/BA.2/21L","C21T,C3241T,G5924A,C22792T,C27945T,G29759T"
2022-05535,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"670|Omicron/BA.1/21K,15240|Omicron/BA.2/21L","C21T,C3241T,G5924A,C22792T,C27945T"
2022-05565,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"670|Omicron/BA.1/21K,15240|Omicron/BA.2/21L","T2886G,C3241T,G5924A,C22792T,C27945T"
2022-05568,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"670|Omicron/BA.1/21K,15240|Omicron/BA.2/21L","C3241T,G5924A,C22792T,C27945T"
2022-05602,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"670|Omicron/BA.1/21K,15240|Omicron/BA.2/21L","C3241T,G5924A,T19857A,C22792T,C27945T,G29759C"

ArtPoon commented 2 years ago

issued PR #25

lenaschimmel commented 2 years ago

Thank you a lot for your PR and for putting up with my uncommented, chaotic code!

I just tested it and noticed that the format of the regions is different from the one @ktmeaton and @corneliusroemer discussed above. Is this a deliberate choice? I think I would prefer their version.

I was just about to implement that change myself, but realized it might be better to discuss it before.

ArtPoon commented 2 years ago

I would prefer their version too, but I reached the end of the day and I'm supposed to be reading someone's thesis :-) If you can change the format, that would be great! And thanks @lenaschimmel for making this great and open resource. I'm seriously impressed you've put it together so quickly, and I love the retro command-line output.

lenaschimmel commented 2 years ago

Ok, I can try to change the format, but I can't promise that I'll find time for it (even it's only about 20 minutes, I guess) before Tuesday evening.

ArtPoon commented 2 years ago

Ah it's ok, I'll work on it - I just got the impression from your earlier comment that you were already about to implement a format change!

ArtPoon commented 2 years ago

@ktmeaton - your example indicates that the last column should report regions that match more than one example: 0:5386|21L,6308:20052|21K,20582:29132|21L&21M

Incorporating multiple matches in a given region is going to take a bit more work because we'll need to record this information under the condition len(matching_exs) < len(examples), and prev_definitive_match is only tracking a single example. In other words I'd have to modify some of the data structures or logic in show_matches, which is more likely to break something.

ArtPoon commented 2 years ago

Actually cases where a region matches multiple examples are not even registered as breakpoints in the current code. If the list matching_exs transitions from [ex1] to [ex2] (where exn is n-th example), this is registered as a breakpoint, but what if we transition from [ex1] to [ex2, ex3]? Right now the only action in the latter case is: https://github.com/lenaschimmel/sc2rf/blob/c96954b95185c509e117a995c2d626474bbfb003/sc2rf.py#L671-L673

Is this an edge case that we can ignore, or should this also be considered a breakpoint that is not marked in the screen output, but recorded in the CSV?

corneliusroemer commented 2 years ago

In my view, recombinants of more than 2 parents are very rare and this is an edge case that could be ignored for now.

Could always be extended later on?

Triple recombinants are rather unlikely for now I think.

ArtPoon commented 2 years ago

I've almost got it, though :-)

ArtPoon commented 2 years ago

art@Wernstrom sc2rf % python3 sc2rf.py data/BA1-BA2-Finland.aln.fa --csvfile temp.csv > /dev/null
art@Wernstrom sc2rf % cat temp.csv
sample,examples,intermissions,breakpoints,regions
2022-05483,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05506,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05508,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05526,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05535,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05565,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05568,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05602,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
art@Wernstrom sc2rf % python3 sc2rf.py data/BA1-BA2-Finland.aln.fa --csvfile temp.csv --show-private-mutations > /dev/null
art@Wernstrom sc2rf % cat temp.csv
sample,examples,intermissions,breakpoints,regions,privates
2022-05483,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C3241T,G5924A,C22792T,C27945T"
2022-05506,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C3241T,G5924A,C22792T,C27945T"
2022-05508,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C3241T,G5924A,A22628C,A22634C,C22792T,C27945T"
2022-05526,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C21T,C3241T,G5924A,C22792T,C27945T,G29759T"
2022-05535,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C21T,C3241T,G5924A,C22792T,C27945T"
2022-05565,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","T2886G,C3241T,G5924A,C22792T,C27945T"
2022-05568,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C3241T,G5924A,C22792T,C27945T"
2022-05602,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C3241T,G5924A,T19857A,C22792T,C27945T,G29759C"

just need a test case to confirm that all examples are reported when two or more match a given region

ArtPoon commented 2 years ago

@ktmeaton @corneliusroemer - while generating test cases, I realized that not all examples have a NextstrainClade or PangoLineage. This means that it is not possible to write outputs with just NextstrainClade as requested. Is it okay if the CSV uses the full names?

corneliusroemer commented 2 years ago

@ArtPoon can you explain what you mean by "not all examples have a NextstrainClade or PangoLineage", I don't quite understand. Can you give an example of a full name?

ArtPoon commented 2 years ago

> require(jsonlite)
> vprop <- read_json("virus_properties.json", simplifyVector=TRUE)
> vprop$variants[c("name", "NextstrainClade", "PangoLineage")]
                                name NextstrainClade      PangoLineage
1              Alpha / B.1.1.7 / 20I             20I           B.1.1.7
2               Beta / B.1.351 / 20H             20H           B.1.351
3                  Gamma / P.1 / 20J             20J               P.1
4            Delta / B.1.617.2 / 21A             21A         B.1.617.2
5                        Delta / 21I             21I                  
6                        Delta / 21J             21J                  
7                       Delta / AY.4                              AY.4
8                     Delta / AY.103                            AY.103
9                      Delta / AY.43                             AY.43
10                     Delta / AY.44                             AY.44
11                    Delta / AY.122                            AY.122
12                      Delta / AY.3                              AY.3
13                    Delta / AY.4.2                            AY.4.2
14                    Delta / AY.100                            AY.100
15                   Delta / AY.25.1                           AY.25.1
16                     Delta / AY.25                             AY.25
17           Kappa / B.1.617.1 / 21B             21B         B.1.617.1
18 Epsilon / B.1.427 / B.1.429 / 21C             21C B.1.427 / B.1.429
19               Eta / B.1.525 / 21D             21D           B.1.525
20              Iota / B.1.526 / 21F             21F           B.1.526
21               Lambda / C.37 / 21G             21G              C.37
22                Mu / B.1.621 / 21H             21H           B.1.621
23              Omicron / BA.1 / 21K             21K              BA.1
24              Omicron / BA.2 / 21L             21L              BA.2
25                    Omicron / BA.3                              BA.3
26         Omicron / B.1.1.529 / 21M             21M         B.1.1.529
27                     B.1.177 / 20E             20E           B.1.177
28            B.1.1.519 / 20B/S:732A      20B/S:732A         B.1.1.519
29              B.1.620 / 20A/S:126A      20A/S:126A           B.1.620
30                 B.1.160 / 20A.EU2         20A.EU2           B.1.160
31              B.1.258 / 20A/S:439K      20A/S:439K           B.1.258
32               B.1.221 / 20A/S:98F       20A/S:98F           B.1.221
33               B.1.367 / 20C/S:80Y       20C/S:80Y           B.1.367
34            B.1.1.277 / 20B/S:626S      20B/S:626S         B.1.1.277
35           B.1.1.302 / 20B/S:1122L     20B/S:1122L         B.1.1.302

ArtPoon commented 2 years ago

Also see https://github.com/lenaschimmel/sc2rf/blob/main/mapping.csv

lenaschimmel commented 2 years ago

@corneliusroemer wrote:

In my view, recombinants of more than 2 parents are very rare and this is an edge case that could be ignored for now.

Could always be extended later on?

Triple recombinants are rather unlikely for now I think.

Yes, I believe that triple recombinants will occur very rarely in reality - even though we might encounter recombinants where one of the two parents is already a recombinant, and maybe one that has not already been assigned an X... pango lineage, so they may appear like triple recombinants. Let's just hope that all recombinants will be assigned before they spread far enough to recombine once again...

@ArtPoon wrote:

Is this an edge case that we can ignore, or should this also be considered a breakpoint that is not marked in the screen output, but recorded in the CSV?

I think it is okay to ignore this edge case for now, and not generate the proper output for the affected samples.

The case for results with many parents

The final result of analyzing a recombinant sample will most likely show two parents and a small number of breakpoints.

Situations with more than two parents are not very relevant as final results, but might become important as intermediate results. Sometimes, Sc2rf will propose three or more potential parents when it can't be very sure about which two parents are the actual ones. I think it can be useful to show this to the user, so that they can decide it manually, or using some other tool. In the mid-future, I want to add another output mode to Sc2rf, so that the visual display of 3 or more potential parents is much more useful that it currently is. Once this is working, we might want to reduce the filtering that Sc2rf currently applies to avoid three-parent-hypothesis during analysing. Pherhaps, Sc2rf might even do something like "These are all potential parents for this sample with more than 5% likelyhood" and then list 7 parents, some of which are very similar to each other.

On the one hand, saving those unclear intermediate results to a file so that some other tool can try to make a better decision. On the other hand, I'm unsure if the current file format is a good approach to capture these unclear results, or if we would need another, more complex format anyway.

But let's not worry too much about that future now.

ArtPoon commented 2 years ago

@lenaschimmel - thanks for the detailed response! I agree that it would be useful to indicate to the user when matches are ambiguous (with multiple parents).
Regarding my PR (#25), I am going to revise the labels to use the full name rather than just NextstrainClade and I think that should address this issue (unless I hear otherwise from @ktmeaton or @corneliusroemer). Will that be okay with you?

lenaschimmel commented 2 years ago

Absolutely!

ArtPoon commented 2 years ago

Ok, pushed the above change. CSV now looks like this:

sample,examples,intermissions,breakpoints,regions,privates
2022-05483,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"21:14408|Omicron/BA.1/21K,15240:29759|Omicron/BA.2/21L","C3241T,G5924A,C22792T,C27945T"
2022-05506,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"21:14408|Omicron/BA.1/21K,15240:29759|Omicron/BA.2/21L","C3241T,G5924A,C22792T,C27945T"
2022-05508,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"21:14408|Omicron/BA.1/21K,15240:29759|Omicron/BA.2/21L","C3241T,G5924A,A22628C,A22634C,C22792T,C27945T"
2022-05526,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"21:14408|Omicron/BA.1/21K,15240:29759|Omicron/BA.2/21L","C21T,C3241T,G5924A,C22792T,C27945T,G29759T"
2022-05535,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"21:14408|Omicron/BA.1/21K,15240:29759|Omicron/BA.2/21L","C21T,C3241T,G5924A,C22792T,C27945T"
2022-05565,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"21:14408|Omicron/BA.1/21K,15240:29759|Omicron/BA.2/21L","T2886G,C3241T,G5924A,C22792T,C27945T"
2022-05568,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"21:14408|Omicron/BA.1/21K,15240:29759|Omicron/BA.2/21L","C3241T,G5924A,C22792T,C27945T"
2022-05602,"Omicron/BA.1/21K,Omicron/BA.2/21L",0,1,"21:14408|Omicron/BA.1/21K,15240:29759|Omicron/BA.2/21L","C3241T,G5924A,T19857A,C22792T,C27945T,G29759C"

lenaschimmel / sc2rf

ENH: provide output optionally as csv/tsv for automated analysis/sharing #19

The case for results with many parents