jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License

Root node does not have the earliest sampling date #102

Open szhan opened 1 year ago

szhan commented 1 year ago

I was checking the collection date of the earliest sample node in upgma-mds-1000-md-30-mm-3-2022-06-30-recinfo-il.ts.tsz. min(ti.nodes_date) returns numpy.datetime64('2019-12-25'). But when I looked at core.py, I found that REFERENCE_DATE was set to '2019-12-26', which is the correct sample collection date (https://www.nature.com/articles/s41586-020-2008-3#Sec2). It is a bit odd that the root node has a collection date one day earlier than the collection date of the earliest sample.
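For anyone wanting to reproduce the comparison, here is a minimal sketch. It assumes the node dates are available as a NumPy datetime64 array, as ti.nodes_date appears to be; the array below is made-up stand-in data rather than values read from the .tsz file, and date_offset_days is a hypothetical helper, not part of sc2ts.

```python
import numpy as np

# Assumption: mirrors the REFERENCE_DATE value quoted from core.py above.
REFERENCE_DATE = np.datetime64("2019-12-26")


def date_offset_days(nodes_date, reference_date=REFERENCE_DATE):
    """Return the earliest date and how many days it precedes reference_date.

    A positive offset means the earliest node date is before the reference date.
    """
    earliest = np.min(nodes_date)
    offset = (reference_date - earliest) / np.timedelta64(1, "D")
    return earliest, int(offset)


# Hypothetical stand-in for ti.nodes_date (not real data from the tree sequence).
nodes_date = np.array(
    ["2019-12-25", "2019-12-26", "2020-01-05"], dtype="datetime64[D]"
)

earliest, offset = date_offset_days(nodes_date)
print(earliest, offset)
```

A non-zero offset flags the kind of mismatch described here.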

I suppose that it doesn't really matter, since it is only metadata. Maybe we could mention somewhere (an extra entry in the metadata dict?) that this genome was chosen as the root because it is universally used as the reference sequence for SARS-CoV-2, even though there exists another sample genome with an earlier collection date.
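As a rough illustration of the extra metadata entry, something like the following could work. This is only a sketch: the key name root_choice_note, the helper add_root_note, and the example metadata dict are all hypothetical and do not reflect the actual sc2ts metadata schema.

```python
import json


def add_root_note(metadata, note):
    """Return a copy of metadata with an explanatory note about the root choice."""
    updated = dict(metadata)  # copy so the caller's dict is not mutated
    updated["root_choice_note"] = note  # hypothetical key, not an sc2ts field
    return updated


# Hypothetical existing metadata for the root node.
root_metadata = {"date": "2019-12-26"}

updated = add_root_note(
    root_metadata,
    "Chosen as root because it is the standard SARS-CoV-2 reference sequence, "
    "not because it has the earliest collection date.",
)
print(json.dumps(updated, indent=2))
```

Keeping the note as a plain string means it stays JSON-serialisable alongside the rest of the metadata.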

jeromekelleher commented 1 year ago

Looks like a straightforward error on my part. We can fix it in the next version.