jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License

Negative submission delay #185

Closed szhan closed 2 days ago

szhan commented 1 month ago

While looking at the metadata of samples (reportedly collected in 2020) in the version 0.4 data from Hunt et al. (2024), I found some instances of negative submission delays.

[Image: Screenshot 2024-05-21 at 09 09 16]

I then noticed that such cases are not being filtered out before inference (see the code here).

Perhaps we should modify the filtering condition with:

0 <= sample.submission_delay < max_submission_delay
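
As a minimal sketch of what the proposed condition would do, the snippet below applies it to a few made-up samples. The `Sample` class, the `filter_by_submission_delay` function, and the example values are hypothetical stand-ins for illustration only and are not taken from the sc2ts code.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical stand-in for a sample record; not the sc2ts class.
    strain: str
    submission_delay: int  # days between collection and submission


def filter_by_submission_delay(samples, max_submission_delay):
    """Keep samples whose delay is non-negative and below the maximum."""
    return [
        s for s in samples
        if 0 <= s.submission_delay < max_submission_delay
    ]


# Made-up example values, chosen to hit both boundaries.
samples = [
    Sample("A", -12),  # negative delay: dropped
    Sample("B", 0),    # zero delay: kept
    Sample("C", 29),   # just under the maximum: kept
    Sample("D", 30),   # equal to the maximum: dropped
]
kept = filter_by_submission_delay(samples, max_submission_delay=30)
print([s.strain for s in kept])
```

With the chained comparison, a delay of exactly 0 is kept and a delay equal to `max_submission_delay` is excluded.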

szhan commented 1 month ago

It's not entirely clear to me what the submission dates mean. The submission dates are aggregated from three sources: INSDC, GISAID, and COG UK. I'm not sure if the submission date is the date on which the submitter created a submission entry or the date on which the submitter actually hit the submit button.

szhan commented 1 month ago

Just checking some emails, the entries in ENA have first_created and first_public dates. Here we are taking first_created dates as the submission dates.
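
For illustration, here is a rough sketch of how a negative delay can arise when the delay is computed as the submission date minus the collection date. It assumes that definition and that both dates are ISO-formatted strings; the function name and the example dates are made up, and only the `first_created` field name comes from the comment above.

```python
from datetime import date


def submission_delay_days(collection_date, first_created):
    """Days from collection to submission (assumed: submission minus collection)."""
    submitted = date.fromisoformat(first_created)
    collected = date.fromisoformat(collection_date)
    return (submitted - collected).days


# Made-up dates: the entry was created 19 days after collection.
normal_delay = submission_delay_days("2020-03-01", "2020-03-20")

# Made-up dates: the entry predates the recorded collection date,
# which gives a negative delay.
negative_delay = submission_delay_days("2020-06-15", "2020-06-01")

print(normal_delay, negative_delay)
```

A negative value here means the recorded collection date is later than the date the entry was created, so one of the two dates is likely wrong or means something different from what is assumed.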

jeromekelleher commented 1 month ago

Probably best to filter negative for now

szhan commented 1 month ago

The 2020 trees with and without filtering samples with negative submission delay (n = 633) look very similar overall. I don't see a dramatic decrease in the number of reversions or immediate reversions. Also, the number of recombinants is the same (n = 26).

szhan commented 1 month ago

On a related note, I don't think that we are filtering by submission delay with this new Viridian dataset in the same way as we did before with the GISAID dataset, because the submission dates are not equivalent. I was just reading Martin's email again, and it seems that what we have as submission dates are the dates when submission entries were created on the ENA, not GISAID. I think it is quite probable that many groups submitted lots of their FASTQ files to the ENA (much) later than they submitted their genome sequences to GISAID. This may explain why we filter out so many samples using a maximum submission delay of 30 days (as shown above).

I wanted to match the samples in the metadata files that Kat and Martin provided in order to see how different the submission dates are. But Kat's file contains GISAID ids and strain names, whereas Martin's file contains GenBank and ENA accessions.

szhan commented 1 month ago

Given the amount of data being excluded using a maximum submission delay of 30 days, and that the GISAID and ENA submission dates don't seem equivalent, it makes sense not to rely on a submission-delay-based filter to exclude probable time travellers. Instead, let's see how much HMM cost will help.

szhan commented 2 days ago

I think we have decided to pursue using HMM cost to filter out probable time travellers. There are some signs that the HMM cost strategy helps (e.g., reducing the number of mutations in some 2020 ARGs; see #188). Unless we are going to build ARGs out of GISAID data again, we probably don't need to deal with negative submission delays, so closing this for now.