outbreak-info / outbreak.info

During outbreaks of emerging diseases such as COVID-19, efficiently collecting, sharing, and integrating data is critical to scientific research. outbreak.info is a resource to aggregate all this information into a single location.
https://outbreak.info/
GNU General Public License v3.0

Date found different from PANGO record #377

Open ftrebien opened 3 years ago

ftrebien commented 3 years ago

The date found for lineage B.1.1.7 in Spain is different from the earliest sample date in Spain in PANGO. Why is there a difference? This difference also exists for B.1.351 in Qatar (PANGO) and P.1 in the US (PANGO).

gkarthik commented 3 years ago

Hello @ftrebien, it looks like one of the sequences from Spain has incorrect metadata (collection date: 2020-02-*). We expect this to be corrected on GISAID soon, after which the change will be reflected on outbreak.info.

babarlelephant commented 3 years ago

I would disallow the oldest date from moving back by more than 2 months when new sequences are added, at least for lineages with > 5000 sequences. This should filter out many such metadata and assignment errors. Sequences that fail some molecular clock constraint could perhaps also be excluded from the oldest-sample calculation. @gkarthik
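A minimal sketch of the guard proposed above, not outbreak.info's actual code. The function name, its parameters, and the reading of "2 months" as 61 days are assumptions made for illustration, and the dates in the usage note are invented.

```python
from datetime import date, timedelta


def updated_earliest_date(current_earliest, candidate, n_sequences,
                          min_sequences=5000,
                          max_backward_jump=timedelta(days=61)):
    """Return the earliest date to report after a new sequence arrives.

    Hypothetical helper: the 5000-sequence threshold comes from the
    comment above; 61 days is an assumed stand-in for "2 months".
    """
    # A candidate that is not older than the current record changes nothing.
    if candidate >= current_earliest:
        return current_earliest
    # Small lineages are not guarded, so any older date is accepted.
    if n_sequences <= min_sequences:
        return candidate
    # For large lineages, refuse a backward jump larger than the limit,
    # treating it as a likely metadata or lineage-assignment error.
    if current_earliest - candidate > max_backward_jump:
        return current_earliest
    return candidate
```

For example, with a large lineage whose recorded earliest date is 2020-09-20, a new sequence dated 2020-02-15 would be ignored, while one dated 2020-08-25 would be accepted. A real implementation would probably also want to log rejected sequences so they can be reported upstream.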

flaneuse commented 3 years ago

Thanks @babarlelephant, we've been thinking for a long time about how to better identify and filter out erroneous date metadata. We've been planning a simple first/second date check, as you suggest, to limit the compute time associated with the date check, and applying it only to lineages with a certain number of sequences is a good idea. We'll keep you posted. It seems like every month there's a mislabeled B.1.1.7 sequence.
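One possible reading of a "first/second date check" is sketched below: compare the two earliest collection dates of a lineage and, if they are far apart, treat the earliest as suspect. This is an assumption about what is meant, not the outbreak.info implementation; `first_detected`, `min_sequences`, and `max_gap` are hypothetical names, and the 61-day gap is an assumed default.

```python
import heapq
from datetime import date, timedelta


def first_detected(collection_dates, min_sequences=5000,
                   max_gap=timedelta(days=61)):
    """Return the reported first-detection date for one lineage.

    collection_dates is a non-empty list of datetime.date objects.
    """
    if not collection_dates:
        raise ValueError("collection_dates must not be empty")
    # Only the two smallest dates are needed, so avoid a full sort.
    two_earliest = heapq.nsmallest(2, collection_dates)
    # Skip the check for small lineages or when only one date exists.
    if len(collection_dates) <= min_sequences or len(two_earliest) < 2:
        return two_earliest[0]
    first, second = two_earliest
    # A large gap between the first and second dates suggests the first
    # one carries wrong metadata, so fall back to the second.
    if second - first > max_gap:
        return second
    return first
```

Because only one pair of dates is compared, the check stays cheap, but two mislabeled sequences with nearby dates would pass it unnoticed.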