nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and Genbank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.
MIT License

Pull data from "SARS-CoV-2 Sequence Data from Germany" #331

Closed: joverlee521 closed this issue 1 year ago

joverlee521 commented 2 years ago

Context

Similar to #329, there has been a significant drop-off in sequences from Germany in the NCBI data since ~April 2022 (this issue was originally raised by @corneliusroemer in Slack):

[Image: Screen Shot 2022-07-29 at 10 40 29 AM]

Description

We can update the open pipeline to pull metadata and sequences directly from the "SARS-CoV-2 Sequence Data from Germany" GitHub repo.

Possible solution

Solutions similar to those in #329 will apply here.

joverlee521 commented 2 years ago

I think the solution for directly pulling RKI sequences will be similar to my ideas for the COG-UK data.

The step that differs here is how to remove the RKI sequences from the GenBank data. I have not found an accession linkage file for the RKI data. However, we can do a blanket removal of all sequences linked to the RKI BioProject.
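
A rough sketch of what the blanket removal could look like, assuming the GenBank metadata is available as dict-like records with a BioProject field. The field name `bioproject_accession` and the accession value below are placeholders for illustration, not values taken from the pipeline or from this thread:

```python
from typing import Dict, Iterable, Iterator

# Hypothetical placeholder: the real RKI BioProject accession is not given here.
RKI_BIOPROJECT = "PRJ_RKI_PLACEHOLDER"


def drop_bioproject(
    records: Iterable[Dict[str, str]],
    bioproject: str = RKI_BIOPROJECT,
    field: str = "bioproject_accession",  # assumed column name
) -> Iterator[Dict[str, str]]:
    """Yield only the records that are not linked to the given BioProject."""
    for record in records:
        # Records without the field are kept; only exact matches are dropped.
        if record.get(field, "") != bioproject:
            yield record
```

Because this filters on the BioProject alone, it needs no per-sequence accession linkage file, which is the point of the blanket approach.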

corneliusroemer commented 2 years ago

We could simply remove all German sequences uploaded to GenBank from March 2022 onwards and, as a quick fix, only spike in sequences from Germany's repo from that date onwards. This would be the 80/20 solution; more effort may not be worth it.
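
A minimal sketch of that quick fix, assuming both sources provide records with `country` and `date_submitted` (ISO `YYYY-MM-DD`) fields. Those field names, the exact cutoff day, and the function names are assumptions for illustration, not what the pipeline actually uses:

```python
from datetime import date
from typing import Dict, List, Optional

# Assumed cutoff: "March 2022 onwards" is read as on or after 2022-03-01.
CUTOFF = date(2022, 3, 1)


def parse_date(value) -> Optional[date]:
    """Parse an ISO date string, returning None if it is missing or malformed."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None


def keep_genbank_record(record: Dict[str, str], cutoff: date = CUTOFF) -> bool:
    """Keep non-German records, and German records submitted before the cutoff."""
    if record.get("country") != "Germany":
        return True
    submitted = parse_date(record.get("date_submitted"))
    # Records with an unparseable date are kept rather than silently dropped.
    return submitted is None or submitted < cutoff


def combine(genbank: List[dict], rki: List[dict], cutoff: date = CUTOFF) -> List[dict]:
    """Filter GenBank records, then add repo records dated on or after the cutoff."""
    kept = [r for r in genbank if keep_genbank_record(r, cutoff)]
    spiked = []
    for r in rki:
        submitted = parse_date(r.get("date_submitted"))
        if submitted is not None and submitted >= cutoff:
            spiked.append(r)
    return kept + spiked
```

The trade-off is that German GenBank records with an unparseable submission date stay in, so some duplicates may remain. That seems in keeping with the 80/20 spirit of the proposal.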

joverlee521 commented 1 year ago

Resolved by #365