robert-koch-institut / SARS-CoV-2-Sequenzdaten_aus_Deutschland

A central component of successful pathogen surveillance is understanding how a pathogen spreads and what its pathogenic properties are. Knowledge of the pathogen genome is an important source of information here. Detecting mutations in a pathogen's genome makes it possible to reconstruct relationships...
https://robert-koch-institut.github.io/SARS-CoV-2-Sequenzdaten_aus_Deutschland/
Creative Commons Attribution 4.0 International
68 stars, 7 forks

About 10k sequences are missing pango calls in "Entwicklungslinien" #2

Closed corneliusroemer closed 2 years ago

corneliusroemer commented 2 years ago

When joining metadata and pango calls (from Entwicklungslinien) I noticed that about 10k sequences seem to have no pango calls.

What's the reason for this? Did these sequences not pass pango's QC requirements?

Here's a list of the affected sequence ids: missing_pango.csv

Here's head/tail:

IMS_ID,lineage,scorpio_call,IMS_ID,DATE_DRAW,SEQ_REASON,PROCESSING_DATE,SENDING_LAB_PC,SEQUENCING_LAB_PC
,,,IMS-10294-CVDP-00002,2021-01-14,X,2021-01-25,40225,40225
,,,IMS-10294-CVDP-00016,2020-08-18,X,2021-01-27,40210,40225
,,,IMS-10294-CVDP-00022,2020-08-20,X,2021-01-27,40210,40225
,,,IMS-10294-CVDP-00025,2020-08-28,X,2021-01-27,40210,40225
,,,IMS-10294-CVDP-00027,2020-08-28,X,2021-01-27,40210,40225
,,,IMS-10294-CVDP-00028,2020-08-28,X,2021-01-27,40210,40225
,,,IMS-10294-CVDP-00030,2020-08-28,X,2021-01-27,40210,40225
,,,IMS-10294-CVDP-00036,2020-08-26,X,2021-01-27,40210,40225
,,,IMS-10294-CVDP-00044,2020-09-01,X,2021-01-27,40210,40225
...
,,,IMS-10186-CVDP-CCD10BDC-1C8A-4599-B0A7-881ADBAB9C1D,2021-11-05,N,2021-11-25,40210,40210
,,,IMS-10186-CVDP-A5392812-199F-42EC-A93E-90556914EEC1,2021-11-08,N,2021-11-25,40210,40210
,,,IMS-10186-CVDP-5178419A-C7A0-4B3D-A944-0AB2ABDA82F3,2021-11-04,N,2021-11-25,40210,40210
,,,IMS-10186-CVDP-6F4F6FC4-7958-465C-A492-DC226473FDED,2021-11-05,N,2021-11-25,40210,40210
,,,IMS-10186-CVDP-A1B5C007-774C-4785-84E8-434639FFCB07,2021-11-04,N,2021-11-25,40210,40210
,,,IMS-10186-CVDP-97A13E2A-E6C9-483F-9778-01EACACA16CA,2021-11-04,N,2021-11-25,40210,40210
,,,IMS-10186-CVDP-A2D8E787-01C8-4AD1-B3BB-AF3354B76914,2021-11-04,N,2021-11-25,40210,40210
,,,IMS-10267-CVDP-ED09CC2B-A80B-4EA4-943A-738F1283601C,2021-11-14,N,2021-11-25,23564,44137
,,,IMS-10267-CVDP-A9AA4B6D-1521-490B-89F4-F6341008692A,2021-11-16,N,2021-11-25,23564,44137
,,,IMS-10267-CVDP-FEF692DE-86C2-4A5F-A28C-0FCAC1BD9F79,2021-11-17,N,2021-11-25,44137,44137
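
For readers who want to reproduce this kind of check, here is a minimal sketch of finding metadata rows without a lineage call. It is not the exact join used above: the two inline tables are made-up toy data, the helper names `read_rows` and `find_missing_lineages` are hypothetical, and only the column names `IMS_ID`, `lineage`, `scorpio_call`, `DATE_DRAW` and `SEQ_REASON` are taken from the output shown.

```python
import csv
import io

# Toy stand-ins for the metadata and lineage tables.
# All IDs and values below are made up for illustration.
METADATA_CSV = """IMS_ID,DATE_DRAW,SEQ_REASON
IMS-00000-CVDP-00001,2021-01-14,X
IMS-00000-CVDP-00002,2021-01-15,N
IMS-00000-CVDP-00003,2021-01-16,N
"""

LINEAGES_CSV = """IMS_ID,lineage,scorpio_call
IMS-00000-CVDP-00002,B.1.1.7,Alpha (B.1.1.7-like)
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def find_missing_lineages(metadata_rows, lineage_rows, key="IMS_ID"):
    """Return the metadata rows whose key has no non-empty lineage entry."""
    called = {
        row[key] for row in lineage_rows if row.get("lineage", "").strip()
    }
    return [row for row in metadata_rows if row[key] not in called]


missing = find_missing_lineages(read_rows(METADATA_CSV), read_rows(LINEAGES_CSV))
for row in missing:
    print(row["IMS_ID"], row["DATE_DRAW"])
```

With the real files, the two inline strings would be replaced by the contents of the metadata and lineage tables. A sequence counts as missing both when it is absent from the lineage table and when its `lineage` field is empty.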

cuehs commented 2 years ago

Hi @corneliusroemer

Yes, indeed. These are sequences that failed some of our QC checks. In that case no pango lineage is called. Our README needs to be updated (see also #3 and #4).

If requested, we could provide a new (potentially incomplete) column that indicates whether and why QC failed. Would that be helpful to you?

corneliusroemer commented 2 years ago

For me it's fine to just know that if sequences fail QC they don't get a pango lineage.

The exact reason is not necessarily important to me, but it may be of interest to submitters.

If you want to do more by way of QC, you could also run Nextclade on the sequences. It gives more details about frame shifts, stop codons, etc.
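
As a rough illustration of what could be done with such output, the sketch below reads a Nextclade-style TSV and lists, per sequence, which QC checks are not "good". The inline table is made-up toy data, the helper `qc_failures` is hypothetical, and the column names `seqName`, `qc.frameShifts.status` and `qc.stopCodons.status` are assumptions that should be checked against the Nextclade output documentation for the version in use.

```python
import csv
import io

# Toy Nextclade-style TSV. Sequence names and values are made up,
# and the column set is heavily abbreviated (assumed column names).
NEXTCLADE_TSV = (
    "seqName\tqc.overallStatus\tqc.frameShifts.status\tqc.stopCodons.status\n"
    "sample-A\tgood\tgood\tgood\n"
    "sample-B\tbad\tbad\tgood\n"
    "sample-C\tmediocre\tgood\tmediocre\n"
)


def qc_failures(tsv_text, status_columns):
    """Map each sequence name to the status columns that are not 'good'.

    A column that is absent from the table is also reported, so that a
    renamed or missing column does not silently pass.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    report = {}
    for row in reader:
        flagged = [col for col in status_columns if row.get(col, "") != "good"]
        if flagged:
            report[row["seqName"]] = flagged
    return report


failures = qc_failures(
    NEXTCLADE_TSV, ["qc.frameShifts.status", "qc.stopCodons.status"]
)
for name, columns in failures.items():
    print(name, columns)
```

A per-sequence list like this could feed the kind of "reason for QC failure" column discussed above, although the QC criteria used in this repository may differ from Nextclade's.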