nextstrain / pathogen-repo-guide


ingest: Add lab host fields from NCBI Datasets #60

Closed joverlee521 closed 4 weeks ago

joverlee521 commented 3 months ago

Prompted by https://github.com/nextstrain/lassa/pull/19#discussion_r1707512603

Including the is-lab-host and lab-host fields from NCBI Datasets will help with programmatically excluding lab-passaged sequences from builds.
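As a rough illustration of what "programmatically excluding" could look like once the fields are in the ingest output, here is a minimal sketch. The column names `is_lab_host` and `lab_host`, the `"true"` string value, and the helper `drop_lab_hosted` are assumptions for illustration, not the actual ingest schema or an existing function.

```python
import csv
import io


def drop_lab_hosted(rows, flag_column="is_lab_host"):
    """Keep only rows whose lab-host flag is not 'true' (case-insensitive).

    `flag_column` is a hypothetical column name; adjust to the real metadata.
    """
    return [
        row for row in rows
        if (row.get(flag_column) or "").strip().lower() != "true"
    ]


# Tiny in-memory TSV standing in for an ingest metadata file.
SAMPLE_TSV = (
    "accession\tis_lab_host\tlab_host\n"
    "KY996417.1\ttrue\tVero E6 cells\n"
    "KF293666.1\t\t\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
kept = drop_lab_hosted(rows)
print([row["accession"] for row in kept])
```

Note that a filter like this only catches records where the flag is actually set; records with an empty flag pass through.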

joverlee521 commented 3 months ago

I was curious how well annotated the is-lab-host field is, so I looked into seasonal-cov/229e to compare it with the excluded passaged strains.

There was only 1 sequence that matched, and it was annotated as an outlier, not as lab passaged.

| Accession | Is Lab Host | Lab Host | dropped_strains |
|-----------|-------------|----------|-----------------|
| KY996417.1 | true | Vero E6 cells | |
| MF542265.1 | true | Vero E6 cells | |
| OQ920097.1 | true | Rhileki cells | |
| OQ920098.1 | true | Rhileki cells | |
| OQ920099.1 | true | Rhileki cells | |
| OQ920100.1 | true | Rhileki cells | |
| OQ920101.1 | true | Rhileki cells | |
| OR266950.1 | true | Vero 81 | |
| PP810610.1 | true | MRC-5 cells | PP810610.1 # extreme outlier in tree |
| Y09923.1 | true | MRC5 cells | |
| KF293666.1 | | | KF293666.1 # excluded because of repeated cell passage under selection |
| KF293665.1 | | | KF293665.1 # excluded because of repeated cell passage under selection |
| KF293664.1 | | | KF293664.1 # excluded because of repeated cell passage under selection |
| KF293663.1 | | | KF293663.1 # excluded because of repeated cell passage under selection |
| KF293662.1 | | | KF293662.1 # excluded because of repeated cell passage under selection |
| KF285482.1 | | | KF285482.1 # excluded because of repeated cell passage under selection |
| KF285481.1 | | | KF285481.1 # excluded because of repeated cell passage under selection |
| KF285480.1 | | | KF285480.1 # excluded because of repeated cell passage under selection |
| KF285479.1 | | | KF285479.1 # excluded because of repeated cell passage under selection |
| KF285478.1 | | | KF285478.1 # excluded because of repeated cell passage under selection |
| KF285477.1 | | | KF285477.1 # excluded because of repeated cell passage under selection |
| KF285476.1 | | | KF285476.1 # excluded because of repeated cell passage under selection |
| KF285475.1 | | | KF285475.1 # excluded because of repeated cell passage under selection |
| KF285474.1 | | | KF285474.1 # excluded because of repeated cell passage under selection |
| KF285473.1 | | | KF285473.1 # excluded because of repeated cell passage under selection |
| KF285472.1 | | | KF285472.1 # excluded because of repeated cell passage under selection |
| KF285471.1 | | | KF285471.1 # excluded because of repeated cell passage under selection |
| KF285470.1 | | | KF285470.1 # excluded because of repeated cell passage under selection |

This makes sense because the is-lab-host field seems to depend on the presence of the source/lab_host field in the GenBank record. The records that do not match lack the source/lab_host field, but they were flagged as passaged because they were included in a paper.
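For anyone wanting to repeat this comparison on another pathogen, a cross-check along these lines might work. This is only a sketch: it assumes the exclude file uses one `ACCESSION # comment` entry per line (as in the dropped_strains column above), and the helper names `parse_dropped_strains` and `cross_check` are hypothetical.

```python
def parse_dropped_strains(text):
    """Map accession -> comment from lines shaped like 'ACCESSION # reason'.

    Blank lines and full-line comments are skipped. The line format is an
    assumption based on the dropped_strains entries shown in this thread.
    """
    dropped = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        accession, _, comment = line.partition("#")
        dropped[accession.strip()] = comment.strip()
    return dropped


def cross_check(flagged, dropped):
    """Split accessions by whether they are lab-host flagged, dropped, or both."""
    flagged_ids = set(flagged)
    dropped_ids = set(dropped)
    return {
        "flagged_and_dropped": sorted(flagged_ids & dropped_ids),
        "flagged_only": sorted(flagged_ids - dropped_ids),
        "dropped_only": sorted(dropped_ids - flagged_ids),
    }


# A few entries taken from the table in this thread.
DROPPED_TEXT = """\
PP810610.1 # extreme outlier in tree
KF293666.1 # excluded because of repeated cell passage under selection
"""
flagged = ["KY996417.1", "PP810610.1"]  # accessions with is-lab-host = true

result = cross_check(flagged, parse_dropped_strains(DROPPED_TEXT))
print(result)
```

The `dropped_only` group corresponds to the records that were excluded by hand but carry no lab-host annotation in GenBank.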