nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and Genbank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.
MIT License
36 stars 20 forks source link

Fix COG-UK data #191

Closed rneher closed 3 years ago

rneher commented 3 years ago

Most of the COG-UK is currently not included in our open builds despite being on genbank because of insufficient metadata. But COG-UK explicitly suggests to use metadata provided om CLIMB to patch this. Below is a sample comment in the genbank record.

COG_ACCESSION:COG-UK/BCNYJH/BIRM:20210316_1221_X5_FAP43200_37a8944f; COG_BASIC_QC COG_HIGH_QC:PASS; 
COG_NOTE:Sample metadata and QC flags may been updated since deposition in public databases. COG recommends users refer to data.covid19.climb.ac.uk for metadata and QC tables before conducting analysis.

Linking of the of the genbank data could either happen via the COG_ACCESSION in the comment, or via the ena_sample.secondary_accession. This currently exists for 339000 samples in genbank.

The following code snippet would load these data from CLIMB and turn them into a data frame indexed by ena_sample.secondary_accession that is also present in the genbank.ndjson as "biosample_accession":"SAMEA8463954".

import pandas as pd
import uuid

def load_coguk_sample_accessions():
    a = pd.read_csv("https://cog-uk.s3.climb.ac.uk/accessions/latest.tsv", sep='\t', index_col="central_sample_id")
    return a.loc[~a.index.duplicated(keep='first')]

def load_coguk_metadata():
    # function to generated sample id from sequence name
    def get_sample_id(x):
        try:
            sample_id = x.split('/')[1]
        except:
            sample_id = f'problemsample_{str(uuid.uuid4())}'
        return sample_id

    metadata = pd.read_csv("https://cog-uk.s3.climb.ac.uk/phylogenetics/latest/cog_metadata.csv", sep=',')[['sequence_name', 'country', 'adm1', 'is_pillar_2', 'sample_date','epi_week', 'lineage', 'lineages_version']]
    coguk_sample_ids = metadata.sequence_name.apply(get_sample_id)
    metadata.index=coguk_sample_ids
    return metadata.loc[~metadata.index.duplicated(keep='first')]

def join_coguk_metadata():
    return pd.concat([load_coguk_metadata(), load_coguk_sample_accessions()], axis='columns', copy=False)

def coguk_by_sample_accession(data):
    has_sample = data.loc[~data["ena_sample.secondary_accession"].isna()]
    has_sample.index = has_sample["ena_sample.secondary_accession"]
    return has_sample.loc[~has_sample.index.duplicated(keep='first')]

meta = join_coguk_metadata()
meta_by_biosample = coguk_by_sample_accession(meta)

To make this work, we would need to add this as a filter in transform-genbank somewhere here:

https://github.com/nextstrain/ncov-ingest/blob/master/bin/transform-genbank#L414

rneher commented 3 years ago

@trs @joverlee521 @WandrilleD, does this sound reasonable?