nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and Genbank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.
MIT License

Add recombinant to legacy mapping #407

Closed emmahodcroft closed 1 year ago

emmahodcroft commented 1 year ago

One possible clade output from Nextclade is recombinant. However, in the move to rename and remap the clades, it has gotten lost.

I use this in CoVariants, so it would be great if we could add recombinant back in. I think this is as simple as a one-to-one mapping, but I would appreciate it if someone could run a test to check that this works as expected.
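For illustration, here is a minimal Python sketch of what a one-to-one entry and a quick test of it could look like. The dictionary contents, the `to_legacy_clade` helper, and the `"?"` fallback for unmapped values are assumptions made for this example; the real `defaults/clade-legacy-mapping.yml` and `./bin/join-metadata-and-clades` are not shown in this thread and may work differently.

```python
# Hypothetical stand-in for defaults/clade-legacy-mapping.yml.
# The entries are illustrative and not copied from the real file.
LEGACY_MAPPING = {
    "21L": "21L (Omicron)",        # illustrative renamed clade
    "recombinant": "recombinant",  # the one-to-one entry proposed in this issue
}


def to_legacy_clade(clade, mapping=LEGACY_MAPPING, unmapped="?"):
    """Return the legacy name for a Nextclade clade.

    Clades missing from the mapping fall back to `unmapped` (assumed to
    be "?" here, matching the values seen in the metadata).
    """
    return mapping.get(clade, unmapped)


# Without the "recombinant" entry, the first call would return "?".
print(to_legacy_clade("recombinant"))
print(to_legacy_clade("not-a-clade"))
```

Dropping the `"recombinant"` key from the dictionary and rerunning shows the failure mode described above: the call silently falls back to `"?"`.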

corneliusroemer commented 1 year ago

Thanks for noticing my oversight in adding recombinant to the mapping. I could have noticed easily using:

zstdcat metadata.tsv.zst | tsv-summarize -H --count -g Nextstrain_clade

Which shows that there are some ? there:

[image: screenshot of the tsv-summarize output]

There will still be some ? left, for sequences where alignment fails. That might be a change: previously this might have been an empty string ``. Is that ok for CoVariants?
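The same per-clade count can be sketched in Python for anyone without `tsv-summarize` installed. This is only an illustration: `count_clades` is a hypothetical helper, and the sample rows are made up for the example rather than taken from the real metadata.

```python
import csv
import io
from collections import Counter


def count_clades(tsv_file, column="Nextstrain_clade"):
    """Count how often each value appears in one column of a TSV file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row[column] for row in reader)


# Made-up sample standing in for the decompressed metadata.tsv.
sample = (
    "strain\tNextstrain_clade\n"
    "seq1\t21L\n"
    "seq2\t?\n"
    "seq3\t?\n"
    "seq4\trecombinant\n"
)

counts = count_clades(io.StringIO(sample))
for clade, n in counts.most_common():
    print(clade, n, sep="\t")
```

A large `?` count next to a missing `recombinant` row is the pattern that would flag the lost mapping.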

corneliusroemer commented 1 year ago

This doesn't require full reruns, as the clade mapping happens on the full nextclade.tsv.

So in <24hr you should have the recombinant clade in metadata @emmahodcroft

rule generate_metadata:
    input:
        nextclade_tsv=f"data/{database}/nextclade.tsv",
        nextclade_21L_tsv=f"data/{database}/nextclade_21L.tsv",
        existing_metadata=f"data/{database}/metadata_transformed.tsv",
        clade_legacy_mapping="defaults/clade-legacy-mapping.yml",
    output:
        metadata=f"data/{database}/metadata.tsv",
    benchmark:
        f"benchmarks/generate_metadata_{database}.txt"
    shell:
        """
        ./bin/join-metadata-and-clades \
            --metadata {input.existing_metadata} \
            --nextclade-tsv {input.nextclade_tsv} \
            --nextclade-21L-tsv {input.nextclade_21L_tsv} \
            --clade-legacy-mapping {input.clade_legacy_mapping} \
            -o {output.metadata}
        """

emmahodcroft commented 1 year ago

Thanks @corneliusroemer, appreciate you looking this over!

Yes, I think things that get no call should be fine whether `` or `?` - I only look for Nextclade calls that match the clades I track (so I also ignore 20A etc), and everything else I check for SNPs to assign myself.
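As a rough sketch of why the placeholder value does not matter for this kind of consumer: if only calls in a tracked set are accepted, an empty string, `?`, and untracked clades such as 20A all fall through to the same path. The `TRACKED` set and the `nextclade_call` helper below are hypothetical and are not taken from the CoVariants code.

```python
# Hypothetical set of tracked clades; not the real CoVariants list.
TRACKED = {"21L", "recombinant"}


def nextclade_call(clade, tracked=TRACKED):
    """Return the Nextclade call if it is tracked, otherwise None.

    None means "no usable call", so the caller would fall back to its
    own SNP-based assignment.
    """
    return clade if clade in tracked else None


for value in ["recombinant", "20A", "?", ""]:
    print(repr(value), "->", nextclade_call(value))
```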