Try an alternate workflow for getting the COVID reference tree for a specific date

bsweger commented 2 months ago

Currently, we call a Nextstrain API to retrieve the reference tree as of a specified date:

"""Return a reference tree as of a given date in YYYY-MM-DD format."""
    headers = {
        "Accept": "application/vnd.nextstrain.dataset.main+json",
    }
    session = get_session()
    session.headers.update(headers)

    response = requests.get(f"{base_url}@{as_of_date}", headers=headers)
    check_response(response)
    reference_data = response.json()

However, this approach may not be correct. Before updating the pipeline code, let's try an alternate method:

Upgrade to Nextclade 3.8 or higher.
Use the following command to get the dataset files for a sample date (2024-07-03):

nextclade dataset get -n sars-cov-2 --tag "2024-07-03--08-29-55Z" --output-dir data/

(Alternatively, use --output-zip instead of --output-dir to write the dataset to a single zip file.)
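The command above needs an exact dataset tag, so getting the tree "for a specific date" means mapping a date to a tag first. Below is a minimal Python sketch of that mapping. It assumes tags follow the `YYYY-MM-DD--HH-MM-SSZ` pattern seen above and that the trailing `Z` means UTC. The helper names are hypothetical, and where the list of available tags comes from is left open.

```python
from datetime import date, datetime, timezone

# Assumed tag layout, inferred from the sample tag "2024-07-03--08-29-55Z".
TAG_FORMAT = "%Y-%m-%d--%H-%M-%SZ"


def parse_tag(tag: str) -> datetime:
    """Parse a dataset tag into a timezone-aware UTC datetime."""
    return datetime.strptime(tag, TAG_FORMAT).replace(tzinfo=timezone.utc)


def latest_tag_as_of(tags, as_of: date):
    """Return the newest tag dated on or before as_of, or None if there is none."""
    eligible = [tag for tag in tags if parse_tag(tag).date() <= as_of]
    return max(eligible, key=parse_tag, default=None)
```

For example, `latest_tag_as_of(available_tags, date(2024, 7, 3))` would return the sample tag if it is the newest entry dated on or before July 3. Here `available_tags` is a placeholder for a list we would have to obtain separately.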

Use the outputs as input to the nextclade run command, which assigns clades to the GenBank sequences we downloaded from NCBI. For more context, see the corresponding rule in ncov-ingest: https://github.com/nextstrain/ncov-ingest/blob/c0d63f8f959705eda1bf0d3414127fc861919fe1/workflow/snakemake_rules/nextclade.smk#L175-L217
```
./{input.nextclade_path} run \
    -j {threads} \
    {input.sequences}\
    --input-dataset={input.dataset} \
    --output-tsv={output.info} \
    {params.translation_arg} \
    --output-fasta={output.alignment}
```

bsweger commented 2 months ago

This worked nicely!

# get the dataset .zip
nextclade dataset get -n sars-cov-2 --tag "2024-07-03--08-29-55Z" --output-zip zippy.zip

# assign clades to the GenBank sequences downloaded from NCBI, passing in the zip file created above as the reference sequence and tree
nextclade run data/ncbi_dataset/data/genomic.fna --input-dataset=zippy.zip --output-csv tabby.csv
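If this ends up in the pipeline, the two commands could be wrapped in Python along these lines. This is only a sketch: it assumes the `nextclade` binary (3.8 or higher) is on the PATH, uses only the flags shown above, and the function names and output file names are made up for illustration.

```python
import subprocess
from pathlib import Path


def dataset_get_cmd(tag, output_zip, name="sars-cov-2"):
    """Build the `nextclade dataset get` command for a specific dataset tag."""
    return ["nextclade", "dataset", "get", "-n", name,
            "--tag", tag, "--output-zip", str(output_zip)]


def run_cmd(sequences, dataset_zip, output_csv):
    """Build the `nextclade run` command that assigns clades to the sequences."""
    return ["nextclade", "run", str(sequences),
            f"--input-dataset={dataset_zip}", "--output-csv", str(output_csv)]


def assign_clades(tag, sequences, workdir):
    """Download the tagged dataset, then run clade assignment against it."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    dataset_zip = workdir / "dataset.zip"   # hypothetical file name
    output_csv = workdir / "clades.csv"     # hypothetical file name
    for cmd in (dataset_get_cmd(tag, dataset_zip),
                run_cmd(sequences, dataset_zip, output_csv)):
        # check=True raises CalledProcessError if nextclade exits non-zero
        subprocess.run(cmd, check=True)
    return output_csv
```

Keeping the command builders separate from `assign_clades` means the argument lists can be checked in unit tests without nextclade installed.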

The commands above produce roughly the same output as the existing version of the pipeline.
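One quick way to make "roughly" more concrete is to tally clade assignments from the CSV and compare the counts with the existing pipeline's output. The sketch below assumes the nextclade CSV is semicolon-delimited and has a `clade` column. Both are passed as parameters in case that assumption is off, and the function name is hypothetical.

```python
import csv
from collections import Counter


def clade_counts(csv_file, delimiter=";", column="clade"):
    """Count sequences per clade from an open nextclade CSV file object."""
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    # Skip rows where the clade is missing or empty
    return Counter(row[column] for row in reader if row.get(column))
```

Usage would look like `with open("tabby.csv", newline="") as f: counts = clade_counts(f)`, after which `counts.most_common()` lists clades by frequency.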

bsweger commented 2 months ago

Closing this, since the goal was to get enough information to ask good follow-up questions on the email thread with the Nextstrain folks.

reichlab/cladetime#7