nextstrain / zika

Nextstrain build for Zika virus
https://nextstrain.org/zika
8 stars 10 forks source link

phylogenetic: Add root-sequence.json to rule `all` #56

Closed joverlee521 closed 2 months ago

joverlee521 commented 2 months ago

Description of proposed changes

Explicitly state that the root-sequence.json file is an expected output of the core phylogenetic workflow.

This also ensures that the Nextstrain automation rule deploy_all will include the root-sequence.json in the upload.

Related issue(s)

Follow up to https://github.com/nextstrain/zika/pull/51

Checklist

jameshadfield commented 2 months ago

I would consider in-lining the root sequence in the main JSON. Similarly for the related PRs recently made for other pathogens with small genomes. ("Small" being subjective, it's up to the workflow maintainer to decide if in-lining is appropriate.)

j23414 commented 2 months ago

Huh, just saw the augur export --include-root-sequence-inline along with docs:

--include-root-sequence-inline Export the root sequence (reference sequence for vcf) used to identify mutations as part of the main dataset JSON. This should only be used for small genomes for file size reasons. Default: False

Zika genome is around 10kb. @jameshadfield what range of lengths would you consider a small genome to switch to in-lining the root sequence? I see the same question in Augur PR 1295.

I'm not seeing --include-root-sequence-inline used in other repos from a github code search.

joverlee521 commented 2 months ago

I would consider in-lining the root sequence in the main JSON. Similarly for the related PRs recently made for other pathogens with small genomes. ("Small" being subjective, it's up to the workflow maintainer to decide if in-lining is appropriate.)

Sounds good, I'll update here in a separate PR and update the other related PRs.

what range of lengths would you consider a small genome to switch to in-lining the root sequence?

I haven't considered the actual length of the sequences, but I'm basing this decision on the size the existing root-sequence.json on S3.

$ aws s3 ls --human-readable s3://nextstrain-data/zika
2024-04-13 09:17:04  162.9 KiB zika.json
2023-05-19 18:48:18   58.7 KiB zika_entropy.json
2023-05-19 18:48:18    6.7 KiB zika_meta.json
2024-03-04 16:19:20    5.7 KiB zika_root-sequence.json
2023-05-19 18:48:18   44.5 KiB zika_sequences.json
2023-05-19 18:48:18  186.5 KiB zika_tree.json

The 5 KiB root-sequence.json is pretty negligible when the main Auspice JSON is only 163 KiB. Nextstrain datasets are limited by the 500MB memory cap in Chrome, so we'd be fine adding the root sequence inline.