Closed joverlee521 closed 2 months ago
I would consider in-lining the root sequence in the main JSON. Similarly for the related PRs recently made for other pathogens with small genomes. ("Small" being subjective, it's up to the workflow maintainer to decide if in-lining is appropriate.)
Huh, just saw the augur export --include-root-sequence-inline
along with docs:
--include-root-sequence-inline Export the root sequence (reference sequence for vcf) used to identify mutations as part of the main dataset JSON. This should only be used for small genomes for file size reasons. Default: False
Zika genome is around 10kb. @jameshadfield what range of lengths would you consider a small genome to switch to in-lining the root sequence
? I see the same question in Augur PR 1295.
I'm not seeing --include-root-sequence-inline
used in other repos from a github code search.
I would consider in-lining the root sequence in the main JSON. Similarly for the related PRs recently made for other pathogens with small genomes. ("Small" being subjective, it's up to the workflow maintainer to decide if in-lining is appropriate.)
Sounds good, I'll update here in a separate PR and update the other related PRs.
what range of lengths would you consider a small genome to switch to in-lining the root sequence?
I haven't considered the actual length of the sequences, but I'm basing this decision on the size the existing root-sequence.json
on S3.
$ aws s3 ls --human-readable s3://nextstrain-data/zika
2024-04-13 09:17:04 162.9 KiB zika.json
2023-05-19 18:48:18 58.7 KiB zika_entropy.json
2023-05-19 18:48:18 6.7 KiB zika_meta.json
2024-03-04 16:19:20 5.7 KiB zika_root-sequence.json
2023-05-19 18:48:18 44.5 KiB zika_sequences.json
2023-05-19 18:48:18 186.5 KiB zika_tree.json
The 5 KiB root-sequence.json is pretty negligible when the main Auspice JSON is only 163 KiB. Nextstrain datasets are limited by the 500MB memory cap in Chrome, so we'd be fine adding the root sequence inline.
Description of proposed changes
Explicitly state that the root-sequence.json file is an expected output of the core phylogenetic workflow.
This also ensures that the Nextstrain automation rule
deploy_all
will include the root-sequence.json in the upload.Related issue(s)
Follow up to https://github.com/nextstrain/zika/pull/51
Checklist