nextstrain / dengue

Nextstrain build for dengue virus
https://nextstrain.org/dengue

The `phylogenetic` github action cannot build from staged `ingest` data during dev-branch testing #49

Closed: j23414 closed this issue 2 months ago

j23414 commented 2 months ago

Current Behavior

The phylogenetic GitHub action (see this run) ignored the provided sequence and metadata URL configurations:

env:
    TRIAL_NAME: serogeno20240518
    SEQUENCES_URL: s3://nextstrain-data/files/workflows/dengue/trials/serogeno20240518/sequences_all.fasta.zst
    METADATA_URL: s3://nextstrain-data/files/workflows/dengue/trials/serogeno20240518/metadata_all.tsv.zst

These inputs are defined in the workflow file under .github/workflows: https://github.com/nextstrain/dengue/blob/e901a30436113df8621d8237bffd161a5bc9256e/.github/workflows/phylogenetic.yaml#L33-L42

However, they are not being used in the phylogenetic rule: https://github.com/nextstrain/dengue/blob/e901a30436113df8621d8237bffd161a5bc9256e/phylogenetic/rules/prepare_sequences.smk#L24-L25

Expected Behavior

The phylogenetic GitHub action should accept sequences and metadata from the specified URLs, especially when testing features on dev branches. These datasets are often generated by the ingest GitHub action, and they should be a configurable, optional input during feature testing.
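To make "configurable, optional input" concrete, here is a minimal Python sketch of the fallback logic: use the URL from the environment when one is set, otherwise fall back to a default. The function name `resolve_input_url` and the default URL are hypothetical placeholders for illustration, not anything that exists in the dengue repository.

```python
import os

# Hypothetical default location, used only for illustration.
DEFAULT_SEQUENCES_URL = "https://example.org/default/sequences_all.fasta.zst"


def resolve_input_url(name, default, env=None):
    """Return the URL stored under `name` in `env`, or `default` if unset or blank.

    `env` defaults to the process environment, which is where a GitHub
    Actions `env:` block would place SEQUENCES_URL and METADATA_URL.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    return value or default


# Falls back to the default unless SEQUENCES_URL is set in the environment.
sequences_url = resolve_input_url("SEQUENCES_URL", DEFAULT_SEQUENCES_URL)
```

With this shape, a dev-branch run that sets SEQUENCES_URL would pick up the staged ingest data, while a run that leaves it unset would keep the default behavior.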

Possible Solution(s)

Consider implementing changes similar to those in the Zika repository, with the addition of expansion across serotypes (all, denv1, denv2, denv3, denv4).

Ideally, we could then provide a config similar to:

env:
    TRIAL_NAME: serogeno20240518
    SEQUENCES_URL: https://data.nextstrain.org/files/workflows/dengue/trials/serogeno20240518/sequences_{serotype}.fasta.zst
    METADATA_URL: https://data.nextstrain.org/files/workflows/dengue/trials/serogeno20240518/metadata_{serotype}.tsv.zst

The {serotype} placeholder could then be expanded across all serotypes. Otherwise, solving this issue might involve defining multiple sets of SEQUENCES_DENVX_URL and METADATA_DENVX_URL fields, which would be tedious when testing dev branches. Alternatively, consider splitting phylogenetic into separate workflows (or workflow calls from a main workflow), one per serotype (phylogenetic_denv1 to phylogenetic_denv4). Open to discussion or suggestions.
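As a rough illustration of the templated-URL option, the sketch below expands a single URL containing a {serotype} placeholder into one URL per serotype. The helper name `expand_serotype_urls` is hypothetical, and the serotype list is taken from the one proposed in this issue; how the workflow would actually consume the result is left open.

```python
# Serotype names as listed in this issue.
SEROTYPES = ["all", "denv1", "denv2", "denv3", "denv4"]


def expand_serotype_urls(template, serotypes=SEROTYPES):
    """Return a {serotype: url} dict by filling the {serotype} placeholder.

    Raises ValueError if the template has no placeholder, so that a fixed
    URL (for example one ending in sequences_all.fasta.zst) is not silently
    reused for every serotype.
    """
    placeholder = "{serotype}"
    if placeholder not in template:
        raise ValueError(f"URL template has no {placeholder} placeholder: {template}")
    return {s: template.replace(placeholder, s) for s in serotypes}


sequence_urls = expand_serotype_urls(
    "https://data.nextstrain.org/files/workflows/dengue/trials/"
    "serogeno20240518/sequences_{serotype}.fasta.zst"
)
```

One template per input keeps the env block at two URL fields regardless of how many serotypes are built, which is the main advantage over per-serotype SEQUENCES_DENVX_URL fields.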