nextstrain / ncov

Nextstrain build for novel coronavirus SARS-CoV-2
https://nextstrain.org/ncov
MIT License
1.35k stars 403 forks source link

Discussion: Do we still need the `snp_clusters` param for `diagnostics` rule? #1017

Open joverlee521 opened 1 year ago

joverlee521 commented 1 year ago

Context

@huddlej's response to a discussion question prompted me to take a closer look at the diagnostics rule/script. I wanted to understand how the hard-coded params were used in the script: https://github.com/nextstrain/ncov/blob/44ab71ab3bace68c59bd5c688587e6e30a4a4fc4/workflow/snakemake_rules/main_workflow.smk#L536-L539

I saw that the snp_clusters param is only relevant when there is a snp_clusters column within the input metadata file.

https://github.com/nextstrain/ncov/blob/44ab71ab3bace68c59bd5c688587e6e30a4a4fc4/scripts/diagnostic.py#L84-L87

This input metadata file is generated by the join-metadata-and-clades script, which does not add a snp_clusters column. Only the following columns from Nextclade are included in the metadata file:

https://github.com/nextstrain/ncov/blob/44ab71ab3bace68c59bd5c688587e6e30a4a4fc4/scripts/join-metadata-and-clades.py#L19-L41

It seems like the snp_clusters param is unused unless users add the column to their own metadata outside the workflow. snp_clusters used to be included as a column in the ncov-ingest produced metadata file, but it has been removed. I think the presence of an unused param here can cause confusion for users (it definitely confused me!).

Possible solution

I'm not familiar with the diagnosis rule so I wanted to ask what would be an appropriate action here.

  1. Add a comment to the diagnostics rule that the snp_clusters is only kept for backwards compatibility, but it is not used in the latest version of the workflow.
  2. Just remove the snp_clusters param from the diagnostics rule.
  3. Update the diagnostics script to check QC_snp_clusters != "bad" instead of the number of SNP clusters.
j23414 commented 1 year ago

I lean toward possible solution 2, but I'm probably missing the historical context.

emmahodcroft commented 1 year ago

My guess is that this was removed when we tried to pare down what we were bringing in from Nextclade to keep it to only things that were of most use. I imagine it won't come back and it's hard for me to imagine that users are supplying their own value here. However @rneher & @corneliusroemer might be able to provide a bit more context just in case we're missing anything.

corneliusroemer commented 1 year ago

This predates me, I can't really comment here.