hodcroftlab / covariants

Real-time updates and information about key SARS-CoV-2 variants, plus the scripts that generate this information.
https://covariants.org/
GNU Affero General Public License v3.0
317 stars 111 forks source link

Try to stop outputting files if not displayed or built #352

Open emmahodcroft opened 2 years ago

emmahodcroft commented 2 years ago

Continuing to output data files in cluster_tables for variants that were never displayed ("type": "do_not_display") and that are no longer being built can cause confusion. As there's no straightforward way to monitor these variants (they aren't part of any graphs, and there's no longer a Nextstrain build), and as many of those that are do_not_display are often transient builds that I make to monitor something for a while, there's a chance that something can go wrong. Most obviously, the list of mutations that define these (many of these predate being able to use Pango, or are not monophyletic and so don't have a Pango) pops up in another cluster later, making it look like the variant is having a resurgence, even though it's a 'SNP bug'.

To avoid this type of confusion, we should stop outputting fresh counts for variants that aren't displayed (checkable in graphs) and no longer being built (checkable in the build). This PR tries to do this without impacting other variants.

After merging, to make this fully functional, I should go through and edit the files for which these conditions apply, and cut off their 'counts' stop at the point they stopped being graphed... if I can (that might be too much editing).

vercel[bot] commented 2 years ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated
covariants ✅ Ready (Inspect) Visit Preview Oct 26, 2022 at 9:49AM (UTC)
MoiraZuber commented 2 years ago

The code seems to work properly, my test-run ran without errors.

One thought about this solution though: I think having these clusters in clus_to_run but removing them from certain outputs might cause unexpected behavior that would be difficult to ascertain. This may be a naive question, but is there any reason to consider these clusters at all in our script? Would just removing them from clus_to_run at the very beginning of the script work as well? Then they would not ever be considered for any step of the analysis.

emmahodcroft commented 2 years ago

Actually that's a very good point. And in that vein, I'm not even sure why the short answer isn't just to comment out these variants entirely.............

emmahodcroft commented 2 years ago

My guess would be that I get to a point where I'm turning things off so individually (maybe one part one week, then another the next week) that I don't release I've essentially 'removed' the variant/cluster entirely and it shouldn't do anything anymore. That's why I don't realise that I could just delete/comment out the entire thing.

Given that, having a protection in the code that picks this up at the start and essentially 'disregards' these 'pointless' clades would be a good solution - and maybe a message at the end to say "Hey, this isn't doing anything anymore! why don't you just comment it out!"

MoiraZuber commented 2 years ago

I reverted the last changes and instead added an early check for unused clusters. The output currently looks like this:

These clusters will be run:  ['Alpha', 'Beta', 'Gamma', 'Delta', '21I.Delta', '21J.Delta', '21K.Omicron', '21L.Omicron', '21M.Omicron', '21K.21L', 'Omicron', '22A', '22B', '22C', '22D', '22E', '22B22E', 'Kappa', 'Delta417', 'Eta', 'Iota', '21GLambda', '21H', '20BS732', '20AS126', 'EU1', 'Epsilon', 'S439', 'S677HRobin1', 'S677PPelican', 'EU2', 'S98', 'S80', 'S626', 'S1122', 'S501', 'S484', 'S69', 'S677', 'S453', 'S477mut', 'DanishCluster', 'S18', 'S144', 'S417', 'S655', 'S681', 'ORF1aS3675', 'S677HRobin2', 'S677HYellowhammer', 'S677RRoadrunner', 'S677HHeron', 'S677HBluebird', 'S677HQuail', 'S677HMockingbird', 'Delta.299I', 'Delta.250I', 'Delta.145H', 'Delta.ORF1a3059F', 'S613', 'S_222', 'S145', 'S572']

Remark: cluster 21M.Omicron has type 'do_not_display' and nextstrain_build=False -> Will be automatically removed from this run (maybe consider commenting out this cluster?)
Remark: cluster Delta417 has type 'do_not_display' and nextstrain_build=False -> Will be automatically removed from this run (maybe consider commenting out this cluster?)
Remark: cluster DanishCluster has type 'do_not_display' and nextstrain_build=False -> Will be automatically removed from this run (maybe consider commenting out this cluster?)
Remark: cluster Delta.299I has type 'do_not_display' and nextstrain_build=False -> Will be automatically removed from this run (maybe consider commenting out this cluster?)
Remark: cluster Delta.145H has type 'do_not_display' and nextstrain_build=False -> Will be automatically removed from this run (maybe consider commenting out this cluster?)

These clusters will be run:  ['Alpha', 'Beta', 'Gamma', 'Delta', '21I.Delta', '21J.Delta', '21K.Omicron', '21L.Omicron', '21K.21L', 'Omicron', '22A', '22B', '22C', '22D', '22E', '22B22E', 'Kappa', 'Eta', 'Iota', '21GLambda', '21H', '20BS732', '20AS126', 'EU1', 'Epsilon', 'S439', 'S677HRobin1', 'S677PPelican', 'EU2', 'S98', 'S80', 'S626', 'S1122', 'S501', 'S484', 'S69', 'S677', 'S453', 'S477mut', 'S18', 'S144', 'S417', 'S655', 'S681', 'ORF1aS3675', 'S677HRobin2', 'S677HYellowhammer', 'S677RRoadrunner', 'S677HHeron', 'S677HBluebird', 'S677HQuail', 'S677HMockingbird', 'Delta.250I', 'Delta.ORF1a3059F', 'S613', 'S_222', 'S145', 'S572']

@emmahodcroft Does this look reasonable to you?