hodcroftlab / covariants

Real-time updates and information about key SARS-CoV-2 variants, plus the scripts that generate this information.
https://covariants.org/
GNU Affero General Public License v3.0

add check for wrong dates at beginning of script #276

Closed MoiraZuber closed 2 years ago

MoiraZuber commented 2 years ago

Add a check for dates that are before the estimated start date for each clade. This check is located at the beginning of the script (just after cleaning metadata) and only checks sequences with official Nextstrain clades.
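A minimal sketch of this kind of check, assuming a pandas metadata DataFrame; the clade names, start dates, and column names here are illustrative, not the actual CoVariants code:

```python
import pandas as pd

# Hypothetical estimated start dates per Nextstrain clade (illustrative values)
clade_start_dates = {
    "21K (Omicron)": pd.Timestamp("2021-11-01"),
    "20I (Alpha, V1)": pd.Timestamp("2020-09-01"),
}

def find_bad_dates(metadata: pd.DataFrame) -> dict:
    """Return, per clade, the rows whose date precedes the clade's estimated start."""
    alerts = {}
    # Unparseable dates become NaT and never compare as "too early"
    dates = pd.to_datetime(metadata["date"], errors="coerce")
    for clade, start in clade_start_dates.items():
        mask = (metadata["Nextstrain_clade"] == clade) & (dates < start)
        if mask.any():
            alerts[clade] = metadata.loc[mask, ["strain", "date"]]
    return alerts
```

Running this right after metadata cleaning, and only on sequences with an official Nextstrain clade assignment, surfaces the bad dates before the long part of the run starts.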

vercel[bot] commented 2 years ago

This pull request is being automatically deployed with Vercel.
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/hodcroftlab/covariants/GyCdGepLTLJxsnzRGJFVn9PXV3q6
✅ Preview: https://covariants-git-fastdatecheck-hodcroftlab.vercel.app

emmahodcroft commented 2 years ago

I've just run this - it's so great to be getting the bad Omicrons up front!!

Just a couple of thoughts - first, the line that flashes up suggests viewing the affected sequences with

To view, use 'alert_first_date_quick[clus][['strain','date']]'

This worked for me - but I'd hope to be able to use the two new functions you added at lines 131 and 136, print_date_alerts_quick(clus) and print_all_date_alerts_quick(), particularly as they show the 'expected date'. However, when I tried to use them, I got: KeyError: "['gisaid_epi_isl'] not in index"

After playing a little bit, I think I know the issue 😉 With your other PR, we removed all the columns we don't use - and I also tested this! However, what I don't think I did was try to print out any bad dates - printing them tries to access the column gisaid_epi_isl, which I bet we don't use anywhere else 😄

So, I think the solution here is to add gisaid_epi_isl to the list of columns we read in with usecols!
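For context, the fix is just to make sure the extra column is listed when the metadata is read in. A small self-contained sketch (the TSV snippet and exact column set are illustrative; the real script reads the full metadata file):

```python
import io
import pandas as pd

# Illustrative metadata snippet in GISAID-style TSV format
tsv = (
    "strain\tdate\tgisaid_epi_isl\tNextstrain_clade\textra_col\n"
    "hCoV-19/x\t2021-12-01\tEPI_ISL_1\t21K (Omicron)\tfoo\n"
)

# gisaid_epi_isl must be listed in usecols; otherwise selecting it later
# (e.g. when printing the bad-date alerts) raises KeyError
cols = ["strain", "date", "gisaid_epi_isl", "Nextstrain_clade"]
metadata = pd.read_csv(io.StringIO(tsv), sep="\t", usecols=cols)
```

Columns not named in usecols (here, extra_col) are never loaded, which is what made the dropped-columns PR save memory in the first place.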

I've merged master into this branch so that it has this change to the usecols and I'll push this up now - if you re-pull the branch and do a quick run you can hopefully replicate the error and try a fix!

Second thought: I think we don't want to continue the run if we have bad dates after this first check - we need to fix them. This is a bit of a double-edged sword: if one forgets completely and lets the whole run finish, at least you'd have a 'full' list of bad dates at the end. But if you only forget for 10 minutes, you have to scroll up a long way to find the alert!

I think on balance, at least for now, it might be better to stop the run if bad dates are found. But it might be good to put this behind a flag we can turn off later, if the run gets fast enough that re-running after a missed alert is less of a big deal. What do you think?

emmahodcroft commented 2 years ago

PS - For the moment, I've gotten the list of bad sequences from testing this, but will switch back to master for today's run, so we can continue to work on this branch :)

MoiraZuber commented 2 years ago
1. Good point. I added the functions but then forgot about them, so I didn't test whether they actually worked. I ran it with gisaid_epi_isl in usecols and it worked!

2. This sounds like a good idea. I added a check for exiting the script. There's also an input asking about this at the beginning (after the usual questions), with the default being y ("yes" meaning "exit the program upon finding bad dates").
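A rough sketch of how such a prompt-plus-exit could look; the prompt wording and function names here are made up for illustration and don't match the script exactly:

```python
import sys

def ask_exit_on_bad_dates() -> bool:
    """Ask once at start-up; the default answer 'y' means exit on bad dates."""
    answer = input("Exit if bad dates are found? (y/n, default y): ").strip().lower()
    return answer != "n"

def check_dates(alerts: dict, exit_on_bad_dates: bool) -> None:
    """Stop the run early if any clade has bad dates and the flag is set."""
    if alerts and exit_on_bad_dates:
        print("Bad dates found - fix them before re-running.")
        sys.exit(1)
```

With the flag off, the run continues and the full list of bad dates can still be collected at the end, which matches the trade-off discussed above.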

emmahodcroft commented 2 years ago

This looks great Moira - thanks! I'll try to test it out tomorrow 😁

Also, good call on having an option to exit the script or not - didn't even think of that!!

emmahodcroft commented 2 years ago

Hi Moira - This is looking great!

I've added one small change: the alert message now suggests using the two new functions (e.g. print_all_date_alerts_quick()) instead of the longer alert_first_date_quick[clus][['strain','date']]. I'm running now, and I'll merge into master if all looks good :)

emmahodcroft commented 2 years ago

This is looking great, and makes the runs a lot more efficient - picking up bad dates early instead of at the end saves time on long re-runs!

I did a final test with the data from today's run, so I'll merge into master now and push live.