nextstrain / seasonal-flu

Scripts, config, and snakefiles for seasonal-flu nextstrain builds

Exclude sequences with unusual collection dates #175

Open joverlee521 opened 1 month ago

joverlee521 commented 1 month ago

Context

On Slack, @huddlej flagged sequences with unusual collection dates, where date == date_submitted. We should exclude these sequences from the builds because this is a clear metadata issue.

Possible solutions

  1. Add (date != date_submitted) to all of the filter queries across all configs
  2. Add a new filter rule in the main workflow to exclude these sequences for all builds
  3. Add a new filter rule in the upload workflow to exclude these sequences from our S3 files
  4. Add specific sequences to outliers.txt (e.g. https://github.com/nextstrain/seasonal-flu/commit/8209b359af8941d947e78565db983f9610f2a1ac)
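
For illustration, the date != date_submitted condition from option 1 can be sketched with a pandas query. This is only a sketch under assumptions: the strain names and dates are made up, the helper split_by_suspicious_date is hypothetical, and the column names strain, date, and date_submitted are taken from the issue text rather than checked against the real metadata files.

```python
import io

import pandas as pd

# Hypothetical metadata; the strain names and dates are invented for this sketch.
METADATA_TSV = (
    "strain\tdate\tdate_submitted\n"
    "A/Example/1/2024\t2024-01-05\t2024-02-01\n"
    "A/Example/2/2024\t2024-02-10\t2024-02-10\n"
    "A/Example/3/2024\t2024-01-20\t2024-03-02\n"
)


def split_by_suspicious_date(metadata: pd.DataFrame):
    """Return (kept, flagged), where flagged rows have date == date_submitted."""
    flagged = metadata.query("date == date_submitted")
    kept = metadata.query("date != date_submitted")
    return kept, flagged


# Read every column as a string so the dates are compared exactly as written.
metadata = pd.read_csv(io.StringIO(METADATA_TSV), sep="\t", dtype=str)
kept, flagged = split_by_suspicious_date(metadata)

print("flagged:", flagged["strain"].tolist())
print("kept:", kept["strain"].tolist())
```

Returning the flagged rows alongside the kept ones means the suspicious records can be listed for review instead of being dropped silently.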

huddlej commented 1 month ago

I'm a little worried about excluding these types of records algorithmically without any notification to us. Ideally, we want to catch these data issues, alert the data provider so they can fix the records, and update our records to use the correct metadata. Another approach might be to make a QC report that runs weekly (on new data only?) with checks for this kind of issue plus Nextclade QC statuses, failed alignments, etc.

If data providers can't or won't update their records, the outlier file approach seems reasonable.

joverlee521 commented 1 month ago

> Another approach might be to make a QC report that runs weekly (on new data only?) with checks for this kind of issue plus Nextclade QC statuses, failed alignments, etc.

Ah that would be nice! Seems like something we can add to the upload + Nextclade workflows.

> If data providers can't or won't update their records, the outlier file approach seems reasonable.

With the outlier file approach, I feel like we never go back to check if the "outliers" have been fixed. I guess if we implement the QC report, it can flag sequences that have been fixed and can be removed from the outlier files.
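
A rough sketch of that outlier re-check, assuming the outlier file lists one strain name per line with optional # comments and that the metadata has unique strain names with date and date_submitted columns. The helpers read_outliers and recheck_outliers, the strain names, and the dates are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical current metadata; names and dates are invented for this sketch.
METADATA_TSV = (
    "strain\tdate\tdate_submitted\n"
    "A/Example/2/2024\t2024-02-10\t2024-02-10\n"
    "A/Example/4/2024\t2024-01-15\t2024-02-20\n"
)

# Hypothetical outlier list; the one-name-per-line format with # comments is an assumption.
OUTLIERS_TXT = (
    "# excluded because date == date_submitted\n"
    "A/Example/2/2024\n"
    "A/Example/4/2024\n"
    "\n"
    "A/Example/5/2024\n"
)


def read_outliers(handle):
    """Read strain names, skipping blank lines and # comments."""
    names = []
    for line in handle:
        name = line.split("#", 1)[0].strip()
        if name:
            names.append(name)
    return names


def recheck_outliers(metadata, outliers):
    """Classify each outlier as still flagged, fixed, or missing from the metadata."""
    indexed = metadata.set_index("strain")  # assumes strain names are unique
    report = {"still_flagged": [], "fixed": [], "missing": []}
    for strain in outliers:
        if strain not in indexed.index:
            report["missing"].append(strain)
        elif indexed.at[strain, "date"] == indexed.at[strain, "date_submitted"]:
            report["still_flagged"].append(strain)
        else:
            report["fixed"].append(strain)
    return report


metadata = pd.read_csv(io.StringIO(METADATA_TSV), sep="\t", dtype=str)
outliers = read_outliers(io.StringIO(OUTLIERS_TXT))
report = recheck_outliers(metadata, outliers)

for status, strains in report.items():
    print(status, strains)
```

Entries reported as fixed are candidates for removal from the outlier file, while missing entries may point to renamed or withdrawn records that need a manual look.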