joverlee521 opened 1 month ago
I'm a little worried about excluding these types of records algorithmically without any notification to us. Ideally, we want to catch these data issues, alert the data provider so they can fix the records, and update our records to use the correct metadata. Another approach might be to make a QC report that runs weekly (on new data only?) with checks for this kind of issue plus Nextclade QC statuses, failed alignments, etc.
If data providers can't or won't update their records, the outlier file approach seems reasonable.
> Another approach might be to make a QC report that runs weekly (on new data only?) with checks for this kind of issue plus Nextclade QC statuses, failed alignments, etc.
Ah that would be nice! Seems like something we can add to the upload + Nextclade workflows.
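A minimal sketch of what such a weekly QC report could look like, assuming metadata joined with Nextclade results in one table (the column names `date`, `date_submitted`, and `qc_overall_status` are assumptions here, not the actual schema):

```python
import pandas as pd
from io import StringIO

# Hypothetical joined metadata + Nextclade results; real column names may differ.
records = pd.read_csv(StringIO(
    "strain\tdate\tdate_submitted\tqc_overall_status\n"
    "A/example/1/2024\t2024-01-05\t2024-01-05\tgood\n"
    "A/example/2/2024\t2023-12-20\t2024-01-10\tbad\n"
    "A/example/3/2024\t2023-11-02\t2024-01-03\tgood\n"
), sep="\t")

# Each named check is a boolean mask over the records; more checks
# (failed alignments, etc.) could be appended to this dict.
checks = {
    "date_equals_date_submitted": records["date"] == records["date_submitted"],
    "nextclade_qc_bad": records["qc_overall_status"] == "bad",
}

# One row per (strain, failed check) for the weekly report.
report = pd.concat(
    [records.loc[mask, ["strain"]].assign(check=name) for name, mask in checks.items()],
    ignore_index=True,
)
print(report)
```

Running this weekly on only the newly ingested records would keep the report small enough to act on.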
> If data providers can't or won't update their records, the outlier file approach seems reasonable.
With the outlier file approach, I feel like we never go back to check if the "outliers" have been fixed. I guess if we implement the QC report, it can flag sequences that have been fixed and can be removed from the outlier files.
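The "flag fixed outliers" idea could be sketched like this, assuming an outliers list of strain names and a metadata table with `date` and `date_submitted` columns (all names below are invented for illustration):

```python
import pandas as pd
from io import StringIO

# Hypothetical current metadata; the second record has been fixed upstream.
metadata = pd.read_csv(StringIO(
    "strain\tdate\tdate_submitted\n"
    "A/example/1/2024\t2024-01-05\t2024-01-05\n"
    "A/example/2/2024\t2023-12-20\t2024-01-10\n"
), sep="\t")

# Hypothetical outliers file: strains we previously excluded.
outliers = {"A/example/1/2024", "A/example/2/2024"}

# A strain still fails the check if date == date_submitted.
still_bad = set(metadata.loc[metadata["date"] == metadata["date_submitted"], "strain"])

# Outliers that no longer fail the check can be removed from the outliers file.
fixed = sorted(outliers - still_bad)
print(fixed)  # → ['A/example/2/2024']
```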
Context

@huddlej flagged sequences with unusual collection dates on Slack, where `date == date_submitted`. We should exclude these sequences from the builds because this is a clear metadata issue.

Possible solutions

- Add `(date != date_submitted)` to all of the filter queries across all configs
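Since `augur filter --query` applies a pandas-style query to the metadata, the effect of that expression can be sketched directly in pandas (strain names and data below are invented for illustration):

```python
import pandas as pd
from io import StringIO

# Hypothetical metadata: one plausible record and one with the
# date == date_submitted issue described above.
metadata = pd.read_csv(StringIO(
    "strain\tdate\tdate_submitted\n"
    "A/good/1/2024\t2023-12-20\t2024-01-10\n"
    "A/bad/1/2024\t2024-01-05\t2024-01-05\n"
), sep="\t")

# Equivalent of passing --query "(date != date_submitted)" to augur filter:
kept = metadata.query("(date != date_submitted)")
print(kept["strain"].tolist())  # → ['A/good/1/2024']
```

The downside, as noted above, is that this silently drops the records rather than surfacing them for follow-up with the data provider.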