nextstrain / forecasts-ncov

SARS-CoV-2 variant growth rates and frequency forecasts
https://nextstrain.org/sars-cov-2/forecasts/

Add QC for clade sequence counts #100

Open joverlee521 opened 4 months ago

joverlee521 commented 4 months ago

Context

During the forecasts-ncov meeting on 2024-06-03, @marlinfiggins flagged an issue in the all-time analysis: it contains JN.1 sequences dated to 2020.

Description

Sequence counts for a clade/lineage that are dated earlier than its first appearance date should be excluded.

Two potential solutions

  1. Use ncov's exclude.txt to exclude known outliers from the counts. However, this only captures a subset of outliers, because exclude.txt is only updated based on the results of small subsampled trees.
  2. Similar to covariants, add a list of first-appearance dates for clades/lineages and automatically exclude counts dated earlier than the first date.
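
A rough sketch of what option 2 could look like, to make the idea concrete. Everything here is an assumption for illustration: the row keys (`location`, `clade`, `date`, `sequences`), the `filter_counts` helper, and the `FIRST_DATES` mapping are hypothetical, and the JN.1 date is a placeholder rather than a curated value.

```python
from datetime import date

# Hypothetical first-appearance dates, keyed by clade/lineage.
# The value below is an illustrative placeholder, NOT a curated date.
FIRST_DATES = {
    "JN.1": date(2023, 1, 1),
}


def filter_counts(rows, first_dates):
    """Split count rows into (kept, dropped).

    A row is dropped when its clade has a known first date and the row's
    date is earlier than that. Clades without an entry are kept as-is.
    Assumes each row is a dict with "clade" and an ISO "date" (YYYY-MM-DD).
    """
    kept, dropped = [], []
    for row in rows:
        first = first_dates.get(row["clade"])
        observed = date.fromisoformat(row["date"])
        if first is not None and observed < first:
            dropped.append(row)
        else:
            kept.append(row)
    return kept, dropped


if __name__ == "__main__":
    example_rows = [
        {"location": "USA", "clade": "JN.1", "date": "2020-03-01", "sequences": "2"},
        {"location": "USA", "clade": "JN.1", "date": "2024-01-15", "sequences": "120"},
    ]
    kept, dropped = filter_counts(example_rows, FIRST_DATES)
    # Report what was excluded so the QC step stays auditable.
    for row in dropped:
        print(f"excluded {row['clade']} {row['date']} ({row['sequences']} sequences)")
```

Returning the dropped rows rather than silently discarding them would make it possible to log or review what the QC step removed.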