szhan opened 1 month ago
I'll take a look at the breakdown of samples filtered out at varying values of the maximum N threshold. I went with 800 Ns initially because that is roughly two amplicons (e.g. if the terminal amplicons drop out). Maybe it's tossing out too many samples.
I've been filtering out the samples before importing the alignments. It is probably better to implement a simple filter based on the number of Ns to drop samples during inference.
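A minimal sketch of such an N-count filter. The function name `passes_n_filter` and the array representation of an alignment are my assumptions here, not part of the sc2ts API:

```python
import numpy as np

MAX_N = 800  # roughly two dropped amplicons

def passes_n_filter(alignment, max_n=MAX_N):
    """Return True if the alignment has at most max_n N characters.

    Assumes the alignment is a 1D array of single-character bases;
    the actual representation inside sc2ts may differ.
    """
    return int(np.count_nonzero(alignment == "N")) <= max_n

alignment = np.array(list("ACGT" * 200 + "N" * 5))
print(passes_n_filter(alignment))  # True: only 5 Ns
```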
Agreed - let's keep as much of the filtering and data pre-processing logic within sc2ts as we can
Filtering samples by the imported sequence alignments would involve grabbing the alignment from the alignment store and then processing it in `preprocess_and_match_alignments`. This would require an additional pass over the Sample objects, I think, because the genotype matrix that goes into HMM matching is assembled up front.
Or maybe keep a boolean array to track which samples pass the filters, and then use it to subset the genotype matrix before input to HMM matching.
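A sketch of the boolean-mask idea with NumPy. The shapes are hypothetical (sites as rows, samples as columns); sc2ts may orient the genotype matrix differently:

```python
import numpy as np

# Hypothetical genotype matrix: num_sites x num_samples.
rng = np.random.default_rng(42)
genotypes = rng.integers(0, 4, size=(6, 5))

# One boolean per sample, recording whether it passed the QC filters.
keep = np.array([True, False, True, True, False])

# Subset the sample columns before handing the matrix to HMM matching.
filtered = genotypes[:, keep]
print(filtered.shape)  # (6, 3)
```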
> Filtering samples by the imported sequence alignments would involve grabbing the alignment from alignment store and then processing it in preprocess_and_match_alignments. This would require an additional pass over the Sample objects I think, because the genotype matrix which goes into HMM matching is preset.
That's OK I think - we can easily break `preprocess_and_match_alignments` into steps, or add some complexity where we only pass alignments that meet QC requirements on to the matching step.
Also, a number of entries in the metadata file do not have full-precision dates. I have been filtering them out before importing the metadata. It would be better if this, too, were done within sc2ts.
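A sketch of what a full-precision date check could look like, using the standard library. The helper name and the assumption that dates are `YYYY-MM-DD` strings are mine; the actual metadata field format may differ:

```python
import datetime

def has_full_precision_date(date_string):
    """True only for exact YYYY-MM-DD strings.

    Entries with year- or month-only precision (e.g. "2021" or
    "2021-03") fail to parse and are rejected.
    """
    try:
        datetime.datetime.strptime(date_string, "%Y-%m-%d")
        return True
    except ValueError:
        return False

print(has_full_precision_date("2021-03-15"))  # True
print(has_full_precision_date("2021-03"))     # False
```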
It seems that both of these filters (and any other filters on the metadata and alignments) can be done in `preprocess_and_match_alignments`. Or we could refactor it into `preprocess_samples` and `match_alignments`, where we can implement the filters.
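The proposed split might look like the sketch below. Everything here is a hypothetical stand-in — the dict-based sample records, the alignment-store-as-dict, and the bodies of `preprocess_samples` and `match_alignments` are not the sc2ts data model:

```python
import datetime

def is_full_precision(date_string):
    # Accept only exact YYYY-MM-DD dates (assumed format).
    try:
        datetime.datetime.strptime(date_string, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def preprocess_samples(samples, alignment_store, max_n=800):
    """Keep only samples passing both QC filters (sketch)."""
    return [
        s for s in samples
        if alignment_store[s["strain"]].count("N") <= max_n
        and is_full_precision(s["date"])
    ]

def match_alignments(samples):
    """Stand-in for the HMM matching step."""
    return [s["strain"] for s in samples]

store = {"s1": "ACGTN", "s2": "N" * 1000, "s3": "ACGT"}
samples = [
    {"strain": "s1", "date": "2021-03-15"},
    {"strain": "s2", "date": "2021-03-16"},  # too many Ns
    {"strain": "s3", "date": "2021-03"},     # date not full precision
]
print(match_alignments(preprocess_samples(samples, store)))  # ['s1']
```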
Hmm, actually, about the full-precision dates from the metadata: I don't think `get` compares entries as date objects. It just compares dates in the form of strings, so I don't think it needs to be modified.
Before doing runs, I have been filtering out samples in the Viridian dataset based on two criteria: (1) having a full-precision collection date, and (2) having at most 800 Ns (excluding gaps) in the aligned consensus sequence (i.e. disregarding insertions). A better approach would be to exclude problematic sites before applying the maximum-N criterion.
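One way to sketch that ordering: mask out the problematic positions first, then count Ns only over the remaining sites, so Ns at excluded sites no longer count against the 800-N threshold. The positions below are illustrative, not the real problematic-sites list:

```python
import numpy as np

def n_count_excluding(alignment, exclude):
    """Count Ns in the alignment, ignoring the excluded positions."""
    keep = np.ones(len(alignment), dtype=bool)
    keep[exclude] = False
    return int(np.count_nonzero(alignment[keep] == "N"))

alignment = np.array(list("ACNGTNAN"))
# Ns sit at indices 2, 5, and 7; excluding sites 2 and 5 leaves one.
print(n_count_excluding(alignment, np.array([2, 5])))  # 1
```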