szhan opened 1 month ago
I'll take a look at the breakdown of samples filtered out at varying values of the maximum N threshold. I went with 800 Ns initially because that is roughly two amplicons (e.g. if the terminal amplicons drop out). Maybe it's tossing out too many samples.
I've been filtering out the samples before importing the alignments. It is probably better to implement a simple filter based on the number of Ns to drop samples during inference.
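A minimal sketch of such an N-count filter. The function name `passes_n_filter` and the array representation of an alignment are my assumptions here, not part of the sc2ts API:

```python
import numpy as np

MAX_N = 800  # roughly two dropped amplicons

def passes_n_filter(alignment, max_n=MAX_N):
    """Return True if the alignment has at most max_n N characters.

    Assumes the alignment is a 1D array of single-character bases;
    the actual representation inside sc2ts may differ.
    """
    return int(np.count_nonzero(alignment == "N")) <= max_n

alignment = np.array(list("ACGT" * 200 + "N" * 5))
print(passes_n_filter(alignment))  # True: only 5 Ns
```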
Agreed - let's keep as much of the filtering and data pre-processing logic within sc2ts as we can
Filtering samples by the imported sequence alignments would involve grabbing the alignment from the alignment store and then processing it in `preprocess_and_match_alignments`. This would require an additional pass over the Sample objects, I think, because the genotype matrix that goes into HMM matching is assembled up front.
Or maybe keep a boolean array to track which samples pass the filters, and then use it to subset the genotype matrix before input to HMM matching.
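A sketch of the boolean-mask idea with NumPy. The shapes are hypothetical (sites as rows, samples as columns); sc2ts may orient the genotype matrix differently:

```python
import numpy as np

# Hypothetical genotype matrix: num_sites x num_samples.
rng = np.random.default_rng(42)
genotypes = rng.integers(0, 4, size=(6, 5))

# One boolean per sample, recording whether it passed the QC filters.
keep = np.array([True, False, True, True, False])

# Subset the sample columns before handing the matrix to HMM matching.
filtered = genotypes[:, keep]
print(filtered.shape)  # (6, 3)
```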
> Filtering samples by the imported sequence alignments would involve grabbing the alignment from alignment store and then processing it in preprocess_and_match_alignments. This would require an additional pass over the Sample objects I think, because the genotype matrix which goes into HMM matching is preset.
That's OK I think - we can easily break `preprocess_and_match_alignments` into steps, or add some complexity where we only pass alignments that meet QC requirements on to the matching step.
Also, a number of entries in the metadata file do not have full-precision dates. I have been filtering them out before importing the metadata. It would be better if this, too, were done within sc2ts.
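A sketch of what a full-precision date check could look like, using the standard library. The helper name and the assumption that dates are `YYYY-MM-DD` strings are mine; the actual metadata field format may differ:

```python
import datetime

def has_full_precision_date(date_string):
    """True only for exact YYYY-MM-DD strings.

    Entries with year- or month-only precision (e.g. "2021" or
    "2021-03") fail to parse and are rejected.
    """
    try:
        datetime.datetime.strptime(date_string, "%Y-%m-%d")
        return True
    except ValueError:
        return False

print(has_full_precision_date("2021-03-15"))  # True
print(has_full_precision_date("2021-03"))     # False
```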
It seems that both of these filters (and any other filters on the metadata and alignments) can be done in `preprocess_and_match_alignments`. Or we could refactor it into `preprocess_samples` and `match_alignments`, where we can implement the filters.
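The proposed split might look like the sketch below. Everything here is a hypothetical stand-in — the dict-based sample records, the alignment-store-as-dict, and the bodies of `preprocess_samples` and `match_alignments` are not the sc2ts data model:

```python
import datetime

def is_full_precision(date_string):
    # Accept only exact YYYY-MM-DD dates (assumed format).
    try:
        datetime.datetime.strptime(date_string, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def preprocess_samples(samples, alignment_store, max_n=800):
    """Keep only samples passing both QC filters (sketch)."""
    return [
        s for s in samples
        if alignment_store[s["strain"]].count("N") <= max_n
        and is_full_precision(s["date"])
    ]

def match_alignments(samples):
    """Stand-in for the HMM matching step."""
    return [s["strain"] for s in samples]

store = {"s1": "ACGTN", "s2": "N" * 1000, "s3": "ACGT"}
samples = [
    {"strain": "s1", "date": "2021-03-15"},
    {"strain": "s2", "date": "2021-03-16"},  # too many Ns
    {"strain": "s3", "date": "2021-03"},     # date not full precision
]
print(match_alignments(preprocess_samples(samples, store)))  # ['s1']
```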
Hmm, actually, about the full-precision dates from the metadata: I don't think `get` compares entries as date objects. It just compares dates in the form of strings, so I don't think it needs to be modified.
Before doing runs, I have been filtering out samples in the Viridian dataset based on two criteria: (1) having a full-precision collection date, and (2) having at most 800 Ns (excluding gaps) in the aligned consensus sequence (i.e. disregarding insertions). A better approach would be to exclude problematic sites before applying the maximum-N criterion.
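One way to sketch that ordering: mask out the problematic positions first, then count Ns only over the remaining sites, so Ns at excluded sites no longer count against the 800-N threshold. The positions below are illustrative, not the real problematic-sites list:

```python
import numpy as np

def n_count_excluding(alignment, exclude):
    """Count Ns in the alignment, ignoring the excluded positions."""
    keep = np.ones(len(alignment), dtype=bool)
    keep[exclude] = False
    return int(np.count_nonzero(alignment[keep] == "N"))

alignment = np.array(list("ACNGTNAN"))
# Ns sit at indices 2, 5, and 7; excluding sites 2 and 5 leaves one.
print(n_count_excluding(alignment, np.array([2, 5])))  # 1
```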