lpipes / SARS_CoV_2_wastewater_surveillance

Method for estimating relative proportions of SARS-CoV-2 strains from wastewater samples
GNU General Public License v3.0

Imputing gaps? #7

Open Ellmen opened 2 years ago

Ellmen commented 2 years ago

This is a nice method! I'm a master's student at the University of Waterloo working on wastewater surveillance in Canada.

I wasn't able to get the tree imputation working, but the common_allele method doesn't treat "-" as a known character, so it imputes gaps. This is probably undesirable: some Omicron and Delta sequences have several deletions, which are imputed to the most common base and then read as mismatches.
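The effect described above can be illustrated with a minimal Python sketch. This is not the repository's code: the function name `impute_common_allele` and its `known` parameter are hypothetical, and the sketch assumes an uppercase alignment with equal-length sequences. Each column is filled with its most common known character, so whether "-" counts as known decides whether deletions survive.

```python
from collections import Counter

def impute_common_allele(seqs, known="ACGT"):
    """Replace every character not in `known` with the column's most common
    known character. Hypothetical helper, not the tool's implementation."""
    columns = []
    for col in zip(*seqs):
        counts = Counter(c for c in col if c in known)
        if counts:
            fill = counts.most_common(1)[0][0]
            columns.append([c if c in known else fill for c in col])
        else:
            # No known character in this column: leave it unchanged.
            columns.append(list(col))
    # Transpose the columns back into row-wise sequences.
    return ["".join(row) for row in zip(*columns)]

alignment = ["ACGT", "ACGT", "A-GT", "ANGT"]

# Gaps treated as missing data: the deletion in the third sequence is filled in.
gaps_imputed = impute_common_allele(alignment, known="ACGT")

# Gaps treated as a fifth character: the deletion is kept, only N is imputed.
gaps_kept = impute_common_allele(alignment, known="ACGT-")
```

With `known="ACGT"` the third sequence becomes `ACGT`; with `known="ACGT-"` it stays `A-GT` while the `N` in the fourth sequence is still imputed.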

lpipes commented 2 years ago

Can you post your tree file? The program doesn't handle multi-line FASTAs; it expects each sequence on a single line. It also can't handle internal node names, so it's best to remove them. Finally, every leaf in your tree should correspond to a sequence in the MSA.
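The three input requirements above could be checked before running the tool with something like the following Python sketch. It is an assumption-laden illustration, not part of the program: the helper names (`unwrap_fasta`, `strip_internal_labels`, `leaf_names`) are hypothetical, and the regular expressions assume an unquoted Newick tree whose labels contain no parentheses, colons, commas, semicolons, or whitespace. Note that stripping internal labels this way also drops support values stored as labels.

```python
import re

def unwrap_fasta(text):
    """Return (header, sequence) pairs with wrapped sequence lines joined."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def strip_internal_labels(newick):
    """Remove any label that directly follows a closing parenthesis."""
    return re.sub(r"\)[^():,;]+", ")", newick)

def leaf_names(newick):
    """Collect labels that directly follow '(' or ',' (the leaves)."""
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))

fasta = ">s1\nACGT\nAC-T\n>s2\nACGTACGT\n>s3\nACGTACGA\n"
tree = "((s1:0.1,s2:0.2)n1:0.3,s3:0.4)root;"

records = unwrap_fasta(fasta)
clean_tree = strip_internal_labels(tree)
# Leaves that have no matching sequence in the alignment.
missing = leaf_names(clean_tree) - {header for header, _ in records}
```

An empty `missing` set means every leaf has a sequence in the alignment; writing each record back out as `>header` followed by one sequence line gives the single-line layout.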

Ellmen commented 2 years ago

I created a pull request to support multi-line FASTAs if you're interested. I didn't build a tree; I just used the common_allele method. Do you think it would make sense to add gaps as a fifth character (in addition to ACGT)?

lpipes commented 2 years ago

The tree imputation is orders of magnitude more accurate than the common allele imputation. I only included the common allele imputation as an extra option because we used it in our manuscript. Gaps are part of the missing data: gaps and any other ambiguous (non-ACGT) character are what is being imputed. If you post the error that you're getting, or a small sample file to test, I can try to troubleshoot for you.