resolve discrepancies with Neher method

jbloom commented 1 year ago

@rneher compared the method here with his own. When both are compared against deep mutational scanning data, the method here seems to work better, probably because it has better synonymous rate estimates and a much larger sequence data set. But he noticed discrepancies at a few sites, as noted below. Investigate these to make sure they don't indicate some larger problem:

The most glaring outlier is L5F (CTT -> TTT). This mutation happens frequently (https://nextstrain.org/ncov/gisaid/global/all-time?c=gt-S_5), but its count in your list is 0, hence the strong de-enrichment. The situation is similar for mutations in codon 367. The main outlier in the other direction is H1118Y. But this is probably because the H at this site is part of Alpha, and Alpha doesn't have a lot of sub-pango lineages, so my method doesn't really pick anything up.
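To illustrate why an observed count of 0 shows up as a strong outlier, here is a minimal sketch. It assumes the estimate is a log ratio of observed to expected counts with a small pseudocount; the function name, the pseudocount of 0.5, and the counts are all hypothetical and are not taken from the repository.

```python
import math


def log_enrichment(actual_count, expected_count, pseudocount=0.5):
    """Natural log of (actual + P) / (expected + P).

    Assumed form of the estimate, for illustration only.
    """
    return math.log((actual_count + pseudocount) / (expected_count + pseudocount))


# Hypothetical counts: a mutation expected about 200 times on the tree.
zero_count = log_enrichment(0, 200)    # count is 0, e.g. because the site is not called
typical = log_enrichment(180, 200)     # observed roughly as often as expected

print(f"count 0:   {zero_count:.2f}")
print(f"count 180: {typical:.2f}")
```

Under this assumption, a mutation that is common in circulation but recorded with a count of 0 gets a strongly negative value even though it is not actually deleterious.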

jbloom commented 1 year ago

@rneher, would you be able to point me to the file with your estimates that you used to make the correlations between the UShER fitness estimates and your estimates?

Now that I understand that UShER is masking some sites, I am fixing those in my estimates and want to see to what extent that eliminates the outliers that you flagged. I figured it would be easiest if I could just run the comparison myself rather than keep having to ask you, but I wasn't sure which file to use for your estimates.
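A rough sketch of the check I have in mind: drop mutations at masked sites from the UShER-based estimates, then see which mutations still disagree most with the other set. The helper names, the numbers, and the set of masked sites below are all made up for illustration; they are not real estimates or the actual UShER mask.

```python
def site_of(mutation):
    """Return the integer site from a mutation string such as 'L5F'."""
    return int(mutation[1:-1])


def drop_masked(estimates, masked_sites):
    """Keep only mutations whose site is not in masked_sites."""
    return {m: v for m, v in estimates.items() if site_of(m) not in masked_sites}


def largest_discrepancies(a, b, n=3):
    """Mutations present in both dicts, sorted by absolute difference (largest first)."""
    shared = set(a) & set(b)
    return sorted(shared, key=lambda m: abs(a[m] - b[m]), reverse=True)[:n]


# Hypothetical values, for illustration only.
usher_estimates = {"L5F": -6.0, "V367F": -5.5, "H1118Y": 1.5, "A222V": 0.1}
other_estimates = {"L5F": 0.5, "V367F": 0.2, "H1118Y": -0.3, "A222V": 0.0}
masked_sites = {5, 367}  # assumed mask, not the real one

print(largest_discrepancies(usher_estimates, other_estimates, n=2))
filtered = drop_masked(usher_estimates, masked_sites)
print(largest_discrepancies(filtered, other_estimates, n=1))
```

If the flagged outliers disappear after filtering, masking explains them; anything that remains needs a different explanation.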

jbloom commented 1 year ago

Fixed in #6; sites like L5F were actually masked in the UShER tree.

jbloomlab / SARS2-mut-fitness

resolve discrepancies with Neher method #3