Closed raikens1 closed 6 years ago
Think it's an issue with my gw_counts files. When I compute the counts myself with my new script site_counter.py, I get counts about 10% lower than what Varun gave me. Maybe he used a different region set than nc_regions to generate the files he passed along.
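For reference, a rough sketch of how that comparison could be scripted. This is not what site_counter.py does. The gw_counts file format isn't shown in this thread, so the inputs here are plain dicts, and the function name `relative_differences` and the example keys are placeholders.

```python
def relative_differences(new_counts, ref_counts):
    """Return (new - ref) / ref for every key present in both tables.

    Both arguments are hypothetical dicts mapping a site category to a count;
    loading them from the real gw_counts files is left out on purpose.
    """
    diffs = {}
    for key, ref in ref_counts.items():
        # Skip categories missing from the new table or with a zero reference.
        if key not in new_counts or ref == 0:
            continue
        diffs[key] = (new_counts[key] - ref) / ref
    return diffs


if __name__ == "__main__":
    # Made-up numbers, only to show the call shape.
    mine = {"ACA": 90, "ACC": 180}
    reference = {"ACA": 100, "ACC": 200}
    for category, diff in sorted(relative_differences(mine, reference).items()):
        print(f"{category}\t{diff:+.1%}")
```

A consistent negative value across categories would point at a different region set rather than a bug in one category.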
In any case, I'm recalculating those now, and will open a pull request soon to rerun my analyses with the new counts. This is on branch recalculate_gw_counts #16
Yup, it's that. Opened a pull request and recalculating eeeeverything
Or... not. Blargh. Model still underestimates after all the recalculation. The change was a fraction of a percent.
Got it. There are some truncated files upstream in my analysis right after the filter_private step. This will require some serious retooling, but I'm pretty sure that's it.
The truncated files are:
chr 2, 3, 4, 6 from AFR
chr 1, 2, 3, 4, 5, 6 from EAS
chr 2, 3, 4, 5, 6 from SAS
chr 1, 2, 3, 4, 5, 6 from EUR
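A possible way to flag truncated files like these automatically, sketched under assumptions: the files are taken to be tab-delimited text with optional `#` header lines, and `looks_truncated` is a made-up helper, not something in the repo. It only catches files cut off mid-line; a file cut exactly on a line boundary would still pass.

```python
def looks_truncated(path, sep="\t"):
    """Heuristic truncation check for a delimited text file.

    Returns True if the file has no data lines, if its last data line has no
    trailing newline, or if that line has a different number of fields than
    the first data line.
    """
    expected_seps = None
    last_line = None
    with open(path, "r") as handle:
        for line in handle:
            if line.startswith("#"):  # assumed header/comment convention
                continue
            if expected_seps is None:
                expected_seps = line.rstrip("\n").count(sep)
            last_line = line
    if last_line is None:
        return True  # nothing but headers, or an empty file
    if not last_line.endswith("\n"):
        return True  # writer stopped partway through a line
    return last_line.rstrip("\n").count(sep) != expected_seps
```

Running this over the outputs of each pipeline step, starting right after filter_private, would narrow down where the truncation first appears.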
And let's not forget that anything COSMO, AF, AMR, or subpop-wise probably also has to be rerun.
Working on this on branch fix_truncation. #18
Need to check if this works now.
Ran polymorphism_predictor on all chromosomes and found that the number of polymorphisms the model expects seems to be systematically lower than the number observed.
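To make "systematic" concrete, here is a minimal sketch of one way to check it: compute the expected/observed ratio per chromosome and see whether every ratio falls below 1. The function name `underestimation_summary` and the dict inputs are placeholders, and the output format of polymorphism_predictor is not assumed.

```python
def underestimation_summary(expected, observed):
    """Return per-chromosome expected/observed ratios and a systematic flag.

    `expected` and `observed` are hypothetical dicts mapping chromosome name
    to a polymorphism count. The flag is True only when at least one ratio
    exists and every ratio is below 1.
    """
    ratios = {
        chrom: expected[chrom] / observed[chrom]
        for chrom in observed
        if chrom in expected and observed[chrom] > 0
    }
    systematic = bool(ratios) and all(r < 1 for r in ratios.values())
    return ratios, systematic
```

If only some chromosomes come out below 1, that would suggest a per-file problem rather than a model-wide one.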