Open bschilder opened 4 years ago
Good point! The code actually treats MAF as FREQ (it does not assume it's <0.5), so the name is wrong but the functionality is correct. I'll keep this issue open to remind myself to change the MAF name...
Got it, thanks for clarifying!
Was just looking back at this and realized we actually have some ways of addressing this now.
MungeSumstats does some inference of what each column means (including MAF/FRQ) and standardizes them. Specifically, these internal functions. Beyond this, the main exported function format_sumstats also has an arg that formats the sumstats to LDSC format automatically (format_sumstats(..., ldsc_format=TRUE)). We designed this pipeline to cover everything that mungesumstats.py does, and much much more, so perhaps it would be worth mentioning MungeSumstats as an alternative?
@Al-Murphy
@bschilder @Al-Murphy thanks, this is awesome!
I kept this ticket open for too long because I was afraid that changing the MAF column to another name would mess things up. But I'm happy to recommend that people use your code if it's more robust and actively maintained.
Would you mind writing a shell-command snippet demonstrating how to use your package to replace PolyFun's internal munge_sumstats script? I can put this in the wiki as a recommendation.
Great! Happy to put the shell script together, will share it as soon as it's ready.
In line 62 of munge_polyfun_sumstats.py I noticed that when the columns are renamed, freq and MAF seem to be treated the same. But couldn't these two things be different in a summary stats file? Perhaps one way would be to check if 1-freq ≤ 0.5, and if it is then you know it's the minor allele (and then can flip the ref/alt alleles and effect, though the specifics of this might depend on the particular file format).
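A minimal sketch of the check suggested above, not PolyFun's actual code. It assumes `freq` is the frequency of the effect allele and `beta` is that allele's effect size; `to_minor_allele` is a hypothetical helper name. It treats a frequency of exactly 0.5 as already minor, and it leaves out the ref/alt allele swap because that depends on the file format.

```python
def to_minor_allele(freq, beta):
    """Return (maf, beta, flipped) for one variant.

    Hypothetical helper: `freq` is assumed to be the effect-allele
    frequency and `beta` the effect size of that same allele.
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError(f"allele frequency out of range: {freq}")
    if freq > 0.5:
        # The effect allele is the major allele, so report the other
        # allele's frequency and flip the sign of the effect to match.
        return 1.0 - freq, -beta, True
    # The effect allele is already the minor allele (or tied at 0.5).
    return freq, beta, False


for freq, beta in [(0.2, 0.10), (0.8, 0.10), (0.5, -0.30)]:
    maf, new_beta, flipped = to_minor_allele(freq, beta)
    print(f"freq={freq:.2f} -> maf={maf:.2f}, beta={new_beta:+.2f}, flipped={flipped}")
```

A column that never exceeds 0.5 is consistent with MAF, but that alone does not prove it; a column with values above 0.5 cannot be MAF and has to be read as an allele frequency.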