gkichaev / PAINTOR_V3.0

Fast, integrative fine mapping with functional data

Posterior Probabilities not summing to integer after modifying LD script #16

Closed dkearsley closed 6 years ago

dkearsley commented 6 years ago

Hello,

I am running into an issue that is perhaps self inflicted, perhaps not, and I could use some help. I don't think it is a true Github issue, so if email is preferred, please let me know.

When running CalcLD_1KG_VCF.py, I do not want ambiguous SNPs removed, as that may remove SNPs of interest. I think GitHub user vlaufer commented here the other day with the same idea. As a result, I went into the CalcLD_1KG_VCF.py script and commented out the following lines in the Read_Locus function:

    #drop ambiguous snps from input data and warn!
    A1_allele = temp[effect_index].upper()
    A0_allele = temp[alt_index].upper()
    #if (A1_allele == "A" and A0_allele == "T") or (A1_allele == "T" and A0_allele == "A") or \
    # (A1_allele == "G" and A0_allele == "C") or (A1_allele == "C" and A0_allele == "G"):
    #    print("Warning! Ambiguous SNP (AT or GC) in input locus. Dropping row " + str(counter) + " from data:")
    #    print(temp)
    #else:
    all_positions.append(int(temp[pos_index]))
    all_data.append(temp)
file_stream.close()
return [all_positions, all_data]

The code runs to completion, and the .ld and .processed files are generated. However, when I run PAINTOR, the posterior probabilities do not sum to an integer. Using the exact same data but with the unmodified LD script, the posterior probabilities add up to an integer. I've attached a text file with the top results from both.

Are there further modifications to the LD script(s) that I should make to fix this problem? Any other ideas as to what is happening?

Thanks for the help!

dkearsle-UnexpectedPaintorResults.txt

ghost commented 6 years ago

@dkearsley you should not comment those lines of code out unless you are sure you can unambiguously align AT and GC SNPs in another way.

To my knowledge, the CalcLD script contains no mechanism to correctly align AT and GC SNPs whose MAF is near 0.5. Thus, if you elect to comment those lines out, you need a watertight method for assigning strand to AT and GC SNPs, which can be difficult.
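For reference, the check being discussed can be sketched compactly. This is a minimal illustration, not the code from CalcLD_1KG_VCF.py: the `COMPLEMENT` table and the `is_strand_ambiguous` helper are names introduced here. A SNP is strand-ambiguous when its two alleles are complements of each other, so the pair reads the same on either strand.

```python
# Complement of each base on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a1, a0):
    """Return True for A/T and C/G SNPs (hypothetical helper).

    Anything that is not a single A/C/G/T base (e.g. an indel)
    is reported as not ambiguous.
    """
    a1, a0 = a1.upper(), a0.upper()
    return COMPLEMENT.get(a1) == a0


pairs = [("A", "T"), ("g", "c"), ("A", "G"), ("C", "T")]
flags = [is_strand_ambiguous(a1, a0) for a1, a0 in pairs]
print(flags)
```

Only the first two pairs are flagged. For A/G or C/T SNPs a strand flip produces a different allele pair, so the mismatch is detectable and such SNPs can be aligned.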

The best such method is to use LD, as a program such as Genotype Harmonizer does, but this is only possible if you have genotype data for all of your populations. If you are working from association summary statistics, as many of us are, you cannot use this method.

If you don't have genotype data, I would in particular want manufacturer specs for the SNPs (e.g. if you are using Illumina 5M data, you could get the product description file and find out which strand the probe assays).

Alternatively, you can use another method such as MAF, but you would have to take special care to avoid SNPs whose major and minor alleles vary by population and ones whose MAF is near 0.5.
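A frequency-based check along these lines might look like the sketch below. It assumes the summary statistics report the frequency of the effect allele and that the reference panel reports the frequency of the same-named allele in a matched population. The function name `resolve_by_frequency` and the `margin` and `tol` thresholds are illustrative choices, not values taken from PAINTOR or any other tool.

```python
def resolve_by_frequency(study_freq, ref_freq, margin=0.1, tol=0.1):
    """Classify a strand-ambiguous SNP by allele frequency (hypothetical helper).

    study_freq: frequency of the effect allele in the study.
    ref_freq:   frequency of the same-named allele in the reference panel.
    Returns "same", "flipped", or "unresolved".
    """
    # Near 0.5 the two strands cannot be told apart by frequency.
    if abs(study_freq - 0.5) < margin or abs(ref_freq - 0.5) < margin:
        return "unresolved"
    # Frequencies agree: both data sets report the same strand.
    if abs(study_freq - ref_freq) <= tol:
        return "same"
    # Frequencies mirror each other: one data set is on the other strand.
    if abs(study_freq - (1.0 - ref_freq)) <= tol:
        return "flipped"
    # Neither pattern fits, e.g. frequencies differ between populations.
    return "unresolved"


print(resolve_by_frequency(0.20, 0.22))
print(resolve_by_frequency(0.80, 0.22))
print(resolve_by_frequency(0.48, 0.45))
```

The three calls give "same", "flipped", and "unresolved" respectively. Anything that comes back "unresolved" would still have to be dropped, which is why this approach reduces rather than removes the loss of ambiguous SNPs.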

VL

gkichaev commented 6 years ago

Two things. First, the values won't necessarily sum to an integer even with correct input, unless you only consider a single causal variant. The reason for this is slightly complicated, though for the purposes of the method it's actually a good thing. Intuitively, think of the case where you have 10 SNPs and consider 10 causal variants. If all the probabilities summed to 10, you would have gained no information.
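One way to see this is a toy enumeration. The sketch below is not PAINTOR's model: the `scores`, the per-SNP prior penalty, and the exclusion of the empty causal set are all assumptions made for illustration. Each SNP's marginal posterior is the total posterior mass of the causal sets that contain it, so the marginals sum to the expected number of causal SNPs, which is an integer only when every allowed set has the same size.

```python
from itertools import combinations


def marginal_posteriors(n_snps, max_causal, weight):
    """Per-SNP marginals from enumerating non-empty causal sets (toy model)."""
    configs = [c for k in range(1, max_causal + 1)
               for c in combinations(range(n_snps), k)]
    weights = [weight(c) for c in configs]
    total = sum(weights)
    marginals = [0.0] * n_snps
    for config, w in zip(configs, weights):
        for snp in config:
            # Each set adds its normalized weight to every SNP it contains.
            marginals[snp] += w / total
    return marginals


# Hypothetical per-SNP evidence scores; not derived from real data.
scores = [4.0, 2.0, 1.0, 0.5]


def toy_weight(config):
    """Unnormalized weight: product of scores times an assumed prior penalty."""
    w = 1.0
    for snp in config:
        w *= scores[snp]
    return w * 0.1 ** len(config)


one_causal = marginal_posteriors(4, 1, toy_weight)
two_causal = marginal_posteriors(4, 2, toy_weight)
print(sum(one_causal))  # 1 up to floating-point rounding
print(sum(two_causal))  # strictly between 1 and 2
```

With at most one causal SNP, every set has size 1 and the marginals sum to 1. Allowing sets of size 1 or 2 mixes the two sizes, so the sum is a weighted average that lands between 1 and 2.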

Second, echoing @vlaufer's comments, I discourage you from retaining strand-ambiguous SNPs unless you have the raw data and can verify the +/- strand.