Some problems in Computing LD-scores using existing functional annotation files

omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)

MIT License

85 stars 21 forks source link

Some problems in Computing LD-scores using existing functional annotation files #180

Closed Y-Isaac closed 5 months ago

Y-Isaac commented 6 months ago

HI,

Thanks for this useful tool.

I'm planning to perform functional informed fine-mapping on GWAS results from UKBiobank. Thus, I've directly used the annotation files provided on your wiki page 2 and plan to generate LDscore files based on these. However, due to differences in the variant QC process, there are some discrepancies in the variants.

Taking chromosome 22 as an example, the annotation file contains 271,699 variants, while my gene file contains 179,766 variants, with an overlap of 175,497 variants between them. But when I proceeded to create the LDscore file, polyfun.py indicated: "Keeping only 162,246/179,766 SNPs that have annotations." I noticed this phenomenon occurs across all 22 chromosomes. Upon closer inspection of the 175,497 - 162,246 process, I found that the filtered variants do exist in the annotation file, but they were not annotated by any annotation scheme; in other words, for these variants, the annotation results were "all zeros."

My question is, do you think that directly deleting such variants is the most appropriate approach? Would this not lead to a decrease in variant density and, consequently, a decrease in the resolution of fine-mapping? In my opinion, it seems better to assign these variants a lower, or even zero, prior probability.

Additionally, since I also need to calculate the ld.M file, should this file be calculated based on the variants present in the LDscore file (i.e., the filtered variants), rather than the annotation file?

Thank you in advance for your response!

omerwe commented 6 months ago

@Y-Isaac which annotations are you using? Some of the Baseline-LD annotations are MAF bins annotations, which assign each SNP a value of one if its MAF falls within a certain range, and zero otherwise. Since each SNP's MAF must fall within one bin, each SNP must get at least one non-zero annotation.

In general, you should aim to never delete any SNPs before fine-mapping, simply because you might accidentally delete the truly causal SNPs that you want to find.

Y-Isaac commented 6 months ago

@Y-Isaac which annotations are you using? Some of the Baseline-LD annotations are MAF bins annotations, which assign each SNP a value of one if its MAF falls within a certain range, and zero otherwise. Since each SNP's MAF must fall within one bin, each SNP must get at least one non-zero annotation.

In general, you should aim to never delete any SNPs before fine-mapping, simply because you might accidentally delete the truly causal SNPs that you want to find.

Hi omerwe,

The annotation file I applied is located at the top of the Polyfun wiki page2 “existing annotation files”, which corresponds to this sentence: We provide functional annotations for ~19 million UK Biobank imputed SNPs with MAF>0.1%, based on the baseline-LF 2.2.UKB annotations.

I rechecked those differential loci, and I found that most of them are due to the opposite alleles between the gene file and the annotation file, so the program cannot recognize them. Another part is due to the variant QC process adopted by the UKB official, which retains those variants with MAF<0.001 (but >1e-6) in the coding area, and these loci do not exist in the annotation file you provided. I guess to ensure the accuracy of fine-mapping, maybe I should add these missing loci back into the annotation file using the baseline-LF 2.2 model? Although this is indeed a bit difficult for me.

The last part of the differential loci is caused by the script’s filtering program. I am quite curious about the standard for this part of the removal, because I found that it was also carried out on chromosome 22, and I guess it should not be due to the HLA region.

In summary, thank you very much for your guidance!

omerwe commented 6 months ago

@Y-Isaac PolyFun should handle flipped alleles, so I'm not sure why the SNPs with flipped alleles were removed? When exactly were they removed?

In general, you should aim to match your SNPs as much as possible with those in the annotation files (for the same reason as before --- every SNP you remove is potentially a causal SNP, so you don't want to miss it).

I'm not sure what you mean about the script's filtering program?

Y-Isaac commented 6 months ago

@omerwe Sorry, I will try my best to describe what happened.

I attempted to apply polyfun to the UKB data. When creating the S-LDSC files, I found that nearly one million loci indicated "have no annotations," causing them to disappear from the ldscore files. However, I confirmed that the names of these loci indeed exist in the annotation file(and these SNPs are all INDEL, I'm not sure what happened). Therefore, I wondered if the mismatch between alleles in the gene file and the annotation file might be the cause. I tried swapping the alleles for these loci, setting the original REF allele as ALT and the original ALT allele as REF. Fortunately, this was effective, and I recovered this part of the "missing SNP."

The "script's filtering program" I referred to is observed in the log file. Taking chromosome 1 as an example, the log file generated by compute_ldscores.py shows: [WARNING] Not all SNPs have annotation values [WARNING] Keeping only 1026022/1045645 SNPs that have annotations [INFO] Applying SNPs filter... [INFO] After filtering, 1025875 SNPs remain [INFO] Applying initial ldscores loop [INFO] Applying main ldscores loop... I am particularly curious about what exactly happened in the process from 1026022 SNPs to 1025875 SNPs?

Thanks for your help!

omerwe commented 6 months ago

@Y-Isaac a few things:

First, please note that for indels, swapping the alleles actually changes the meaning of the indel. This is why Polyfun doesn't treat an indel with flipped alleles as if it were the same as what we have in the annotation files. See issues #41, #66, #147 for more info.

Second, the second filtering step in the log file removes SNPs that either have MAF=0 in the data (i.e., they are not really SNPs) or are 100% heterozygotic.

Hope this helps, please let me know if not.

Y-Isaac commented 6 months ago

@omerwe HI,

I think I get your idea! It's necessary to keep high caution with INDEL variants, and your meticulousness in script creation is admirable. Given both the annotation and gene files are from UKB, these INDEL discrepancies likely stem from "Allele flip" in previous work rather than actual differences. The --allow-swapped-indel-alleles option mentioned in #66 seems useful for this. Despite not being proficient in modifying Python scripts, I plan to try.

Further inspection of the differing INDELs suggests compute_ldscores.py compares the 5th column of the bim file with the 4th column of the annotation file (where the software expects them the REF allele). However, in my bim file, the REF allele is in the 6th column, not the 5th. I'm unsure if my speculation is correct (it seems so upon my examination. As previously mentioned, I'm not very proficient in Python...).

Thank you sincerely for your help; you've been incredibly supportive!

Best regards, Issac

Y-Isaac commented 6 months ago

Additionally, I've noticed that in the entire section "1. Computing prior causal probabilities with PolyFun," A1 is interpreted as the ref-allele. However, often in GWAS summary files, A1 is interpreted as the risk allele (more frequently equated with the alt-allele, especially in plink association analyses), which seems to inevitably lead to some inconsistencies in alleles for certain SNPs (including INDELs as well). Therefore, manually adjusting alleles might be necessary in the practice with PolyFun.

Is my understanding correct? Do you have any suggestions on this matter?

Best regards, Issac

omerwe commented 5 months ago

@Y-Isaac the definition of "risk allele" is subjective. If you consistently switch your risk allele with the reference allele, the only implication is that you'll have to multiply your effect size estimates by -1 (at least for the per-normalized SNP effect size).