cannot specify specific annotations when using multiple annotation files

cmlakhan commented 3 years ago

I am trying to run LDSC with multiple annotation files. I would like to specify the set of annotations from each file but I believe there is a bug in the code that does not allow that to happen. I noticed in the function

read_ld_sumstats

function the following code is problematic:

    if args.anno is not None:
        cols_to_keep = np.zeros(len(ref_ld.columns), dtype=np.bool)        
        annotations = args.anno.split(',')
        if np.any(~np.isin(annotations, ref_ld.columns.str[:-2])):
            raise ValueError('Not all annotations specified with --anno are found in the LD scores file')        
        cols_to_keep = (ref_ld.columns.str[:-2].isin(annotations)) | (ref_ld.columns.isin(['CHR', 'SNP']))
        assert np.sum(cols_to_keep) == len(annotations)+2
        M_cols_to_keep = ref_ld.drop(columns=['CHR', 'SNP']).columns.str[:-2].isin(annotations)
        assert np.sum(M_cols_to_keep) == len(annotations)
        ref_ld = ref_ld.loc[:, cols_to_keep]
        M_annot = M_annot[:, M_cols_to_keep]
        log.log('Keeping only annotations specified with --anno')

the code

ref_ld.columns.str[:-2]

is a problem because if there are multiple files then the ref_ld matrix will have column names with something like "L2_0" or "L2_1" instead of just L2. This causes a mismatch in the annotation names and the column names in the ref_ld column. I thought I could modify it by just putting a -4 throughout that part but that seems to cause an error with the following function call

overlap_matrix, M_tot = _read_annot(args, log)

It's not clear how I can easily fix this problem. I'm wondering if you have any suggestions for this situation.

Chirag Lakhani

omerwe commented 3 years ago

Hi Chirag,

Thanks for the bug report. I pushed a fix to the GitHub repo. Can you please git pull the latest version and see if it fixes the issue?

cmlakhan commented 3 years ago

Hi Omer,

Thanks for the quick push. I got an error in this block of code

    if args.anno is not None:
        cols_to_keep = np.zeros(len(ref_ld.columns), dtype=np.bool)        
        annotations = args.anno.split(',')
        if np.any(~np.isin(annotations, ref_ld.columns.str[:-2])):
            raise ValueError('Not all annotations specified with --anno are found in the LD scores file')        
        cols_to_keep = (ref_ld.columns.str[:-2].isin(annotations)) | (ref_ld.columns.str[:-4].isin(annotations)) | (ref_ld.columns.isin(['CHR', 'SNP']))
        assert np.sum(cols_to_keep) == len(annotations)+2
        cols_nochrsnp = ref_ld.drop(columns=['CHR', 'SNP']).columns
        M_cols_to_keep = (cols_nochrsnp.str[:-2].isin(annotations)) | (cols_nochrsnp.str[:-4].isin(annotations))
        assert np.sum(M_cols_to_keep) == len(annotations)
        ref_ld = ref_ld.loc[:, cols_to_keep]
        M_annot = M_annot[:, M_cols_to_keep]
        log.log('Keeping only annotations specified with --anno')

because

np.any(~np.isin(annotations, ref_ld.columns.str[:-2]))

should also have the -4. Once I made that changed it seemed to parse correctly but then I got an error when it reached this part of the code

    log.log(hsqhat.summary(ref_ld_cnames, P=args.samp_prev, K=args.pop_prev, overlap = args.overlap_annot))
    if args.overlap_annot:
        overlap_matrix, M_tot = _read_annot(args, log)

I get this error

Error parsing .annot file.

omerwe commented 3 years ago

Hi Chirag,

Sorry for this, I just updated the code again. Can you please git pull and try again? If it still doesn't work, can you please send me either (1) the full command and error message; or (2) examples files that will allow me to run this analysis on my own and replicate the error?

Best,

Omer

cmlakhan commented 3 years ago

Hi Omer it still didn't work. I will post both the command and error below. Later today, I can provide you with a minimal example that produces the error in case you want to hold off until I provide the data.

Command:

bl='/gpfs/commons/groups/knowles_lab/data/ldsc/1000G_Phase3_baselineLD_v2.2_ldscores/baselineLD.'
rm='/gpfs/commons/home/clakhani/data/annotations/roadmap_annotations/merged_annotations'
exp='/gpfs/commons/home/clakhani/data/annotations/expecto_annotations/annotations_split'
bat='/gpfs/commons/home/clakhani/data/annotations/brain_atac/merged_annotations'
dl_ali='/gpfs/commons/home/clakhani/data/annotations/brain_atac/deep_learning_annotations'

python ldsc.py  --h2 /gpfs/commons/home/clakhani/data/summary_stats/alzheimers/kunkle_2019/AD_Kunkle_etal_Stage1.parquet \
--ref-ld-chr $bl,$rm/roadmap_merged_DNase.chr,$exp/expecto_DNASE_,$dl_ali/dl_brain_atacseq_,$wave/regulo \
--w-ld-chr /gpfs/commons/home/clakhani/data/ldsc_data/weights_hm3_no_hla/weights. \
--overlap-annot \
--out /gpfs/commons/home/clakhani/data/summary_stats/output_dl_models/AD_Kunkle_monocyte_dl \
--overlap-annot \
--not-M-5-50 \
--anno base,Coding_UCSC,Coding_UCSC.flanking.500,Conserved_LindbladToh,Conserved_LindbladToh.flanking.500,CTCF_Hoffman,CTCF_Hoffman.flanking.500,DGF_ENCODE,DGF_ENCODE.flanking.500,DHS_peaks_Trynka,DHS_Trynka,DHS_Trynka.flanking.500,Enhancer_Andersson,Enhancer_Andersson.flanking.500,Enhancer_Hoffman,Enhancer_Hoffman.flanking.500,FetalDHS_Trynka,FetalDHS_Trynka.flanking.500,H3K27ac_Hnisz,H3K27ac_Hnisz.flanking.500,H3K27ac_PGC2,H3K27ac_PGC2.flanking.500,H3K4me1_peaks_Trynka,H3K4me1_Trynka,H3K4me1_Trynka.flanking.500,H3K4me3_peaks_Trynka,H3K4me3_Trynka,H3K4me3_Trynka.flanking.500,H3K9ac_peaks_Trynka,H3K9ac_Trynka,H3K9ac_Trynka.flanking.500,Intron_UCSC,Intron_UCSC.flanking.500,PromoterFlanking_Hoffman,PromoterFlanking_Hoffman.flanking.500,Promoter_UCSC,Promoter_UCSC.flanking.500,Repressed_Hoffman,Repressed_Hoffman.flanking.500,SuperEnhancer_Hnisz,SuperEnhancer_Hnisz.flanking.500,TFBS_ENCODE,TFBS_ENCODE.flanking.500,Transcr_Hoffman,Transcr_Hoffman.flanking.500,TSS_Hoffman,TSS_Hoffman.flanking.500,UTR_3_UCSC,UTR_3_UCSC.flanking.500,UTR_5_UCSC,UTR_5_UCSC.flanking.500,WeakEnhancer_Hoffman,WeakEnhancer_Hoffman.flanking.500,GERP.NS,GERP.RSsup4,MAFbin1,MAFbin2,MAFbin3,MAFbin4,MAFbin5,MAFbin6,MAFbin7,MAFbin8,MAFbin9,MAFbin10,MAF_Adj_Predicted_Allele_Age,MAF_Adj_LLD_AFR,Recomb_Rate_10kb,Nucleotide_Diversity_10kb,Backgrd_Selection_Stat,CpG_Content_50kb,MAF_Adj_ASMC,GTEx_eQTL_MaxCPP,BLUEPRINT_H3K27acQTL_MaxCPP,BLUEPRINT_H3K4me1QTL_MaxCPP,BLUEPRINT_DNA_methylation_MaxCPP,synonymous,non_synonymous,Conserved_Vertebrate_phastCons46way,Conserved_Vertebrate_phastCons46way.flanking.500,Conserved_Mammal_phastCons46way,Conserved_Mammal_phastCons46way.flanking.500,Conserved_Primate_phastCons46way,Conserved_Primate_phastCons46way.flanking.500,BivFlnk,BivFlnk.flanking.500,Human_Promoter_Villar,Human_Promoter_Villar.flanking.500,Human_Enhancer_Villar,Human_Enhancer_Villar.flanking.500,Ancient_Sequence_Age_Human_Promoter,Ancient_Sequence_Age_Human_Promoter.flanking.500,Ancient_Sequence_Age_Human_Enhancer,Ancient_Sequence_Age_Human_Enhancer.flanking.500,Human_Enhancer_Villar_Species_Enhancer_Count,Human_Promoter_Villar_ExAC,Human_Promoter_Villar_ExAC.flanking.500,E124_DNase,1932,dl_monocyte_dnase \
--print-coefficients

Error:

Read regression weight LD Scores for 1242190 SNPs.
After merging with reference panel LD, 1185439 SNPs remain.
After merging with regression SNP LD, 1076621 SNPs remain.
Removed 34 SNPs with chi^2 > 80 (1076587 SNPs remain)
Total Observed scale h2: 0.1007 (0.0241)
Categories: baseL2_0 Coding_UCSCL2_0 Coding_UCSC.flanking.500L2_0 Conserved_LindbladTohL2_0 Conserved_LindbladToh.flanking.500L2_0 CTCF_HoffmanL2_0 CTCF_Hoffman.flanking.500L2_0 DGF_ENCODEL2_0 DGF_ENCODE.flanking.500L2_0 DHS_peaks_TrynkaL2_0 DHS_TrynkaL2_0 DHS_Trynka.flanking.500L2_0 Enhancer_AnderssonL2_0 Enhancer_Andersson.flanking.500L2_0 Enhancer_HoffmanL2_0 Enhancer_Hoffman.flanking.500L2_0 FetalDHS_TrynkaL2_0 FetalDHS_Trynka.flanking.500L2_0 H3K27ac_HniszL2_0 H3K27ac_Hnisz.flanking.500L2_0 H3K27ac_PGC2L2_0 H3K27ac_PGC2.flanking.500L2_0 H3K4me1_peaks_TrynkaL2_0 H3K4me1_TrynkaL2_0 H3K4me1_Trynka.flanking.500L2_0 H3K4me3_peaks_TrynkaL2_0 H3K4me3_TrynkaL2_0 H3K4me3_Trynka.flanking.500L2_0 H3K9ac_peaks_TrynkaL2_0 H3K9ac_TrynkaL2_0 H3K9ac_Trynka.flanking.500L2_0 Intron_UCSCL2_0 Intron_UCSC.flanking.500L2_0 PromoterFlanking_HoffmanL2_0 PromoterFlanking_Hoffman.flanking.500L2_0 Promoter_UCSCL2_0 Promoter_UCSC.flanking.500L2_0 Repressed_HoffmanL2_0 Repressed_Hoffman.flanking.500L2_0 SuperEnhancer_HniszL2_0 SuperEnhancer_Hnisz.flanking.500L2_0 TFBS_ENCODEL2_0 TFBS_ENCODE.flanking.500L2_0 Transcr_HoffmanL2_0 Transcr_Hoffman.flanking.500L2_0 TSS_HoffmanL2_0 TSS_Hoffman.flanking.500L2_0 UTR_3_UCSCL2_0 UTR_3_UCSC.flanking.500L2_0 UTR_5_UCSCL2_0 UTR_5_UCSC.flanking.500L2_0 WeakEnhancer_HoffmanL2_0 WeakEnhancer_Hoffman.flanking.500L2_0 GERP.NSL2_0 GERP.RSsup4L2_0 MAFbin1L2_0 MAFbin2L2_0 MAFbin3L2_0 MAFbin4L2_0 MAFbin5L2_0 MAFbin6L2_0 MAFbin7L2_0 MAFbin8L2_0 MAFbin9L2_0 MAFbin10L2_0 MAF_Adj_Predicted_Allele_AgeL2_0 MAF_Adj_LLD_AFRL2_0 Recomb_Rate_10kbL2_0 Nucleotide_Diversity_10kbL2_0 Backgrd_Selection_StatL2_0 CpG_Content_50kbL2_0 MAF_Adj_ASMCL2_0 GTEx_eQTL_MaxCPPL2_0 BLUEPRINT_H3K27acQTL_MaxCPPL2_0 BLUEPRINT_H3K4me1QTL_MaxCPPL2_0 BLUEPRINT_DNA_methylation_MaxCPPL2_0 synonymousL2_0 non_synonymousL2_0 Conserved_Vertebrate_phastCons46wayL2_0 Conserved_Vertebrate_phastCons46way.flanking.500L2_0 Conserved_Mammal_phastCons46wayL2_0 Conserved_Mammal_phastCons46way.flanking.500L2_0 Conserved_Primate_phastCons46wayL2_0 Conserved_Primate_phastCons46way.flanking.500L2_0 BivFlnkL2_0 BivFlnk.flanking.500L2_0 Human_Promoter_VillarL2_0 Human_Promoter_Villar.flanking.500L2_0 Human_Enhancer_VillarL2_0 Human_Enhancer_Villar.flanking.500L2_0 Ancient_Sequence_Age_Human_PromoterL2_0 Ancient_Sequence_Age_Human_Promoter.flanking.500L2_0 Ancient_Sequence_Age_Human_EnhancerL2_0 Ancient_Sequence_Age_Human_Enhancer.flanking.500L2_0 Human_Enhancer_Villar_Species_Enhancer_CountL2_0 Human_Promoter_Villar_ExACL2_0 Human_Promoter_Villar_ExAC.flanking.500L2_0 E124_DNaseL2_1 1932L2_2 dl_monocyte_dnaseL2_3
Lambda GC: 9.2948e-07
Mean Chi^2: 1.0132e-06
Intercept: 1.0251 (0.0116)
Ratio: NA (mean chi^2 < 1)
Reading annot matrix from /gpfs/commons/groups/knowles_lab/data/ldsc/1000G_Phase3_baselineLD_v2.2_ldscores/baselineLD.,/gpfs/commons/home/clakhani/data/annotations/roadmap_annotations/merged_annotations/roadmap_merged_DNase.chr,/gpfs/commons/home/clakhani/data/annotations/expecto_annotations/annotations_split/expecto_DNASE_,/gpfs/commons/home/clakhani/data/annotations/brain_atac/deep_learning_annotations/dl_brain_atacseq_,/gpfs/commons/groups/knowles_lab/data/ldsc/wavenet/regulo[1-22] ...
  0%|                                                                                                                                 | 0/22 [00:05<?, ?it/s]
Error parsing .annot file.
Traceback (most recent call last):
  File "ldsc.py", line 690, in <module>
    sumstats.estimate_h2(args, log)
  File "/gpfs/commons/home/clakhani/gitProjects/polyfun/ldsc_polyfun/sumstats.py", line 355, in estimate_h2
    overlap_matrix, M_tot = _read_annot(args, log)
  File "/gpfs/commons/home/clakhani/gitProjects/polyfun/ldsc_polyfun/sumstats.py", line 103, in _read_annot
    'annot matrix', ps.annot, frqfile=args.frqfile_chr, anno=annotations)
  File "/gpfs/commons/home/clakhani/gitProjects/polyfun/ldsc_polyfun/sumstats.py", line 160, in _read_chr_split_files
    out = parsefunc(_splitp(chr_arg), _N_CHR, **kwargs)
  File "/gpfs/commons/home/clakhani/gitProjects/polyfun/ldsc_polyfun/parse.py", line 278, in annot
    for i, fh in enumerate(fh_list)]
  File "/gpfs/commons/home/clakhani/gitProjects/polyfun/ldsc_polyfun/parse.py", line 278, in <listcomp>
    for i, fh in enumerate(fh_list)]
  File "/gpfs/commons/home/clakhani/gitProjects/polyfun/ldsc_polyfun/parse.py", line 171, in annot_parser
    assert a in df_annot.columns
AssertionError

Analysis finished at Wed Apr 28 09:27:41 2021
Total time elapsed: 3.0m:12.400000000000006s
Traceback (most recent call last):
  File "ldsc.py", line 690, in <module>
    sumstats.estimate_h2(args, log)
  File "/gpfs/commons/home/clakhani/gitProjects/polyfun/ldsc_polyfun/sumstats.py", line 355, in estimate_h2
    overlap_matrix, M_tot = _read_annot(args, log)
  File "/gpfs/commons/home/clakhani/gitProjects/polyfun/ldsc_polyfun/sumstats.py", line 103, in _read_annot
    'annot matrix', ps.annot, frqfile=args.frqfile_chr, anno=annotations)
  File "/gpfs/commons/home/clakhani/gitProjects/polyfun/ldsc_polyfun/sumstats.py", line 160, in _read_chr_split_files
    out = parsefunc(_splitp(chr_arg), _N_CHR, **kwargs)
  File "/gpfs/commons/home/clakhani/gitProjects/polyfun/ldsc_polyfun/parse.py", line 278, in annot
    for i, fh in enumerate(fh_list)]
  File "/gpfs/commons/home/clakhani/gitProjects/polyfun/ldsc_polyfun/parse.py", line 278, in <listcomp>
    for i, fh in enumerate(fh_list)]
  File "/gpfs/commons/home/clakhani/gitProjects/polyfun/ldsc_polyfun/parse.py", line 171, in annot_parser
    assert a in df_annot.columns
AssertionError

cmlakhan commented 3 years ago

Hi Omer,

This minimal example reproduces the error

python ldsc.py  --h2 $sumstats/AD_Kunkle_etal_Stage1.parquet \
--ref-ld-chr $bl/baselineLD.,$dl_annot/dl_brain_atacseq_ \
--w-ld-chr $weights/weights. \
--overlap-annot \
--out $out/AD_Kunkle_monocyte_dl \
--overlap-annot \
--not-M-5-50 \
--anno base,Coding_UCSC,Coding_UCSC.flanking.500,dl_monocyte_dnase \
--print-coefficients

I have included the summary statistics, baselineV2.2 annotation, deep learning annotation, weights files in this file.

You can replace $sumstats, $bl, $dl_annot, $weights, and $out with the appropriate directories for the corresponding files. These are based on the 1000 genomes EUR SNPs.

Link the the file with the appropriate data (it is 1.1GB)

https://www.dropbox.com/s/lpvl7jtawjyig7r/omer.zip?dl=0

omerwe commented 3 years ago

Hi Chirag,

Thanks, this was very useful. I've now fixed the bug and pushed it GitHub. Please git pull and try again.

Thanks,

Omer

cmlakhan commented 3 years ago

Great, thanks a bunch!

omerwe / polyfun

cannot specify specific annotations when using multiple annotation files #45