kkdey / GSSG

Gene Set + S2G strategy annotations analyzed for disease architecture
45 stars 12 forks source link

Several quetions: NMF, Escore to p value, MAGMA gene prioritization #27

Open KoichiHashikawa opened 1 year ago

KoichiHashikawa commented 1 year ago

Hi @kkdey,

I have following questions on GSSG pipeline. It is great if I can hear when you have a chance.

  1. Regarding the generation of gene program .txt file from .csv. https://github.com/kkdey/GSSG STEP1. This seems compatible with converting .csv files generated at https://github.com/karthikj89/scgenetics, but not with NMF .csv files. Do you have the NMF version of GSSG?

  2. Escore and pvalue. We have computed Escores using the same scdata. I wonder how the p-values are computed from here. The postprocessing LDSC (postprocess_ldsc_sclinker.R) produced table for E value and p.E. Initially, I thought this p.E is the p-value reported in the paper, but the values are too small and this may not be the one. It helps if there are any scripts for p-value computations for each gene program.

Another question is regarding E-value computation. Our E-value/tau results slightly differs from the one found in https://alkesgroup.broadinstitute.org/LDSCORE/Jagadeesh_Dey_sclinker/sclinker_results/ I wonder 1) if list of summary stats data in sumstats_use.txt file could affect the results. Do I need to include all files in https://alkesgroup.broadinstitute.org/LDSCORE/all_sumstats/ or can I choose only the traits of interests (e.g. in IBD case: PASS_CD_deLange2017.sumstats PASS_IBD.sumstats PASS_IBD_deLange2017.sumstats PASS_UC_deLange2017.sumstats PASS_Ulcerative_Colitis.sumstats vs include all sumstats) 2) if the selection of tissue could affect the results. e.g. enhancer_tissue="BRN" For instance, for UC data, it seems you compute both BLD and GI versions. Does the LDSC results differ if I only choose GI? 3) which annotations to use? our results give "100kb","ABC_Road_ALL", and "ABC_Road_GI" for UC where your results have "ABC_Road_I_BLD" and "ABC_Road_I_GI". Should I use E-values in "ABC_Road_GI", not "100kb? Also again, do I need to compute BLD? 4) which trait to use when you have multiple stats summary for the same indication? (e.g. PASS_UC_deLange2017, PASS_Ulcerative_Colitis. Those two gave slightly different results. Is the latter originally curated at Broad?)

  1. I do not see the scripts of MAGMA for gene prioritization for the gene programs computed by sc-linker. Do you have scripts using MAGMA? In addition, I was not entirely sure by reading the paper, but is MAGMA used to determine the top driving genes in data_file_S4? How these top driving genes are computed?

Related to this, I wonder how we can go from E-score and p-value of cell type programs.

  1. Is sc-linker capable of identifying a subset of genes in the gene program, which are most likely affected by the SNP in the enhancer regions?
  2. can we infer which gene variants in the enhancer are affecting the expression of gene programs?
  3. or sc-linker is primarily determining gene programs (derived from single cell RNAseq data), which are regulated by the disease-SNP in enhancer regions?

Thank you so much for your time! Koichi

kkdey commented 1 year ago

@KoichiHashikawa

  1. The format of the .txt file is a general file format for any gene program. That gene program can come from cell-type program, disease progression, NMF or anything else (like PPI-Enhancer for Dey et al 2022 Cell Genomics). The NMF .csv file may have multiple columns for multiple factors, you can extract the columns into these separate .txt files (check the paper Methods section as to how we define the NMF gene program file).

  2. https://github.com/kkdey/GSSG#calculating-e-scores : see here for how to calculate the E-score anf p.Escores. The postprocess_ldsc_sclinker.R looks at meta-analyzed tau-star and enrichment, which is different from the Escore and p.Escore.

  3. The list of traits in that folder is much bigger than the 60 relatively independent traits we use (this is used in other papers outside of sc-linker). You can find the list of traits we use from the Supplementary Tables in the sclinker paper.

  4. LDSC results will be different for different tissue enhancer links.

  5. Use ABC_Road_GI in general. That is the sclinker model. The 100kb corresponds to the SEG model (Finucane et al 2018 Nat Genet) which we show to be inferior to ABC_Road_GI.

  6. You can check from the Supplementary Table, which one we used. Yes, there can be differences between these two sumstats.

  7. MAGMA is a standalone software (https://ctg.cncr.nl/software/magma). If you are looking for MAGMA gene-level z scores, you can check the MAGMA z-score file under processed_data.

  8. sc-linker is primarily doing (3) determining gene programs (derived from single cell RNAseq data), which are regulated by the disease-SNP in enhancer regions, but sclinker + MAGMA also does (1). (2) may be less obvious.

KoichiHashikawa commented 1 year ago

@kkdey Thanks so much for the detailed answers! I will compute E score, p.Escore and MAGMA to see consistency with publication.

KoichiHashikawa commented 1 year ago

@kkdey Thanks again for the helps. I could now replicate the E values of disease programs.

I still have some to clarify.

  1. sE-score (program, trait) := sqrt(S1^2 + S2^2 + 2cor(SNP annotation from Gene Program x Enhancer-gene strategy, SNP annotation from All protein-coding genes x Enhancer-gene)S1S2) Can you elaborate 2cor(SNP annotation from Gene Program x Enhancer-gene strategy, SNP annotation from All protein-coding genes x Enhancer-gene)S1S2 part? my understanding is: S1: se(E) in a given cell type of a given trait S2: se(E) in ALL of a given trait cor(SNP annotation from Gene Program x Enhancer-gene strategy, SNP annotation from All protein-coding genes x Enhancer-gene) is this the correlation between se(E) in all traits in a given cell type and se(E) in all traits in ALL? you still need to multiply this corS1S2?

  2. pvalue One E-score and sE-score are calcualted, zE-score :=. E-score/sE-score to compute z score and then pnorm(zE-score , 0, 1) to compute pvalue? In your supplementary table S2, many pEscore is below 0.05. Does this mean that many cell type are significantly associated with disease?

  3. MAGMA To generate MAGMA-top driving genes, do you use all significant genes in the gene program as an input without weight?

zyfxxqw commented 1 year ago

@KoichiHashikawa I benefited from your questions,Can you tell me the exact code for Escore calculation?

ljhklcy commented 6 months ago

@KoichiHashikawa , Have you figured out how to compute the sE-score? And I think "zE-score :=. E-score/sE-score" should actually be "zE-score=E-score/sE-score"? I'd appreciate it a lot if you may kindly help to teach me how to calculate these values or kindly help to share the codes to calculate the two.