snehamitra / SCARlink


Comparing peak-to-gene linkages across cell types? #14

Open Jia340 opened 1 month ago

Jia340 commented 1 month ago

Hi authors,

I am wondering what metric could be used to compare the same peak-to-gene linkage across different cell types? According to the tutorial, "the Shapley values converted to z-score are an estimate of predicted gene-tile linkage". On the other hand, p-values and FDRs are "more appropriate for ordering the gene-linked tiles across all genes and celltypes".

I have several cases where the same linkage passes the FDR < 1e-3 and z-score > 0.05 thresholds in two cell types. In one example, there's a known significant differential peak that is higher in celltype1 than in celltype2. In celltype1, z-score = 4.6 and FDR = 4e-5; in celltype2, z-score = 0.07 and FDR = 2.5e-9. On the visualization plot, the blue dot is much darker in celltype1. However, the FDR is more significant in celltype2.

I just wonder (1) which metric should be used for comparison here, and (2) if the z-score can be used, can I use the difference between z-scores to indicate the difference in linkage strength between celltype1 and celltype2?

Thank you!

snehamitra commented 1 month ago

The FDR significance is largely driven by how accessible the tile is. We found FDR to be useful when ranking gene-linked tiles across different genes.

But the z-scores are useful for comparing gene-linked tiles for the same gene across cell types. Based on the values you shared, I would assume the tile is accessible in both celltype1 and celltype2 but is predicted to be strongly linked only in celltype1, where the z-score is high. It would make sense to use the difference between z-scores to quantify the difference in linkage strength.
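As a minimal sketch of that comparison, the snippet below puts the z-scores for the same gene-tile pair side by side and subtracts them. The `z-score` and `FDR` column names match the output shown later in this thread; the `gene`, `tile`, and `celltype` columns, the gene and tile labels, and the helper `zscore_difference` are assumptions made for illustration, not part of SCARlink. The numbers are the ones quoted in the question.

```python
import pandas as pd

# Toy long-format table. 'gene', 'tile', and 'celltype' are assumed column
# names; the z-score and FDR values are the ones quoted in the question.
links = pd.DataFrame({
    "gene": ["GENE_A", "GENE_A"],
    "tile": ["chr1:1000-1500", "chr1:1000-1500"],
    "celltype": ["celltype1", "celltype2"],
    "z-score": [4.6, 0.07],
    "FDR": [4e-5, 2.5e-9],
})


def zscore_difference(df, a, b):
    """Hypothetical helper: z-score in cell type `a` minus z-score in `b`,
    computed per (gene, tile) pair."""
    # One row per (gene, tile), one column per cell type.
    wide = df.pivot_table(index=["gene", "tile"], columns="celltype",
                          values="z-score")
    return (wide[a] - wide[b]).rename(f"z_diff_{a}_minus_{b}")


diff = zscore_difference(links, "celltype1", "celltype2")
print(diff)
```

A positive difference means the tile is predicted to be more strongly linked in the first cell type. Pairs missing from either cell type come out as NaN rather than being silently dropped.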

Jia340 commented 3 weeks ago

Thank you for the response! I have a follow-up question: when I compare gene-linked tiles for the same gene across cell types/conditions, I observe that some regions have large z-scores but FDR = 1 (please see below; cond1 and cond2 are the two conditions I am comparing). I take FDR = 1 to mean that the site is not necessary for predicting target gene expression (removing it won't change the predicted expression level), yet the large z-score suggests it is still a strong peak linked to the gene. The sites with FDR = 1 do have lower ATAC signal, but they are still obvious peak regions.

My questions are (1) how should I interpret the discrepancy between the FDR and the z-score, and (2) if my goal is to compare two conditions and see whether a site/peak has a stronger link to a gene in either condition, is FDR filtering still necessary?

[Image: Screenshot 2024-08-16 at 12 46 05 PM]

snehamitra commented 1 week ago

Since the FDR is sensitive to the accessibility of the tile, it can be 1 if the overall accessibility in the tile is too sparse. In such cases, even if the z-score is high, the linkage could be a false positive.

For instance, in the following example data set, tiles with FDR = 1 are on average sparser than tiles with FDR < 0.05.

>>> df[(df['FDR'] == 1)]['test_acc_sparsity'].mean()
0.0035214338560441245
>>> df[(df['z-score'] > 1) & (df['FDR'] == 1)]['test_acc_sparsity'].mean()
0.006309513875421772
>>> df[(df['z-score'] > 1) & (df['FDR'] < 0.05)]['test_acc_sparsity'].mean()
0.08522307286150335

You can use the z-scores to compare the strength of the gene-linked tiles, but filtering on FDR lets you restrict the comparison to tiles that are not too sparse.
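One way this could look in practice is sketched below: compute the z-score difference between two conditions for each gene-tile pair, then keep only the pairs that pass an FDR cutoff. The `gene`, `tile`, and `condition` columns, the toy values, and the helper `compare_conditions` are hypothetical. Keeping pairs that pass the cutoff in at least one condition is an assumption of this sketch, not a rule stated above; requiring both conditions to pass would be the stricter alternative.

```python
import pandas as pd

# Toy long-format table; column names other than 'z-score' and 'FDR'
# are assumptions, and all values are made up for illustration.
links = pd.DataFrame({
    "gene": ["GENE_A"] * 6,
    "tile": ["tile1", "tile1", "tile2", "tile2", "tile3", "tile3"],
    "condition": ["cond1", "cond2"] * 3,
    "z-score": [3.0, 0.5, 2.0, 0.2, 2.5, 1.0],
    "FDR": [1e-4, 1e-3, 1.0, 1.0, 1.0, 0.01],
})


def compare_conditions(df, cond_a, cond_b, fdr_cutoff=0.05):
    """Hypothetical helper: z-score difference per (gene, tile), restricted
    to pairs that pass the FDR cutoff in at least one condition."""
    keys = ["gene", "tile"]
    z = df.pivot_table(index=keys, columns="condition", values="z-score")
    fdr = df.pivot_table(index=keys, columns="condition", values="FDR")
    out = pd.DataFrame({
        "z_diff": z[cond_a] - z[cond_b],
        "pass_a": fdr[cond_a] < fdr_cutoff,
        "pass_b": fdr[cond_b] < fdr_cutoff,
    })
    # Assumption: drop pairs that fail the cutoff in both conditions,
    # since a high z-score on a very sparse tile may be a false positive.
    return out[out["pass_a"] | out["pass_b"]]


result = compare_conditions(links, "cond1", "cond2")
print(result)
```

The `pass_a` and `pass_b` flags stay in the output so that a pair such as `tile3`, which has a large z-score in a condition where its FDR is 1, can be read with appropriate caution.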