Construct TF-CRE-Gene Regulon by combining all genomic features

lhqing commented 2 years ago

TF-CRE-Gene Regulon

TF - CRE link (see motif enrichment analysis below)
- [x] Scan motifs: ecker-hanqing-analysis/220924-dmr-motif-scan; cemba.get_mc_dmr_ds(add_motif=True) # to get DMR RegionDS with motif scan matrix
- [x] Perform motif enrichment for each cluster and for meaningful cluster groups
- [x] Create graph adjacency matrix for TF-CRE
Gene - CRE link
- [ ] correlation
- [ ] GBT model predictability
- [x] 3C physical approximation #16
- [ ] Create graph adjacency matrix for Gene-CRE
Gene - TF link
- [x] correlation
- [ ] GBT model predictability
- [ ] Create adjacency matrix for Gene-TF
Construct Regulon
- [ ] link eRegulon by raw adjacency matrix
- [ ] filter by GSEA Leading edge analysis
- [ ] quantification in each cluster
- [ ] QC regulon
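
A minimal sketch of how the "link eRegulon by raw adjacency matrix" step could work once the three adjacency matrices exist. This is an assumption about the linking logic, not the implementation used in this repo: it multiplies a binary TF-by-CRE matrix with a binary CRE-by-Gene matrix to count the CREs that connect each TF to each gene, then optionally keeps only the pairs that are also present in the Gene-TF adjacency. The function name `link_tf_gene_via_cre` and the toy matrices are hypothetical.

```python
import numpy as np
from scipy import sparse


def link_tf_gene_via_cre(tf_cre, cre_gene, tf_gene=None):
    """Count the CREs that connect each TF to each gene (hypothetical helper).

    tf_cre:   (n_tf, n_cre) binary adjacency, 1 if the TF has a motif hit in the CRE
    cre_gene: (n_cre, n_gene) binary adjacency, 1 if the CRE is linked to the gene
    tf_gene:  optional (n_tf, n_gene) binary adjacency, e.g. from correlation
    """
    tf_cre = sparse.csr_matrix(tf_cre)
    cre_gene = sparse.csr_matrix(cre_gene)
    # Matrix product: entry (t, g) is the number of CREs shared by TF t and gene g
    support = tf_cre @ cre_gene
    if tf_gene is not None:
        # Keep only TF-gene pairs that the Gene-TF adjacency also supports
        support = support.multiply(sparse.csr_matrix(tf_gene))
    return sparse.csr_matrix(support)


# Toy example: 2 TFs, 3 CREs, 2 genes
tf_cre = np.array([[1, 1, 0],
                   [0, 0, 1]])
cre_gene = np.array([[1, 0],
                     [1, 1],
                     [0, 1]])
tf_gene = np.array([[1, 0],
                    [1, 1]])
support = link_tf_gene_via_cre(tf_cre, cre_gene, tf_gene)
print(support.toarray())
```

Keeping the matrices sparse matters at the real scale (thousands of TFs and genes, many more DMRs), since most TF-CRE and CRE-Gene pairs are unlinked.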

lhqing commented 2 years ago

Motif Enrichment Analysis

1. Use the pycistarget motif collection (4096 motifs / motif clusters with a mouse TF annotation) to scan DMR regions (slop -b 150), resulting in a motif-by-DMR dataset that can be loaded with the DMR RegionDS.
2. Determine the region sets for motif enrichment analysis, with min_cov = 5 and quantile = 0.1:
   a. For each sample, after filtering on cov > min_cov, take the top and bottom quantile DMRs as the sample-based hypo- and hyper-DMRs (hypo-DMR status -1, hyper-DMR status 1, no status 0).
   b. For each DMR, after filtering on cov > min_cov, take the top and bottom quantile samples as the DMR-based hypo- and hyper- calls, with the same status coding as in a.
   c. Combine the hypo- and hyper- status from a and b: keep the status only where a == b, otherwise save 0.
   d. A "sample" can be any of the L4Regions or the SubCluster labels.
3. For each set identified in step 2, run motif enrichment with the DEM and CisTarget methods, and take the consistently hypo- or hyper-enriched motifs.
4. Combine all results to get the cistrome of all TFs by all regions.
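
The region-set selection in step 2 can be sketched roughly as below. This is an illustration under stated assumptions, not the code that produced the results: the input is taken to be a sample-by-DMR matrix of mC fractions plus a matching coverage matrix, "hypo" is taken to be the bottom quantile of the mC fraction and "hyper" the top quantile, and the function name `quantile_status` and the random toy data are hypothetical.

```python
import numpy as np


def quantile_status(frac, cov, axis, min_cov=5, quantile=0.1):
    """Return -1 (hypo), 1 (hyper) or 0 (no status) for a sample-by-DMR matrix.

    axis=1: thresholds are computed per sample, across DMRs (step a)
    axis=0: thresholds are computed per DMR, across samples (step b)
    """
    # Entries failing the coverage filter become NaN and never get a status
    masked = np.where(cov > min_cov, frac, np.nan)
    low = np.nanquantile(masked, quantile, axis=axis, keepdims=True)
    high = np.nanquantile(masked, 1 - quantile, axis=axis, keepdims=True)
    status = np.zeros(masked.shape, dtype=int)
    status[masked <= low] = -1   # assumed: lowest mC fraction = hypo
    status[masked >= high] = 1   # assumed: highest mC fraction = hyper
    return status


# Toy data: 20 samples x 50 DMRs
rng = np.random.default_rng(0)
frac = rng.random((20, 50))
cov = rng.integers(0, 30, size=(20, 50))

per_sample = quantile_status(frac, cov, axis=1)  # step a
per_dmr = quantile_status(frac, cov, axis=0)     # step b
# step c: keep a call only where both views agree
status = np.where(per_sample == per_dmr, per_sample, 0)
```

Requiring agreement between the per-sample and per-DMR calls removes DMRs that are extreme within one sample only because that sample is globally shifted, and vice versa.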

Comments after completion

CisTarget hypo enrichment shows good cell type specificity, while hyper shows little enrichment. DEM does not work well: motifs are uniformly enriched, with no clear cell type specificity. The final motif hits will therefore be the CisTarget hypo-DMR motif hits.

TF Example

tf_nes = dmr_ds["tf_nes"].to_pandas() # use NES per TF, aggregated as the max over its motifs
cemba.get_gene_fracs() # TF mCH frac
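
For reference, a small sketch of what "NES per TF, aggregated from max(motifs)" could look like in pandas. The motif names, cluster names, NES values, and the one-TF-per-motif mapping are all made up for illustration; in the real annotation a motif can map to several TFs, which would need an exploded mapping table instead of a simple Series.

```python
import pandas as pd

# Hypothetical motif-by-cluster NES table
motif_nes = pd.DataFrame(
    {"clusterA": [3.2, 1.0, 4.5], "clusterB": [0.5, 2.8, 0.1]},
    index=["motif1", "motif2", "motif3"],
)
# Hypothetical motif -> TF annotation (one TF per motif for simplicity)
motif_to_tf = pd.Series(
    {"motif1": "Neurod2", "motif2": "Neurod2", "motif3": "Sox10"}
)

# For each TF, keep the highest NES among its motifs in every cluster
tf_nes_example = motif_nes.groupby(motif_to_tf).max()
print(tf_nes_example)
```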

How to get results

dmr_ds = cemba.get_mc_dmr_ds(add_motif_hits=True)

# final TF-by-DMR hits (or adj matrix for the TF DMR graph)
dmr_ds['tf_gene_hits']

rachelzeng98 commented 2 years ago

TF_GENE correlation

Get the L4Region TF and gene matrices:

a. get TF names from mm10.get_tf_gene_table()
b. get the mC fraction matrix by applying add_mc_frac to wmb.cemba.CEMBA_SNMC_CLUSTER_L4Region_SUM_ZARR_PATH
c. output: TF-by-L4Region and gene-by-L4Region mC fraction matrices

Correlation: 1838 TFs, 4673 L4Regions, and 30370 genes

a. Pearson correlation between each gene and each TF
b. null distribution: shuffle the sample order in the TF-by-L4Region matrix and recompute the Pearson correlation
c. use a filter cutoff = 0.3 to decide whether a TF and a gene are connected
d. output: TF-by-gene correlation matrix and adjacency matrix
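
The correlation steps above can be sketched with plain NumPy as follows. This is a toy-scale illustration, not the code behind the zarr output: the matrices are random, the helper `pearson_rows` is hypothetical, and applying the 0.3 cutoff to the absolute correlation is an assumption (the cutoff could equally be applied to positive values only).

```python
import numpy as np


def pearson_rows(a, b):
    """Pearson correlation between every row of a and every row of b."""
    # Center and scale each row to unit norm; the dot product is then Pearson r
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


rng = np.random.default_rng(0)
n_region = 200                      # stands in for the 4673 L4Regions
tf = rng.random((5, n_region))      # TF-by-L4Region mC fractions (toy)
gene = rng.random((8, n_region))    # gene-by-L4Region mC fractions (toy)
gene[0] = 2 * tf[0] + 0.01 * rng.random(n_region)  # plant one correlated pair

# a. TF-by-gene Pearson correlation
corr = pearson_rows(tf, gene)

# b. null distribution: shuffle the sample (region) order of the TF matrix
shuffled = tf[:, rng.permutation(n_region)]
null_corr = pearson_rows(shuffled, gene)

# c. adjacency from the cutoff (absolute value is an assumption)
cutoff = 0.3
adj = np.abs(corr) > cutoff
print(adj.sum(), "links;", "null 99th percentile:", np.quantile(np.abs(null_corr), 0.99))
```

Comparing the cutoff with a high percentile of `null_corr` is one way to check that 0.3 sits well outside what shuffled data produce.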

What the results look like

Histplot showing the distribution of Pearson correlation values
Scatterplot showing TF and gene expression
TF and gene expression plotted on L4Region clusters

How to get results

All results are combined into one zarr: ecker-rachel-analysis/tf_gene/TF_Gene_Corelation.zarr
