Open dktanwar opened 4 years ago
code for the heatmap:
e <- readRDS("~/bioinfo/mansuy/irina/overlap_matrix.rds")
e$isProm <- (abs(e$distanceToTSS)<5000)/2+(abs(e$distanceToTSS)<2500)/2
fields <- c("isProm","diffAccessibility-logFC", "RNA_PND8_vs_PND15_logFC",
"RNA_PND15_vs_Adult_logFC","BS_PND7_meth","BS_PND14_meth","BS_PNW8_meth",
grep("ChIP", colnames(e), value=TRUE)
)
e2 <- e[,fields]
e2 <- do.call(cbind, lapply(e2,as.numeric))
b <- SEtools::getBreaks(e2, split.prop = 0.96, 100)
cols <- colorRampPalette(c("blue","black","yellow"))(101)
e2 <- SEtools::sortRows(e2, z = FALSE)
prom <- e2[e2[,1]>0,]
distal <- e2[e2[,1]==0,]
pheatmap(prom, color = cols, breaks = b, cluster_cols=FALSE, cluster_rows = FALSE)
Overlap structure in Irina's understanding, please feel free to correct me/add things:
1st level division of ATAC-seq regions:
Distal regions (more than 5kb away from TSS)
Active promoter (H3K4me3 +)
Inactive promoter (H3K4me3 -)
2nd level division: For the active/inactive promoter groups further categorize based on the direction of the gene expression (RNAseq) vs direction of chromatin accessibility (ATAC-seq) - take a min FC of 20% based on the RNA-seq data
Regions that become more accessible in adulthood and correlate with genes increased in expression
Regions that become more accessible in adulthood and for which the associated genes are not expressed/not changing expression - not sure which of the 2 the N/A standed for? @dktanwar
Regions that become more accessible in adulthood and correlate with genes decreased in expression
Regions that become less accessible in adulthood and correlate with genes decreased in expression
Am I missing any other category from the heatmap? @plger
3rd level division: Split regions further based on the presence of H3K27ac and presence of H3K4me3 (inactive enhancer)/H3K27ac and absence of H3K4me3 (active enhancer) Question: do we do this 3rd level division only for distal regions? @plger
hadn't we said >2.5kb from any TSS?
I just realized that we have H3K4me3 data for only one of the timepoint, so it's unwise to use this as a level 1 classification. Suggestion:
Level 1:
Level 2:
Level 3:
Level 4:
(now 36 sets instead of 32, again with many that will be empty)
Sorry, dunno why I wrote 5kb, stuck to my brain and that was that...at level 3 you mean genes not regions right? Aside from that, yap, the division sounds good to me! Let's see what we get out of it :)
@plger @dktanwar
I have 2 Issues which are unclear to me after starting to look through the output sheets:
According to our 4-level classification, there should be 3 output sheets for level 4:
When looking into the Distal reg_AccessibUp sheet I see different no of regions marked by the histone marks than in Distal reg_AccessibUp_H3K4me3 or the Distal reg_AccessibUp_H3K27ac sheets (I look at the TRUE/FALSE annotation of the histone mark columns) - why is this? Are these histone columns to be ignored in the Distal reg_AccessibUp sheet because the classification is only at level 2 and not 4? If so, why do we have this histone mark columns with TRUE/FALSE in there?
@dktanwar @plger Here is my sum up from today's meeting:
Output excel sheets are only coming from the categorisation that includes the ChIP-seq data, generate also the ones which are only dependent on ATAC-seq/RNA-seq data is useful; Also figured out why we didn't get an output sheet for H3K27me3 (missing "3" in the code);
Rerun GREAT and GO classify on all lists generated from the categorisation which doesn't include/includes ChIP-seq data (minimum 10 genes/list as a threshold); also limit GO sets to 1000 genes instead of using GO levels; (In GO classify heatmap - the numbers refer to the absolute no of genes overlapping with the GO set, the color refers to the proportion of genes out of the whole GO set) (In GREAT analysis use all differentially accessible regions as background)
The 2,5kb from TSS we chose to define Proximal regions are both up/downstream from TSS; However there is an annotation issue in ATAC-seq data analysis and in the data integration: "Downstream" does not reflect the distance to TSS - most probably a package issue;
TF motif enrichment should be run on all the lists from both categorisations with and without ChiP-seq data.
@dktanwar @plger: does this make sense to you, cause in my head it does (biologically the split seems justified and I don't feel I am biasing anything)
GREAT analysis:
Also: I feel the same split should be used also to run TF motif analysis...any thoughts on this?
Done, @Irinalazar !
Morning! @dktanwar @plger Had a look at GREAT and HOMER, here are my comments:
Comments on GREAT analysis:
The LogFC threshold of 1 for overlapping differential gene expression is too high for running GREAT/ any enrichment analysis in my opinion: I would suggest 0.25 (as seen in similar publications from Cell on spermatogonia) and re-run GREAT on these lists. For example: with LogFC of at least 1 you loose 50 genes in a list, with 0.25 you loose 5…
Was any gene expression threshold used also for the distal regions? I don't think we should trim out regions based on RNA-seq FDR/LogFC for the distal regions...I would include all regions, regardless of the closest gene expression (that gene is anyways tens of kb away from the region so we cannot assume with high confidence that that's the only gene impacted by it, so why exclude other regions for which the closest gene identified doesn't have a statistically different expression between PND15 and adults). The thing we are most interested there is if for those 100 regions which overlap with ChIPseq data there is something coming out of GREAT...
How does the choice of background impact the analysis: differentially accessible regions vs all regions identified by ATAC-seq that pass the quality threshold? (So basically 3000 something vs 100k something). I know using a different background answers a different question, should we try to run both?
Maybe it does make sense to also run a GO enrichment parallel with great and compare results (For proximal regions only, where we have a gene TSS close by)? Choose background for GO enrichment - All differentially expressed genes vs all expressed genes
Comments on HOMER analysis
I think the same threshold of LogFC at least 1 has been used, again I think it’s just too high…I also have the feeling a gene expression threshold has also been used for selecting distal regions, which I wouldn’t really use, given these regions could potentially act on more than that 1 gene which is mapped as closest (but still tens of kb away from the region). I would rerun the HOMER analysis as well, with lower LogFC threshold for proximal regions and no RNAseq threshold for distal regions (same suggestion as for GREAT)
The choice of background: would run it with a different background as well, the total no of identified peaks in the ATACseq, not only the differentially expressed ones as we ran it here
Why all these comments: because at the moment, the way we ran these 2 analyses doesn't give me much to work with...hence I would like to explore alternatives :)
I don't know if it's possible to re-run these 2 until tomorrow, I could have a look for sure if you do @dktanwar , otherwise we can discuss these suggestions in our meeting before re-runing GREAT and HOMER
I would first discuss it and then run the analysis. Also because the analysis will not be completed by tomorrow. If all agree, I can run over the weekend!
@dktanwar
[x] Re-run GREAT on lists with/without ChIP without gene expression LogFC and q-Value thresholds using differentially accessible regions as background
[x] Run GREAT on lists with/without ChIP without gene expression LogFC and q-value thresholds using all regions as background
[x] Re-run HOMER on lists with/without ChIP without gene expression LogFC and q-Value thresholds using differentially accessible regions as background
[x] Run HOMER on lists with/without ChIP without gene expression LogFC and q-value thresholds using all regions as background
[x] Remove small lists (less than 20 entries) from GO classify
Thank you!
@Irinalazar
Everything is done!
goClassify
and GREAT
results are available. HOMER
results will be available by Monday/ Tuesday (currently running).
You should look for the results in the same directories (mentioned before). For a description of folders, see the issue https://github.com/mansuylab/SC_controls/issues/28 (I updated it)
Data overlap
Generate a
data overlap matrix
. Overlapping to theATAC-Seq differential accessible regions
[x] ChIP-Seq:
TRUE
orFALSE
of there is an overlap between ATAC-Seq and ChIP-Seq peaks[x] BS-Seq: Mean of overlapping CpGs
[x] RNA-Seq: Genes expression, logFC and qval for Genes around +/- 5kb TSS of peaks
Sub-divide regions
[x] Sub-divide regions as per https://github.com/mansuylab/SC_controls/issues/28#issuecomment-636073293
[x] Make heatmap of each division
GO classify analysis
GO analysis using
GREAT
lfc
> 1 &qval
< 0.05)With background as differential accessible regions
With background as all tested regions
With background as differential accessible regions
With background as all tested regions
TF analysis using
Homer
lfc
> 1 &qval
< 0.05)With background as differential accessible regions
With background as all tested regions
With background as differential accessible regions
With background as all tested regions