dktanwar commented 4 years ago

Data overlap

Generate a `data overlap matrix` by overlapping the other datasets with the `ATAC-Seq differential accessible regions`:

[x] ChIP-Seq: TRUE or FALSE, depending on whether there is an overlap between ATAC-Seq and ChIP-Seq peaks
[x] BS-Seq: Mean methylation of overlapping CpGs
[x] RNA-Seq: Gene expression, logFC and qval for genes with a TSS within +/- 5kb of the peaks
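
As a rough illustration of the first two columns described above, here is a minimal base-R sketch on toy data. All coordinates, column names and the helper functions `overlaps_any()` and `mean_meth()` are hypothetical and limited to a single chromosome; the real matrix was presumably built with a genomic-ranges package, which this sketch does not try to reproduce.

```r
# Toy differentially accessible ATAC-Seq regions (hypothetical, one chromosome)
atac <- data.frame(start = c(100, 500, 900), end = c(200, 600, 1000))
# Toy ChIP-Seq peaks and CpG methylation values (hypothetical)
chip <- data.frame(start = c(150, 950), end = c(180, 1100))
cpg  <- data.frame(pos = c(110, 120, 550, 2000), meth = c(0.2, 0.4, 0.9, 0.5))

# TRUE if a query interval overlaps at least one subject interval
overlaps_any <- function(q, s) {
  vapply(seq_len(nrow(q)), function(i) {
    any(s$start <= q$end[i] & s$end >= q$start[i])
  }, logical(1))
}

# Mean methylation of the CpGs falling inside each query interval (NA if none)
mean_meth <- function(q, cpg) {
  vapply(seq_len(nrow(q)), function(i) {
    hit <- cpg$pos >= q$start[i] & cpg$pos <= q$end[i]
    if (any(hit)) mean(cpg$meth[hit]) else NA_real_
  }, numeric(1))
}

# One row per ATAC-Seq region, one column per overlapped dataset
overlap_matrix <- data.frame(
  atac,
  ChIP    = overlaps_any(atac, chip),
  BS_meth = mean_meth(atac, cpg)
)
print(overlap_matrix)
```

Regions without any overlapping CpG get `NA` rather than 0, so that "no data" is not mistaken for "unmethylated".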

Sub-divide regions

[x] Sub-divide regions as per https://github.com/mansuylab/SC_controls/issues/28#issuecomment-636073293
[x] Make heatmap of each division

GO classify analysis

[x] GO classify analysis on each category of region

GO analysis using `GREAT`

no_ChIP: Splitting without ChIP data
with_ChIP: Splitting with ChIP data
_rna: Also splitting if gene expression is present from RNA-Seq
_rna_cutoff: Also splitting based on significance from RNA-Seq (lfc > 1 & qval < 0.05)
_split: Also splitting based on biotypes (intergenic, intron, etc.)
_bg_all: Using all tested regions as background (~150,000 regions)
_AnnoMerged: Introns and exons are analysed together, as are TSS and proximal 1kbp regions.

With background as differential accessible regions

[x] no_ChIP
[x] no_ChIP_rna
[x] no_ChIP_rna_cutoff
[x] no_ChIP_split
[x] no_ChIP_split_rna
[x] no_ChIP_split_rna_cutoff

With background as all tested regions

[x] no_ChIP_bg_all
[x] no_ChIP_rna_bg_all
[x] no_ChIP_rna_cutoff_bg_all
[x] no_ChIP_split_bg_all
[x] no_ChIP_split_rna_bg_all
[x] no_ChIP_split_rna_cutoff_bg_all

With background as differential accessible regions

[x] with_ChIP
[x] with_ChIP_rna
[x] with_ChIP_rna_cutoff
[x] with_ChIP_split
[x] with_ChIP_split_rna
[x] with_ChIP_split_rna_cutoff

With background as all tested regions

[x] with_ChIP_bg_all
[x] with_ChIP_rna_bg_all
[x] with_ChIP_rna_cutoff_bg_all
[x] with_ChIP_split_bg_all
[x] with_ChIP_split_rna_bg_all
[x] with_ChIP_split_rna_cutoff_bg_all

TF analysis using `Homer`

no_ChIP: Splitting without ChIP data
with_ChIP: Splitting with ChIP data
_rna: Also splitting if gene expression is present from RNA-Seq
_rna_cutoff: Also splitting based on significance from RNA-Seq (lfc > 1 & qval < 0.05)
_split: Also splitting based on biotypes (intergenic, intron, etc.)
_bg_all: Using all tested regions as background (~150,000 regions)
_AnnoMerged: Introns and exons are analysed together, as are TSS and proximal 1kbp regions.

With background as differential accessible regions

[x] no_ChIP
[x] no_ChIP_rna
[x] no_ChIP_rna_cutoff
[x] no_ChIP_split
[x] no_ChIP_split_rna
[x] no_ChIP_split_rna_cutoff

With background as all tested regions

[x] no_ChIP_bg_all
[x] no_ChIP_rna_bg_all
[x] no_ChIP_rna_cutoff_bg_all
[x] no_ChIP_split_bg_all
[x] no_ChIP_split_rna_bg_all
[x] no_ChIP_split_rna_cutoff_bg_all

With background as differential accessible regions

[x] with_ChIP
[x] with_ChIP_rna
[x] with_ChIP_rna_cutoff
[x] with_ChIP_split
[x] with_ChIP_split_rna
[x] with_ChIP_split_rna_cutoff

With background as all tested regions

[x] with_ChIP_bg_all
[x] with_ChIP_rna_bg_all
[x] with_ChIP_rna_cutoff_bg_all
[x] with_ChIP_split_bg_all
[x] with_ChIP_split_rna_bg_all
[x] with_ChIP_split_rna_cutoff_bg_all

plger commented 4 years ago

code for the heatmap:

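library(pheatmap)  # provides pheatmap(); SEtools is called via its namespace

e <- readRDS("~/bioinfo/mansuy/irina/overlap_matrix.rds")
# promoter score: 1 if within 2.5kb of a TSS, 0.5 if within 5kb, 0 otherwise
e$isProm <- (abs(e$distanceToTSS)<5000)/2+(abs(e$distanceToTSS)<2500)/2

# columns to plot: promoter score, accessibility and expression logFC,
# methylation per timepoint, and every ChIP overlap column
fields <- c("isProm","diffAccessibility-logFC", "RNA_PND8_vs_PND15_logFC",
            "RNA_PND15_vs_Adult_logFC","BS_PND7_meth","BS_PND14_meth","BS_PNW8_meth",
            grep("ChIP", colnames(e), value=TRUE)
            )
e2 <- e[,fields]
e2 <- do.call(cbind, lapply(e2,as.numeric))  # coerce all columns into one numeric matrix

# colour breaks and palette
b <- SEtools::getBreaks(e2, split.prop = 0.96, 100)
cols <- colorRampPalette(c("blue","black","yellow"))(101)
e2 <- SEtools::sortRows(e2, z = FALSE)

# split rows into promoter-proximal (isProm > 0) and distal (isProm == 0) regions
prom <- e2[e2[,1]>0,]
distal <- e2[e2[,1]==0,]

pheatmap(prom, color = cols, breaks = b, cluster_cols=FALSE, cluster_rows = FALSE)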

Irinalazar commented 4 years ago

Overlap structure in Irina's understanding, please feel free to correct me/add things:

1st level division of ATAC-seq regions:

Distal regions (more than 5kb away from TSS)
Active promoter (H3K4me3 +)
Inactive promoter (H3K4me3 -)

2nd level division: For the active/inactive promoter groups further categorize based on the direction of the gene expression (RNAseq) vs direction of chromatin accessibility (ATAC-seq) - take a min FC of 20% based on the RNA-seq data

Regions that become more accessible in adulthood and correlate with genes increased in expression
Regions that become more accessible in adulthood and for which the associated genes are not expressed/not changing expression - not sure which of the 2 the N/A stood for? @dktanwar
Regions that become more accessible in adulthood and correlate with genes decreased in expression
Regions that become less accessible in adulthood and correlate with genes decreased in expression

Am I missing any other category from the heatmap? @plger

3rd level division: Split regions further based on the presence of H3K27ac and presence of H3K4me3 (inactive enhancer)/H3K27ac and absence of H3K4me3 (active enhancer) Question: do we do this 3rd level division only for distal regions? @plger

plger commented 4 years ago

hadn't we said >2.5kb from any TSS?

I just realized that we have H3K4me3 data for only one of the timepoints, so it's unwise to use this as a level 1 classification. Suggestion:

Level 1:

Distal regions (hadn't we said >2.5kb from any TSS?)
Proximal active: Regions proximal (<2.5kb) to TSS of active genes (i.e. genes that pass filtering in the RNAseq data)
Proximal inactive: Regions proximal (<2.5kb) to TSS of inactive genes

Level 2:

accUp: Regions that become more accessible in adulthood
accDown: Regions that become less accessible in adulthood

Level 3:

RNAup: Regions that increase in expression in adulthood
RNAdown: Regions that decrease in expression in adulthood

Level 4:

Regions that are H3K4me3
Regions that are H3K27ac and not H3K27me or H3K4me3
Regions that are H3K27me and not H3K27ac or H3K4me3

plger commented 4 years ago

(now 36 sets instead of 32, again with many that will be empty)
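
A minimal sketch of how the four levels could be assigned in base R, and where the count of 36 comes from (3 x 2 x 2 x 3). The column names (`geneActive`, `accLogFC`, `rnaLogFC`, the histone mark columns), the function `classify_regions()` and the sign convention (positive logFC = higher in adults) are assumptions for illustration, not the names used in the actual overlap matrix.

```r
# Assign the four classification levels to each region.
# Assumes: distanceToTSS in bp, logical geneActive and histone mark columns,
# and positive logFC meaning "higher in adulthood" (all hypothetical).
classify_regions <- function(d, proximal_bp = 2500) {
  proximal <- abs(d$distanceToTSS) < proximal_bp
  level1 <- ifelse(!proximal, "distal",
            ifelse(d$geneActive, "proximalActive", "proximalInactive"))
  level2 <- ifelse(d$accLogFC > 0, "accUp", "accDown")
  level3 <- ifelse(d$rnaLogFC > 0, "RNAup", "RNAdown")
  # H3K4me3 takes priority; the other two marks must be mutually exclusive
  level4 <- ifelse(d$H3K4me3, "H3K4me3",
            ifelse(d$H3K27ac & !d$H3K27me3, "H3K27ac",
            ifelse(d$H3K27me3 & !d$H3K27ac, "H3K27me3", NA)))
  data.frame(level1, level2, level3, level4, stringsAsFactors = FALSE)
}

# Toy regions (hypothetical values)
d <- data.frame(
  distanceToTSS = c(-1000, 3000, 200),
  geneActive    = c(TRUE, TRUE, FALSE),
  accLogFC      = c(1.2, -0.8, 0.5),
  rnaLogFC      = c(0.6, -0.3, -1.1),
  H3K4me3       = c(TRUE, FALSE, FALSE),
  H3K27ac       = c(TRUE, TRUE, FALSE),
  H3K27me3      = c(FALSE, FALSE, TRUE)
)
classes <- classify_regions(d)
print(classes)

# All possible combinations of the four levels
all_sets <- expand.grid(
  level1 = c("distal", "proximalActive", "proximalInactive"),
  level2 = c("accUp", "accDown"),
  level3 = c("RNAup", "RNAdown"),
  level4 = c("H3K4me3", "H3K27ac", "H3K27me3")
)
nrow(all_sets)  # 3 * 2 * 2 * 3 = 36
```

Regions carrying none of the three mark patterns end up with `NA` at level 4 in this sketch; how such regions should be handled is not specified in the thread.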

Irinalazar commented 4 years ago

Sorry, dunno why I wrote 5kb, stuck to my brain and that was that...at level 3 you mean genes not regions right? Aside from that, yap, the division sounds good to me! Let's see what we get out of it :)

Irinalazar commented 4 years ago

@plger @dktanwar

I have 2 Issues which are unclear to me after starting to look through the output sheets:

According to our 4-level classification, there should be 3 output sheets for level 4:
- Regions that are H3K4me3
- Regions that are H3K27ac and not H3K27me or H3K4me3
- Regions that are H3K27me3 and not H3K27ac or H3K4me3 - I am missing this one in the output sheets I find on the server - why?
When looking into the Distal reg_AccessibUp sheet, I see a different number of regions marked by the histone marks than in the Distal reg_AccessibUp_H3K4me3 or Distal reg_AccessibUp_H3K27ac sheets (I am looking at the TRUE/FALSE annotation in the histone mark columns) - why is this? Are these histone columns to be ignored in the Distal reg_AccessibUp sheet because the classification there is only at level 2 and not 4? If so, why do we have these histone mark columns with TRUE/FALSE in there?

Irinalazar commented 4 years ago

@dktanwar @plger Here is my sum up from today's meeting:

Output Excel sheets currently come only from the categorisation that includes the ChIP-seq data; it would be useful to also generate the ones that depend only on ATAC-seq/RNA-seq data. We also figured out why we didn't get an output sheet for H3K27me3 (missing "3" in the code);
Rerun GREAT and GO classify on all lists generated from the categorisation which doesn't include/includes ChIP-seq data (minimum 10 genes/list as a threshold); also limit GO sets to 1000 genes instead of using GO levels; (In GO classify heatmap - the numbers refer to the absolute no of genes overlapping with the GO set, the color refers to the proportion of genes out of the whole GO set) (In GREAT analysis use all differentially accessible regions as background)
The 2.5kb from TSS that we chose to define proximal regions applies both up- and downstream of the TSS. However, there is an annotation issue in the ATAC-seq data analysis and in the data integration: "Downstream" does not reflect the distance to TSS - most probably a package issue;
TF motif enrichment should be run on all the lists from both categorisations, with and without ChIP-seq data.

Irinalazar commented 4 years ago

@dktanwar @plger: does this make sense to you, cause in my head it does (biologically the split seems justified and I don't feel I am biasing anything)

GREAT analysis:

For "proximal active regions" lists:

[x] Run GREAT on the regions for which gene expression also is significantly altered (FDR≤0.05) - so basically a subset of the list (most of the genes are actually significantly altered)

For "distal regions" lists:

[x] Split the list in "distal intergenic" vs others (exons, introns, downstream, 3'/5'-UTR) and then run GREAT (regardless of any gene FDR) - I want to see if we get any GO terms specifically for regions which change in accessibility in distal intergenic positions, rather than for the ones which map to gene bodies
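
The two splits above could look roughly like this in base R. The data frame and its column names (`class`, `annotation`, `rna_qval`) and the annotation labels are made up for the example; only the FDR cutoff of 0.05 and the "distal intergenic" vs others split come from the list above.

```r
# Toy region table (hypothetical columns and values)
regions <- data.frame(
  class      = c("proximalActive", "proximalActive", "distal", "distal", "distal"),
  annotation = c("Promoter", "Promoter", "Distal Intergenic", "Intron", "Exon"),
  rna_qval   = c(0.01, 0.20, NA, 0.03, 0.50),
  stringsAsFactors = FALSE
)

# Proximal active regions: keep only those whose gene is significantly altered
proximal_sig <- subset(regions, class == "proximalActive" & rna_qval <= 0.05)

# Distal regions: split into distal intergenic vs everything else,
# ignoring the RNA-Seq FDR entirely
distal <- subset(regions, class == "distal")
distal_split <- split(
  distal,
  ifelse(distal$annotation == "Distal Intergenic",
         "distal_intergenic", "gene_body_other")
)

print(proximal_sig)
print(lapply(distal_split, nrow))
```

Each resulting subset would then be passed to GREAT as its own input list.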

Also: I feel the same split should be used also to run TF motif analysis...any thoughts on this?

dktanwar commented 4 years ago

Done, @Irinalazar !

Irinalazar commented 4 years ago

Morning! @dktanwar @plger Had a look at GREAT and HOMER, here are my comments:

Comments on GREAT analysis:

The LogFC threshold of 1 for overlapping differential gene expression is too high for running GREAT/ any enrichment analysis in my opinion: I would suggest 0.25 (as seen in similar publications from Cell on spermatogonia) and re-run GREAT on these lists. For example: with LogFC of at least 1 you lose 50 genes in a list, with 0.25 you lose 5…
Was any gene expression threshold used also for the distal regions? I don't think we should trim out regions based on RNA-seq FDR/LogFC for the distal regions...I would include all regions, regardless of the closest gene expression (that gene is anyways tens of kb away from the region so we cannot assume with high confidence that that's the only gene impacted by it, so why exclude other regions for which the closest gene identified doesn't have a statistically different expression between PND15 and adults). The thing we are most interested in there is whether, for those 100 regions which overlap with ChIPseq data, there is something coming out of GREAT...
How does the choice of background impact the analysis: differentially accessible regions vs all regions identified by ATAC-seq that pass the quality threshold? (So basically 3000 something vs 100k something). I know using a different background answers a different question, should we try to run both?
Maybe it does make sense to also run a GO enrichment parallel with great and compare results (For proximal regions only, where we have a gene TSS close by)? Choose background for GO enrichment - All differentially expressed genes vs all expressed genes
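
To make the background question concrete, here is a small sketch using a plain hypergeometric test as a stand-in for an enrichment test against a background set. It is not GREAT's actual implementation, the function `enrich_p()` is hypothetical, and every count below is invented purely to show the direction of the effect.

```r
# Upper-tail hypergeometric p-value: probability of seeing at least k
# annotated regions in a foreground of size n, given K annotated regions
# in a background of size N.
enrich_p <- function(k, n, K, N) {
  phyper(k - 1, K, N - K, n, lower.tail = FALSE)
}

# Same foreground in both cases (invented numbers):
# 100 regions, 20 of them associated with some GO term
k <- 20
n <- 100

# Background 1: differentially accessible regions only,
# where the term is assumed to be relatively common already
p_da <- enrich_p(k, n, K = 450, N = 3000)

# Background 2: all tested regions, where the term is assumed to be rarer
p_all <- enrich_p(k, n, K = 7500, N = 150000)

c(diff_accessible_bg = p_da, all_regions_bg = p_all)
```

With these invented counts the term looks far more enriched against all tested regions than against the differentially accessible ones, which is the sense in which the two backgrounds answer different questions: "what distinguishes this subset from other changing regions" vs "what distinguishes it from accessible chromatin in general".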

Comments on HOMER analysis

I think the same threshold of LogFC at least 1 has been used, again I think it’s just too high…I also have the feeling a gene expression threshold has also been used for selecting distal regions, which I wouldn’t really use, given these regions could potentially act on more than that 1 gene which is mapped as closest (but still tens of kb away from the region). I would rerun the HOMER analysis as well, with lower LogFC threshold for proximal regions and no RNAseq threshold for distal regions (same suggestion as for GREAT)
The choice of background: I would run it with a different background as well, i.e. the total number of peaks identified in the ATACseq, not only the differentially accessible ones as we ran it here

Why all these comments: because at the moment, the way we ran these 2 analyses doesn't give me much to work with...hence I would like to explore alternatives :)

I don't know if it's possible to re-run these 2 by tomorrow; I could have a look for sure if you do @dktanwar , otherwise we can discuss these suggestions in our meeting before re-running GREAT and HOMER

dktanwar commented 4 years ago

I would first discuss it and then run the analysis. Also because the analysis will not be completed by tomorrow. If all agree, I can run over the weekend!

Irinalazar commented 4 years ago

@dktanwar

[x] Re-run GREAT on lists with/without ChIP without gene expression LogFC and q-Value thresholds using differentially accessible regions as background
[x] Run GREAT on lists with/without ChIP without gene expression LogFC and q-value thresholds using all regions as background
[x] Re-run HOMER on lists with/without ChIP without gene expression LogFC and q-Value thresholds using differentially accessible regions as background
[x] Run HOMER on lists with/without ChIP without gene expression LogFC and q-value thresholds using all regions as background
[x] Remove small lists (less than 20 entries) from GO classify

Thank you!

dktanwar commented 4 years ago

@Irinalazar

Everything is done!

goClassify and GREAT results are available. HOMER results will be available by Monday/ Tuesday (currently running).

You should look for the results in the same directories (mentioned before). For a description of folders, see the issue https://github.com/mansuylab/SC_controls/issues/28 (I updated it)

mansuylab / SC_postnatal_adult

Data overlap matrix #28
