mansuylab / SC_postnatal_adult

GNU General Public License v3.0

Data overlap matrix #28

Open dktanwar opened 4 years ago

dktanwar commented 4 years ago

Data overlap

Generate a data overlap matrix, overlapping the other datasets with the ATAC-seq differentially accessible regions

Sub-divide regions

GO classify analysis

GO analysis using GREAT

With background as differential accessible regions

With background as all tested regions

With background as differential accessible regions

With background as all tested regions

TF analysis using Homer

With background as differential accessible regions

With background as all tested regions

With background as differential accessible regions

With background as all tested regions

plger commented 4 years ago

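Code for the heatmap:

library(pheatmap)

# Load the overlap matrix
e <- readRDS("~/bioinfo/mansuy/irina/overlap_matrix.rds")
# Promoter score: 1 if within 2.5 kb of a TSS, 0.5 if within 5 kb, 0 otherwise
e$isProm <- (abs(e$distanceToTSS)<5000)/2+(abs(e$distanceToTSS)<2500)/2

# Columns to show in the heatmap, including every ChIP column
fields <- c("isProm","diffAccessibility-logFC", "RNA_PND8_vs_PND15_logFC",
            "RNA_PND15_vs_Adult_logFC","BS_PND7_meth","BS_PND14_meth","BS_PNW8_meth",
            grep("ChIP", colnames(e), value=TRUE)
            )
e2 <- e[,fields]
# Coerce every column (including logical ones) to numeric and bind into a matrix
e2 <- do.call(cbind, lapply(e2,as.numeric))

# Colour breaks and palette for the heatmap
b <- SEtools::getBreaks(e2, split.prop = 0.96, 100)
cols <- colorRampPalette(c("blue","black","yellow"))(101)
# Reorder rows (z = FALSE) before plotting without clustering
e2 <- SEtools::sortRows(e2, z = FALSE)

# Split on the promoter score in column 1 (isProm)
prom <- e2[e2[,1]>0,]
distal <- e2[e2[,1]==0,]

pheatmap(prom, color = cols, breaks = b, cluster_cols=FALSE, cluster_rows = FALSE)

Irinalazar commented 4 years ago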

Overlap structure in Irina's understanding, please feel free to correct me/add things:

1st level division of ATAC-seq regions:

2nd level division: For the active/inactive promoter groups further categorize based on the direction of the gene expression (RNAseq) vs direction of chromatin accessibility (ATAC-seq) - take a min FC of 20% based on the RNA-seq data
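To make the proposed second level concrete, here is a minimal sketch of how the RNA-seq direction could be compared with the ATAC-seq direction. It assumes the logFC columns are log2 fold changes, so that a 20% minimum change corresponds to |logFC| >= log2(1.2). The toy table, the column names and the category labels are hypothetical and not taken from the actual pipeline.

```r
# Hypothetical toy table: one row per promoter region.
regions <- data.frame(
  atac_logFC = c(1.2, -0.8, 0.9, -1.1, 0.7),
  rna_logFC  = c(0.5, -0.4, -0.6, 0.05, NA)
)

# Assumption: logFC values are log2 fold changes, so a 20% change
# corresponds to an absolute logFC of at least log2(1.2).
min_lfc <- log2(1.2)

# Label each region by whether expression changes in the same direction
# as accessibility (labels are illustrative only).
classify_direction <- function(atac, rna, min_lfc) {
  rna_changed <- !is.na(rna) & abs(rna) >= min_lfc
  same_sign <- sign(atac) == sign(rna)
  ifelse(!rna_changed, "no_expression_change",
         ifelse(same_sign, "concordant", "discordant"))
}

regions$level2 <- classify_direction(regions$atac_logFC,
                                     regions$rna_logFC, min_lfc)
print(regions)
```

Regions without a matching gene (NA in the RNA column) fall into the "no_expression_change" group here; whether they should form their own group is a design choice to be agreed on.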

Am I missing any other category from the heatmap? @plger

3rd level division: Split regions further based on the presence of H3K27ac and presence of H3K4me3 (inactive enhancer), or the presence of H3K27ac and absence of H3K4me3 (active enhancer).

Question: do we do this 3rd level division only for distal regions? @plger

plger commented 4 years ago

hadn't we said >2.5kb from any TSS?
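As a small illustration of that cutoff, the sketch below labels regions as distal when they lie more than 2.5 kb from the TSS, in either direction. The distances are made up, and treating exactly 2500 bp as proximal is an assumption.

```r
# Hypothetical signed distances (bp) from each region to the nearest TSS;
# negative values are upstream.
distanceToTSS <- c(-3000, -2500, -100, 0, 1800, 2501, 40000)

# Assumption: distal means strictly more than 2.5 kb from any TSS,
# irrespective of direction.
max_prox_dist <- 2500
region_class <- ifelse(abs(distanceToTSS) > max_prox_dist, "distal", "proximal")

print(data.frame(distanceToTSS, region_class))
```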

I just realized that we have H3K4me3 data for only one of the timepoints, so it's unwise to use this as a level 1 classification. Suggestion:

Level 1:

Level 2:

Level 3:

Level 4:

plger commented 4 years ago

(now 36 sets instead of 32, again with many that will be empty)

Irinalazar commented 4 years ago

Sorry, dunno why I wrote 5kb, stuck to my brain and that was that...at level 3 you mean genes not regions right? Aside from that, yap, the division sounds good to me! Let's see what we get out of it :)

Irinalazar commented 4 years ago

@plger @dktanwar

I have 2 Issues which are unclear to me after starting to look through the output sheets:

  1. According to our 4-level classification, there should be 3 output sheets for level 4:

    • Regions that are H3K4me3
    • Regions that are H3K27ac and not H3K27me3 or H3K4me3
    • Regions that are H3K27me3 and not H3K27ac or H3K4me3 - I am missing this one in the output sheets I find on the server - why?
  2. When looking into the Distal reg_AccessibUp sheet I see a different number of regions marked by the histone marks than in the Distal reg_AccessibUp_H3K4me3 or the Distal reg_AccessibUp_H3K27ac sheets (I look at the TRUE/FALSE annotation of the histone mark columns) - why is this? Are these histone columns to be ignored in the Distal reg_AccessibUp sheet because the classification there is only at level 2 and not 4? If so, why do we have these histone mark columns with TRUE/FALSE in there?
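One possible reason for such a difference (not confirmed anywhere in this thread) is that the level-4 sets are mutually exclusive, while the TRUE/FALSE columns are plain per-mark annotations. The sketch below uses a made-up table to show how the two counts can diverge; the column names, the set labels and the "unassigned" group are hypothetical.

```r
# Hypothetical toy table standing in for one level-2 sheet.
regions <- data.frame(
  id       = paste0("r", 1:6),
  H3K4me3  = c(TRUE,  TRUE,  FALSE, FALSE, FALSE, FALSE),
  H3K27ac  = c(TRUE,  FALSE, TRUE,  TRUE,  FALSE, FALSE),
  H3K27me3 = c(FALSE, FALSE, FALSE, TRUE,  TRUE,  FALSE)
)

# Exclusive level-4 assignment following the three sets listed above:
# H3K4me3 first, then H3K27ac without the other marks, then H3K27me3
# without the other marks.
regions$level4 <- with(regions, ifelse(
  H3K4me3, "H3K4me3",
  ifelse(H3K27ac & !H3K27me3, "H3K27ac_only",
         ifelse(H3K27me3 & !H3K27ac, "H3K27me3_only", "unassigned"))
))

# Per-mark counts, as read from the TRUE/FALSE columns
print(colSums(regions[, c("H3K4me3", "H3K27ac", "H3K27me3")]))
# Sizes of the exclusive sets
print(table(regions$level4))
```

In this toy table three regions carry H3K27ac, but only one ends up in the exclusive H3K27ac set, because the others also carry H3K4me3 or H3K27me3.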

Irinalazar commented 4 years ago

@dktanwar @plger Here is my sum up from today's meeting:

  1. Output Excel sheets currently come only from the categorisation that includes the ChIP-seq data; generating the ones that depend only on ATAC-seq/RNA-seq data would also be useful. We also figured out why we didn't get an output sheet for H3K27me3 (missing "3" in the code);

  2. Rerun GREAT and GO classify on all lists generated from both categorisations, with and without ChIP-seq data (minimum 10 genes/list as a threshold); also limit GO sets to 1000 genes instead of using GO levels. (In the GO classify heatmap, the numbers refer to the absolute number of genes overlapping with the GO set, and the color refers to the proportion of genes out of the whole GO set.) (In the GREAT analysis, use all differentially accessible regions as background.)

  3. The 2.5 kb from the TSS that we chose to define proximal regions applies both upstream and downstream of the TSS. However, there is an annotation issue in the ATAC-seq data analysis and in the data integration: "Downstream" does not reflect the distance to the TSS, most probably a package issue;

  4. TF motif enrichment should be run on all the lists from both categorisations, with and without ChIP-seq data.
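The thresholds in point 2 can be sketched in base R as follows. This is only an illustration with made-up gene sets and gene lists, not the GO classify code itself. It assumes that "limit GO sets to 1000 genes" means keeping sets with at most 1000 genes, and it computes the two quantities shown in the heatmap: the absolute overlap and the proportion of the GO set covered.

```r
# Hypothetical GO sets and gene lists (all names and genes are made up).
go_sets <- list(
  "GO:A" = paste0("g", 1:40),
  "GO:B" = paste0("g", 30:60),
  "GO:C" = paste0("g", 1:1500)     # larger than the size limit
)
gene_lists <- list(
  prox_up   = paste0("g", 25:45),  # 21 genes
  prox_down = paste0("g", 50:70),  # 21 genes
  distal_up = paste0("g", 1:5)     # 5 genes, below the list threshold
)

max_set_size  <- 1000  # assumption: keep GO sets with at most 1000 genes
min_list_size <- 10    # minimum number of genes per list

go_kept    <- go_sets[lengths(go_sets) <= max_set_size]
lists_kept <- gene_lists[lengths(gene_lists) >= min_list_size]

# Absolute overlap: rows are GO sets, columns are gene lists.
n_overlap <- vapply(lists_kept, function(genes) {
  vapply(go_kept, function(set) length(intersect(genes, set)), integer(1))
}, integer(length(go_kept)))

# Proportion of each GO set covered by each list (rows divided by set size).
prop_overlap <- n_overlap / lengths(go_kept)

print(n_overlap)
print(round(prop_overlap, 2))
```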

Irinalazar commented 4 years ago

@dktanwar @plger: does this make sense to you, cause in my head it does (biologically the split seems justified and I don't feel I am biasing anything)

GREAT analysis:

  1. For "proximal active regions" lists:
  2. For "distal regions" lists:

Also: I feel the same split should be used also to run TF motif analysis...any thoughts on this?

dktanwar commented 4 years ago

Done, @Irinalazar !

Irinalazar commented 4 years ago

Morning! @dktanwar @plger Had a look at GREAT and HOMER, here are my comments:

Comments on GREAT analysis:

Comments on HOMER analysis

Why all these comments: because at the moment, the way we ran these 2 analyses doesn't give me much to work with...hence I would like to explore alternatives :)

I don't know if it's possible to re-run these 2 by tomorrow; I could have a look for sure if you do @dktanwar. Otherwise we can discuss these suggestions in our meeting before re-running GREAT and HOMER

dktanwar commented 4 years ago

I would first discuss it and then run the analysis. Also because the analysis will not be completed by tomorrow. If all agree, I can run over the weekend!

Irinalazar commented 4 years ago

@dktanwar

Thank you!

dktanwar commented 4 years ago

@Irinalazar

Everything is done!

goClassify and GREAT results are available. HOMER results will be available by Monday/ Tuesday (currently running).

You should look for the results in the same directories (mentioned before). For a description of folders, see the issue https://github.com/mansuylab/SC_controls/issues/28 (I updated it)