twang15 / K562-Analysis

1 stars 1 forks source link

Meeting memos #5

Closed twang15 closed 3 years ago

twang15 commented 3 years ago

Bi-weekly meeting memo

  1. Files that Shannon will share to Tao
    • ATAC-seq datasets: fastq
    • variant dataset: vcf
    • packages: R Bioconductor, ChIPPeakAnno, ChipSeerker
  2. make the footprint for ATacseq datasets
  3. overlap the footprint with variant datasets, see if there are variants in footprint
    • SNP and Indel first
  4. Differential analysis: original footprint vs. variant-filtered footprint
twang15 commented 3 years ago

These packages can work together so you should go through both to find the best ways to use each package (there is some overlap between the two)

Variant dataset:

ATAC-seq dataset (@Annika Weimer can you tell us if there’s a preference between these two experiments):

twang15 commented 3 years ago

04/15/2021, Annika, Shannon, Tao

  1. separate different variants (SNP, Insertion, deletion, everything) for footprint overlapping
  2. Try both bedtools and ChIPAnnon, ChIPseeker
  3. iteratively refine the pipeline
smwhite7 commented 3 years ago

04/29/21 - Comment

Tao - I forgot to mention I do think we should look into the differences in vcf2bed and bedops conversion of the vcf files to bed files. Can you look into the number of lines in each file (not including headers) and see if they add up to the original? For example ENCFF752OAX should have X entries, so do the number of lines in the insertions, deletions, and SNV subsetted files add up to X? And how does that compare to the file created by bedops? I want to make sure we aren't losing a significant number of SNP information during the conversion.

smwhite7 commented 3 years ago

5/7/21, Shannon and Tao

1. bedtools intersect -a Factor.bed -b footprint_variants.bed > factor_regions_with_footprints.bed bedtools intersect -a Factor.bed -b footprint_variants.bed -v > factor_regions_without_footprints

https://deeptools.readthedocs.io/en/develop/content/tools/computeMatrix.html

computeMatrix reference-point -S Factor.bigwig -R factor_regions_with_footprints.bed factor_regions_without_footprints.bed -a 3000 -b 3000 -o matrix.mat.gz

https://deeptools.readthedocs.io/en/develop/content/tools/plotProfile.html

plotProfile -m matrix.mat.gz

  1. computeMatrix reference-point -S Factor.bigwig -R footprint_variants.bed Factor.bed -a 3000 -b 3000 -o matrix.mat.gz

More thoughts about the best regions to compare to variant-containing-footprints. Look at literature.

smwhite7 commented 3 years ago

5/20/21, Tao, Shannon, Annika, and Jay

We discussed looking at how TF motifs overlap SNPs and then plotting the difference in TF binding signal at motifs with and without SNPs. We also discussed looking at TF binding relative to SNP position, for SNPs that overlapped TF peaks.

However, I think our analysis should be repeated using consensus footprints (footprints derived from 243 cell lines, https://www.vierstra.org/resources/dgf#citation) so we can capture footprints that may have been missed using K562-specific footprints. If a SNP is blocking TF binding, then a footprint may not be detected in the K562 dnase-seq.

To investigate this we should do the following:

1) Overlap consensus footprints (https://resources.altius.org/~jvierstra/projects/footprinting.2020/consensus.index/collapsed_motifs_overlapping_consensus_footprints_hg38.bed.gz) with K562 SNPs and divide consensus footprints into 2 groups: 1-consensus footprints with SNPs and 2-consensus footprints without SNPs 2) Determine motif enrichment in consensus footprints with SNPs. a) For each group (consensus footprints with SNPs and consensus footprints without SNPs), calculate the ratio of consensus footprints with motif (annotated in column 4 of consensus footprint file) out of all of the footprints in that group. Then calculate the fold change for each motif. For example, if consensus footprints with SNPs has 10 footprints with KLF motif out of a total of 100 footprints the ratio would be 10/100. Then we'd calculate the ratio of KLF motifs in consensus footprints without SNPs (let's say 20/500). Then the fold change would be (10/100)/(20/500) = 2.5. Then you can plot the FC for all of the motifs and we can determine what motifs are enriched at consensus footprints with SNPs.

I would also like you to try this method, which also uses the consensus footprints but here they are merged if they overlap each other. I'm a little worried in the strategy above that there will be too many footprints without SNPs that it will be difficult to see a clear pattern. Here you would:

1) Overlap consensus footprints (https://resources.altius.org/~jvierstra/projects/footprinting.2020/consensus.index/consensus_footprints_and_collapsed_motifs_hg38.bed.gz) with K562 SNPs and divide consensus footprints into 2 groups: 1-consensus footprints with SNPs and 2-consensus footprints without SNPs 2) Determine motif enrichment in consensus footprints with SNPs. a) Using the first 3 columns of the consensus footprint file, you can use the hint motif matching for both groups of consensus footprints and hint enrichment as you did before.

We can then compare these results with the SNP2TFBS results and see what factors to focus on after this.

smwhite7 commented 3 years ago

6/3/21

Tao showed his results from item 1 of 5/20/21 post (we decided item 2 was unnecessary). He calculated the log2FC and qvalues for enriched motif families in consensus footprints that contain SNPs VS consensus footprints without SNPs and generated a volcano plot.

Next steps are to examine the signal values of Factors within these enriched motif families. We will compare factor peaks that contain SNPs and factor peaks without SNPs and plot the signal value using deeptools computeMatrix + plotProfile.

Factors & Files to use for this analysis:

ZNF354 motif family: ZNF354B- bed: https://www.encodeproject.org/files/ENCFF712NHB/@@download/ENCFF712NHB.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF551YRA/@@download/ENCFF551YRA.bigWig TBX/1 motif family: TBX18- bed: https://www.encodeproject.org/files/ENCFF405EGZ/@@download/ENCFF405EGZ.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF274UBU/@@download/ENCFF274UBU.bigWig MGA- bed: https://www.encodeproject.org/files/ENCFF524ZER/@@download/ENCFF524ZER.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF806AVD/@@download/ENCFF806AVD.bigWig EWSR1/FLI1 motif family: EWSR1- bed: https://www.encodeproject.org/files/ENCFF924FYI/@@download/ENCFF924FYI.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF043WON/@@download/ENCFF043WON.bigWig

Control files: DLX4- bed:https://www.encodeproject.org/files/ENCFF832DBU/@@download/ENCFF832DBU.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF448BAC/@@download/ENCFF448BAC.bigWig CEBPB- bed: https://www.encodeproject.org/files/ENCFF712ZNR/@@download/ENCFF712ZNR.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF032ORM/@@download/ENCFF032ORM.bigWig RHOX2FB- bed: https://www.encodeproject.org/files/ENCFF727JXN/@@download/ENCFF727JXN.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF721HET/@@download/ENCFF721HET.bigWig ATF3- bed: https://www.encodeproject.org/files/ENCFF556SMM/@@download/ENCFF556SMM.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF232TBZ/@@download/ENCFF232TBZ.bigWig ZEB2- bed: https://www.encodeproject.org/files/ENCFF475CIS/@@download/ENCFF475CIS.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF538PMW/@@download/ENCFF538PMW.bigWig FOSL1- bed: https://www.encodeproject.org/files/ENCFF256NXT/@@download/ENCFF256NXT.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF914GYG/@@download/ENCFF914GYG.bigWig

To do this we need to do the following (similar to what is explained in 5/7/21 notes):

  1. Overlap (1) Factor peaks with (2) consensus footprints containing SNPs, outputting bed files that are factor peaks with footprints and factor peaks without footprints. bedtools intersect -a Factor.bed -b footprint_variants.bed > factor_regions_with_footprints.bed bedtools intersect -a Factor.bed -b footprint_variants.bed -v > factor_regions_without_footprints

  2. Use computeMatrix to calculate signal value at these two output bed files (i.e. computeMatrix reference-point -S Factor.bigwig -R factor_regions_with_footprints.bed factor_regions_without_footprints-a 3000 -b 3000 -o matrix.mat.gz). Make sure to center on peak center instead of TSS

  3. Plot signal profiles using plotProfile or plotHeatmap.

We expect to see a reduction in signal at peaks that overlap SNPs because we hypothesize these peaks may interrupt factor binding.

twang15 commented 3 years ago

06/11/2021 (Week8) Shannon, Tao

  1. ZNF354B peaks containing snps: ZNF354B_regions_with_footprint.bed
  2. Figure out the precise genomic coordinates for the small peaks on ZNF354B, from ZNF354B_both_footprint.mat
  3. based on the output from step 2, we can check whether there are other motifs binding to that region. motif finder:
    • traditional tools only look at the peak centers (local scanning)
    • CentriMo: (global scanning), alternative way Shannon wants to try
  4. check whether ZNF354B has affinity to TFs found in step 3.
  5. Alternative: scan all the ChIP-seq data on the portal to see whether there are TFs co-bound.
smwhite7 commented 3 years ago

06/11/2021 (Week8) Shannon, Tao

  1. ZNF354B peaks containing snps: ZNF354B_regions_with_footprint.bed
  2. Figure out the precise genomic coordinates for the small peaks on ZNF354B, from ZNF354B_both_footprint.mat
  3. based on the output from step 2, we can check whether there are other motifs binding to that region. motif finder:
  • traditional tools only look at the peak centers (local scanning)
  • CentriMo: (global scanning), alternative way Shannon wants to try
  1. check whether ZNF354B has affinity to TFs found in step 3.
  2. Alternative: scan all the ChIP-seq data on the portal to see whether there are TFs co-bound.

Summary: TF signal plots show TF binding is reduced (ZNF354B and TBX18) or increased (MGA) in peaks containing SNPs. ZNF354B and TBX18 show new signals upstream or downstream, respectively, of peak centers, suggesting binding may be rerouted when SNPs are present. Next goal is to explore how binding of other factors differs at regions containing SNPs vs regions not containing SNPs for the 3 factors (ZNF354B, TBX18, and MGA). We will want to investigate 1) actual binding of other factors using ChIP-seq data, 2) motif enrichment to determine direct vs indirect binding, 3) gene regulation using RNA-seq expression data, and 4) 3D data to explore if chromatin interactions are impacted.

To start we discussed doing the following for ZNF354B data:

For exploring "new" upstream signal peak:

  1. Use CentriMo (from MEME suite, https://meme-suite.org/meme/tools/centrimo) to search for motif enrichment along ZNF354B peaks with SNPs. To do this will need to expand ZNF354B peaks by adding 1500 bp upstream and 1500bp downstream.
  2. Identify coordinates (chr:start-end) of new peak in ZNF354B peaks containing SNPs.
  3. Once have coordinates, send to me and I will overlap with all other factor data to see what binds those regions. Note for me- also should look at binding of "paired" original peaks that match new peaks.

For exploring original peaks with and without SNPs:

  1. Use STREME (from MEME suite, https://meme-suite.org/meme/doc/streme.html?man_type=web) to look at motif enrichment for ZNF354B peaks with SNPs and ZNF354B peaks without SNPs. STREME identifies sequences that are enriched but does not match with known TF motifs. To match with known motifs, take STREME results and run with Tomtom (https://meme-suite.org/meme/tools/tomtom) to match sequences with known motifs.

Note for Tao: I'm having you use the MEME suite because it is much more vetted for motif analysis than the HINT motif matching. We should plan to use MEME suite from now on.

smwhite7 commented 3 years ago

6/29/21 - Notes for Shannon and Tao

Next steps: 1) Assessing what footprint motifs are present in ZNF354B, TBX18, and MGA peaks. Correlate these results with overlap data (Shannon has generated). 2) Determine what nucleotide within motif SNPs are occurring. Is there a pattern among nucleotide (A,T,C,G), position of SNP in motif?, position of SNP within peak?

twang15 commented 3 years ago

does the binding of each TF get increased or reduced (for peak regions with snp and peak regions w/o snp)? we can do a statistical test for all ~450 TFs.

Inputs: /home/smwhite4/Downloaded_files/K562/Tao_Data/TFs_intersect/

smwhite7 commented 3 years ago

7/1/21 Meeting (Shannon + Tao):

  1. Decided to first compare direct (vis footprints) and all (via ChIP-seq overlap data) binding of all factors at "target" Factor peaks with or without SNPs. Tao will use similar enrichment method as he did for identifying motifs.

To assess differences in direct binding, we need to overlap the consensus footprints with 1. the target factor peaks with SNPs and 2. the target factor peaks without SNPs. Can then look at enrichment of factor binding at peaks with SNPs over peaks without SNPs (baseline).

To assess differences in all binding (direct and indirect), Shannon already overlapped all ChIP-seq bed files with target factor bed files with or without SNPs (generated by Tao). These files are ready for enrichment analysis.

Notes for future directions:

twang15 commented 3 years ago

07/15/2021, Shannon + Tao

What is ChIP-seq overlap data? Why does it capture both direct and indirect binding? Why does footprint only capture direct binding?

Goal: to assess differences in all binding (direct & indirect)

Contigency Table for all binding(direct & indirect):

  w/ snp w/o snp
TF    
all    
twang15 commented 3 years ago

Jay, Shannon and Tao, July 22, 2021

  1. Position of SNP, Position of footprint, Position of Transcription factor peak regions
    • is a SNP within a footprint?
  2. number of SNPs in TF peaks
  3. what are the motifs?
  4. BPnet:
smwhite7 commented 3 years ago

7/22/21 Meeting Notes

  1. Discussed using BPnet (CNN to predict TF motifs and binding patterns) to explore changes in combinatorially binding at sequences containing SNPs and other variants. Initial plans are to train the model on the 3 "reference" TFs identified by Tao with the proteins shown to bind significantly more to peaks containing SNPs. Tao is going to read the paper and look into their github to help us understand the program more and it's limitations (such as how many factors can we train on). We also discussed setting up a meeting with Anshul in the future.

  2. Next step (before BPnet) would be to explore where these SNPs are occurring within the peaks. These are the following interesting questions:

    • Do SNPs within "reference" TF peaks occur within footprints that match the "reference" TF?
    • If SNP-containing footprints within "reference" peaks do not match the "reference" TF, then what TF do they match?
    • Where are the footprints positioned within the peak? Is there any patterns such as always X bp from the summit (center), etc.?
    • For the SNPs within footprints in "reference" peaks, what is the Ref vs Alt nucleotide? In other words, what nucleotide is the bp changing to and is there a pattern?
twang15 commented 3 years ago

reference TFs: ZNF354B, TBX18, MGA

target TFs: more than 400 transcription factors

enriched target TFs: enrichment analysis with respect to reference TF filtered by SNPs

To study the cooperativity/interaction/co-binding/correlations between TFs

TF peaks > footprint > motif > SNP

twang15 commented 3 years ago

for SNP and footprint

consensus_footprint_with_snp.bed

TODOs

a. use the first four columns as the key for consensus_footprint_with_snp.bed (and consensus_footprint_without_snp.bed) and keep the one with the highest match_score to remove redundancy to produce consensus_uniq_footprint_with_snp.bed (and consensus_uniq_footprint_without_snp.bed). (although the current way to count the unique motif cluster by awk '{print $4}' consensus_footprint_with_snp.bed | sort | uniq -c > motif_with_snp.csv is also correct)

b. get the total of unique footprint (denominator): unique number of footprint in consensus_footprint_with_snp.bed (and consensus_footprint_without_snp.bed), should use the first 3 columns as the key to deduplicate

c. do enrichment analysis (fisher test) as before

  1. use the first three columns + best_model of consensus_footprint_with_snp.bed and keep the one with the highest match_score , and then do enrichment analysis
    • Shannon will update to confirm whether we need to do this.
twang15 commented 3 years ago

08/06/2021, Shannon, Tao, Jay, Mike

  1. Shannon got a new idea to try and we will continue next Tuesday after meeting Anshul.
  2. Prepare questions for meeting with Anshul
twang15 commented 3 years ago

A new one is opened here https://github.com/twang15/K562-Analysis/issues/12