Closed twang15 closed 3 years ago
These packages can work together so you should go through both to find the best ways to use each package (there is some overlap between the two)
Variant dataset:
ATAC-seq dataset (@Annika Weimer can you tell us if there’s a preference between these two experiments):
04/29/21 - Comment
Tao - I forgot to mention I do think we should look into the differences in vcf2bed and bedops conversion of the vcf files to bed files. Can you look into the number of lines in each file (not including headers) and see if they add up to the original? For example ENCFF752OAX should have X entries, so do the number of lines in the insertions, deletions, and SNV subsetted files add up to X? And how does that compare to the file created by bedops? I want to make sure we aren't losing a significant number of SNP information during the conversion.
1. bedtools intersect -a Factor.bed -b footprint_variants.bed > factor_regions_with_footprints.bed bedtools intersect -a Factor.bed -b footprint_variants.bed -v > factor_regions_without_footprints
https://deeptools.readthedocs.io/en/develop/content/tools/computeMatrix.html
computeMatrix reference-point -S Factor.bigwig -R factor_regions_with_footprints.bed factor_regions_without_footprints.bed -a 3000 -b 3000 -o matrix.mat.gz
https://deeptools.readthedocs.io/en/develop/content/tools/plotProfile.html
plotProfile -m matrix.mat.gz
More thoughts about the best regions to compare to variant-containing-footprints. Look at literature.
We discussed looking at how TF motifs overlap SNPs and then plotting the difference in TF binding signal at motifs with and without SNPs. We also discussed looking at TF binding relative to SNP position, for SNPs that overlapped TF peaks.
However, I think our analysis should be repeated using consensus footprints (footprints derived from 243 cell lines, https://www.vierstra.org/resources/dgf#citation) so we can capture footprints that may have been missed using K562-specific footprints. If a SNP is blocking TF binding, then a footprint may not be detected in the K562 dnase-seq.
To investigate this we should do the following:
1) Overlap consensus footprints (https://resources.altius.org/~jvierstra/projects/footprinting.2020/consensus.index/collapsed_motifs_overlapping_consensus_footprints_hg38.bed.gz) with K562 SNPs and divide consensus footprints into 2 groups: 1-consensus footprints with SNPs and 2-consensus footprints without SNPs 2) Determine motif enrichment in consensus footprints with SNPs. a) For each group (consensus footprints with SNPs and consensus footprints without SNPs), calculate the ratio of consensus footprints with motif (annotated in column 4 of consensus footprint file) out of all of the footprints in that group. Then calculate the fold change for each motif. For example, if consensus footprints with SNPs has 10 footprints with KLF motif out of a total of 100 footprints the ratio would be 10/100. Then we'd calculate the ratio of KLF motifs in consensus footprints without SNPs (let's say 20/500). Then the fold change would be (10/100)/(20/500) = 2.5. Then you can plot the FC for all of the motifs and we can determine what motifs are enriched at consensus footprints with SNPs.
I would also like you to try this method, which also uses the consensus footprints but here they are merged if they overlap each other. I'm a little worried in the strategy above that there will be too many footprints without SNPs that it will be difficult to see a clear pattern. Here you would:
1) Overlap consensus footprints (https://resources.altius.org/~jvierstra/projects/footprinting.2020/consensus.index/consensus_footprints_and_collapsed_motifs_hg38.bed.gz) with K562 SNPs and divide consensus footprints into 2 groups: 1-consensus footprints with SNPs and 2-consensus footprints without SNPs 2) Determine motif enrichment in consensus footprints with SNPs. a) Using the first 3 columns of the consensus footprint file, you can use the hint motif matching for both groups of consensus footprints and hint enrichment as you did before.
We can then compare these results with the SNP2TFBS results and see what factors to focus on after this.
6/3/21
Tao showed his results from item 1 of 5/20/21 post (we decided item 2 was unnecessary). He calculated the log2FC and qvalues for enriched motif families in consensus footprints that contain SNPs VS consensus footprints without SNPs and generated a volcano plot.
Next steps are to examine the signal values of Factors within these enriched motif families. We will compare factor peaks that contain SNPs and factor peaks without SNPs and plot the signal value using deeptools computeMatrix + plotProfile.
Factors & Files to use for this analysis:
ZNF354 motif family: ZNF354B- bed: https://www.encodeproject.org/files/ENCFF712NHB/@@download/ENCFF712NHB.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF551YRA/@@download/ENCFF551YRA.bigWig TBX/1 motif family: TBX18- bed: https://www.encodeproject.org/files/ENCFF405EGZ/@@download/ENCFF405EGZ.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF274UBU/@@download/ENCFF274UBU.bigWig MGA- bed: https://www.encodeproject.org/files/ENCFF524ZER/@@download/ENCFF524ZER.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF806AVD/@@download/ENCFF806AVD.bigWig EWSR1/FLI1 motif family: EWSR1- bed: https://www.encodeproject.org/files/ENCFF924FYI/@@download/ENCFF924FYI.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF043WON/@@download/ENCFF043WON.bigWig
Control files: DLX4- bed:https://www.encodeproject.org/files/ENCFF832DBU/@@download/ENCFF832DBU.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF448BAC/@@download/ENCFF448BAC.bigWig CEBPB- bed: https://www.encodeproject.org/files/ENCFF712ZNR/@@download/ENCFF712ZNR.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF032ORM/@@download/ENCFF032ORM.bigWig RHOX2FB- bed: https://www.encodeproject.org/files/ENCFF727JXN/@@download/ENCFF727JXN.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF721HET/@@download/ENCFF721HET.bigWig ATF3- bed: https://www.encodeproject.org/files/ENCFF556SMM/@@download/ENCFF556SMM.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF232TBZ/@@download/ENCFF232TBZ.bigWig ZEB2- bed: https://www.encodeproject.org/files/ENCFF475CIS/@@download/ENCFF475CIS.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF538PMW/@@download/ENCFF538PMW.bigWig FOSL1- bed: https://www.encodeproject.org/files/ENCFF256NXT/@@download/ENCFF256NXT.bed.gz bigwig: https://www.encodeproject.org/files/ENCFF914GYG/@@download/ENCFF914GYG.bigWig
To do this we need to do the following (similar to what is explained in 5/7/21 notes):
Overlap (1) Factor peaks with (2) consensus footprints containing SNPs, outputting bed files that are factor peaks with footprints and factor peaks without footprints. bedtools intersect -a Factor.bed -b footprint_variants.bed > factor_regions_with_footprints.bed bedtools intersect -a Factor.bed -b footprint_variants.bed -v > factor_regions_without_footprints
Use computeMatrix to calculate signal value at these two output bed files (i.e. computeMatrix reference-point -S Factor.bigwig -R factor_regions_with_footprints.bed factor_regions_without_footprints-a 3000 -b 3000 -o matrix.mat.gz). Make sure to center on peak center instead of TSS
Plot signal profiles using plotProfile or plotHeatmap.
We expect to see a reduction in signal at peaks that overlap SNPs because we hypothesize these peaks may interrupt factor binding.
06/11/2021 (Week8) Shannon, Tao
- ZNF354B peaks containing snps: ZNF354B_regions_with_footprint.bed
- Figure out the precise genomic coordinates for the small peaks on ZNF354B, from ZNF354B_both_footprint.mat
- based on the output from step 2, we can check whether there are other motifs binding to that region. motif finder:
- traditional tools only look at the peak centers (local scanning)
- CentriMo: (global scanning), alternative way Shannon wants to try
- check whether ZNF354B has affinity to TFs found in step 3.
- Alternative: scan all the ChIP-seq data on the portal to see whether there are TFs co-bound.
Summary: TF signal plots show TF binding is reduced (ZNF354B and TBX18) or increased (MGA) in peaks containing SNPs. ZNF354B and TBX18 show new signals upstream or downstream, respectively, of peak centers, suggesting binding may be rerouted when SNPs are present. Next goal is to explore how binding of other factors differs at regions containing SNPs vs regions not containing SNPs for the 3 factors (ZNF354B, TBX18, and MGA). We will want to investigate 1) actual binding of other factors using ChIP-seq data, 2) motif enrichment to determine direct vs indirect binding, 3) gene regulation using RNA-seq expression data, and 4) 3D data to explore if chromatin interactions are impacted.
To start we discussed doing the following for ZNF354B data:
For exploring "new" upstream signal peak:
For exploring original peaks with and without SNPs:
Note for Tao: I'm having you use the MEME suite because it is much more vetted for motif analysis than the HINT motif matching. We should plan to use MEME suite from now on.
6/29/21 - Notes for Shannon and Tao
Next steps: 1) Assessing what footprint motifs are present in ZNF354B, TBX18, and MGA peaks. Correlate these results with overlap data (Shannon has generated). 2) Determine what nucleotide within motif SNPs are occurring. Is there a pattern among nucleotide (A,T,C,G), position of SNP in motif?, position of SNP within peak?
does the binding of each TF get increased or reduced (for peak regions with snp and peak regions w/o snp)? we can do a statistical test for all ~450 TFs.
Inputs: /home/smwhite4/Downloaded_files/K562/Tao_Data/TFs_intersect/
7/1/21 Meeting (Shannon + Tao):
To assess differences in direct binding, we need to overlap the consensus footprints with 1. the target factor peaks with SNPs and 2. the target factor peaks without SNPs. Can then look at enrichment of factor binding at peaks with SNPs over peaks without SNPs (baseline).
To assess differences in all binding (direct and indirect), Shannon already overlapped all ChIP-seq bed files with target factor bed files with or without SNPs (generated by Tao). These files are ready for enrichment analysis.
Notes for future directions:
What is ChIP-seq overlap data? Why does it capture both direct and indirect binding? Why does footprint only capture direct binding?
Goal: to assess differences in all binding (direct & indirect)
Contigency Table for all binding(direct & indirect):
w/ snp | w/o snp | |
---|---|---|
TF | ||
all |
7/22/21 Meeting Notes
Discussed using BPnet (CNN to predict TF motifs and binding patterns) to explore changes in combinatorially binding at sequences containing SNPs and other variants. Initial plans are to train the model on the 3 "reference" TFs identified by Tao with the proteins shown to bind significantly more to peaks containing SNPs. Tao is going to read the paper and look into their github to help us understand the program more and it's limitations (such as how many factors can we train on). We also discussed setting up a meeting with Anshul in the future.
Next step (before BPnet) would be to explore where these SNPs are occurring within the peaks. These are the following interesting questions:
TF peaks > footprint > motif > SNP
SNPs: week3/ENCFF752OAX_hg38_no_comp_allele_all_6_fields_sorted.bed
footprint (consensus, not K562-specific): week7/collapsed_motifs_overlapping_consensus_footprints_hg38.bed
Reference TFs (TF peaks):
We first intersect SNPs with footprint to get consensus_footprint_with_snp.bed and consensus_footprint_without_snp.bed
Then, intersect reference TF peaks with SNP-filtered footprint to get filtered reference TF peaks
Then,
consensus_footprint_with_snp.bed
a. use the first four columns as the key for consensus_footprint_with_snp.bed (and consensus_footprint_without_snp.bed) and keep the one with the highest match_score to remove redundancy to produce consensus_uniq_footprint_with_snp.bed (and consensus_uniq_footprint_without_snp.bed). (although the current way to count the unique motif cluster by awk '{print $4}' consensus_footprint_with_snp.bed | sort | uniq -c > motif_with_snp.csv is also correct)
b. get the total of unique footprint (denominator): unique number of footprint in consensus_footprint_with_snp.bed (and consensus_footprint_without_snp.bed), should use the first 3 columns as the key to deduplicate
c. do enrichment analysis (fisher test) as before
A new one is opened here https://github.com/twang15/K562-Analysis/issues/12
Bi-weekly meeting memo