Bioinformatics: DNA footprinting

twang15 commented 3 years ago

Meeting notes, Feb 26, 2021

DNA footprinting pipeline (shared by Shannon): http://www.regulatory-genomics.org/hint/introduction/
DNA footprinting paper (shared by Shannon): https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1642-2
One of the best ENCODE datasets for DNA footprinting (shared by Annika): https://www.encodeproject.org/experiments/ENCSR868FGK/
ENCODE bed and bam file (shared by Annika): https://www.encodeproject.org/chip-seq/transcription_factor/#outputs

twang15 commented 3 years ago

PCR (Polymerase Chain Reaction)
A nuclease is an enzyme that hydrolyzes the phosphodiester bonds of nucleic acids (DNA and RNA). Nucleases are classified as endonucleases, which cut within a strand, and exonucleases, which remove nucleotides from the ends.
If you have a DNA-binding protein and you know it binds a particular DNA molecule, DNA footprinting can help you determine where on that molecule the protein binds. DNA footprinting was used, for example, to determine which sequences the general transcription factors (TFs) recognize.
DNase I is one type of endonuclease. It cleaves DNA randomly or semi-randomly everywhere except where a protein is bound, so the bound protein protects the sequence it covers.
- Add just enough DNase I that each DNA molecule is cut about once on average.
Gel electrophoresis
DNA ladder: the bands of the ladder have known lengths, so they tell us the lengths (in bp) of the fragments in the other lanes.
- We know the DNA molecule is N bp long because we added an N bp DNA molecule to the experiment.
- Let's say that we also know its sequence.
- We just don't know where the protein binds.
- If we see that the footprint (the gap in the band pattern) spans positions 150 to 170, then we know [150, 170] is the protein binding site, and we can also read off the sequence of the binding site.
- Caveats:
- - DNase I might not digest completely randomly in these assays. Solution: run a control experiment (the same DNA digested without the protein).
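The gel-reading logic above can be sketched in Python. This is a toy model, not a real assay analysis: it assumes an end-labeled molecule, exactly one cut per molecule, and idealized cutting at every unprotected position. The function names and the example numbers (N = 300, site 150 to 170) are illustrative assumptions.

```python
def fragment_lengths(n_bp, protected=None):
    """Labeled fragment lengths seen on the gel for an n_bp molecule.

    A single cut after position p leaves a labeled fragment of length p.
    protected is an optional inclusive (start, end) interval that the bound
    protein shields from DNase I; None models the no-protein control lane.
    """
    lengths = []
    for p in range(1, n_bp):
        if protected is not None and protected[0] <= p <= protected[1]:
            continue  # no cut here, so no band of this length
        lengths.append(p)
    return lengths


def find_footprint(sample_lengths, control_lengths):
    """Longest run of consecutive lengths present in the control but absent with protein."""
    sample = set(sample_lengths)
    missing = sorted(p for p in set(control_lengths) if p not in sample)
    best, run = [], []
    for p in missing:
        if run and p == run[-1] + 1:
            run.append(p)
        else:
            run = [p]
        if len(run) > len(best):
            best = list(run)
    return (best[0], best[-1]) if best else None


control = fragment_lengths(300)                # DNA only
with_protein = fragment_lengths(300, (150, 170))
footprint = find_footprint(with_protein, control)
print(footprint)
```

Comparing against the control lane is what handles the caveat: a band that is missing in both lanes reflects DNase I preference, not protein protection. With a known sequence `seq`, the binding site would be `seq[footprint[0] - 1:footprint[1]]`.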

twang15 commented 3 years ago

ChIP experiments cannot discriminate between different TF isoforms (Protein isoform)
- ChIP-seq introduction: ChIP-seq identifies the locations in the genome bound by a protein.
Compare peaks for the same protein in different tissues to see differential binding.
Or, if we don't know the specific DNA sequence the protein binds, we can look for a motif shared by the peaks (motif analysis).
Determine the functional role of the protein by looking at where it binds relative to genes (promoter, enhancer, etc.).

Protein isoforms (proteins that are similar to each other and perform similar roles within cells) have played an important role in the generation of biological diversity throughout evolution. In some cases a single gene can encode two or more isoforms by exploiting a process called alternative splicing. In other cases two or more closely related genes are responsible for the isoforms.

Related terms: 5' and 3' ends of a DNA sequence, ligation, exon, splice motif, copy-number variation.

twang15 commented 3 years ago

Transcription factors are proteins that bind to regulatory regions in the genome. TFs are regulatory proteins; the DNA regions they bind are the regulatory elements.

twang15 commented 3 years ago

What is the key difference between DNase-seq and ChIP-seq?

twang15 commented 3 years ago

ChIP-seq identifies the location in the genome bound by proteins

Advantage:

ChIP-Seq does not require prior knowledge of the binding sites
ChIP-Seq delivers genome-wide profiling with massively parallel sequencing, generating millions of counts across multiple samples for cost-effective, precise, unbiased investigation of epigenetic patterns
Captures DNA targets for transcription factors or histone modifications across the entire genome of any organism
Defines transcription factor binding sites
Reveals gene regulatory networks in combination with RNA sequencing and methylation analysis
Offers compatibility with various input DNA samples

Disadvantage:

Large-scale ChIP assays are challenging in intact model organisms, because antibodies have to be generated for each TF or, alternatively, transgenic model organisms expressing epitope-tagged TFs need to be produced
Researchers studying differential gene expression patterns in small organisms also face problems, because some genes are expressed at low levels, in a small number of cells, or in a narrow time window
ChIP experiments cannot discriminate between different TF isoforms (Protein isoform)

DNase-Seq is one of the several approaches in molecular biology useful to identify DNA response elements, or regulatory regions in general, through genome-wide sequencing of regions sensitive to cleavage by DNase I.

A brief outline of the technique is the following:

DNA-protein complexes are treated with DNase I;
DNA extraction and sequencing are performed;
Sequences bound by regulatory proteins are protected from DNase I digestion;
Deep sequencing is performed to provide accurate representation of location of regulatory proteins in the genome.

Pros

Can detect open chromatin
No prior knowledge of the sequence or binding protein is required
Compared to formaldehyde-assisted isolation of regulatory elements and sequencing (FAIRE-seq), has greater sensitivity at promoters

Cons

DNase I has sequence-specific cleavage bias, and hypersensitive sites might not account for the entire genome
DNA loss through the multiple purification steps limits sensitivity
Integration of DNase I with ChIP data is necessary to identify and differentiate similar protein-binding sites

twang15 commented 3 years ago

What is the difference between ChIP-seq and DNase-seq?

In ChIP-Seq, you first isolate chromatin but then you use an antibody to immunoprecipitate a specific factor in the chromatin, it could be a histone mark, or a transcription factor, for example. The DNA that was bound to the factor gets then sequenced and you can find out which genomic regions were bound by the factor at the moment of chromatin isolation.

DNase-Seq is used to find areas of open chromatin, which are accessible to DNase I digestion, without necessarily knowing what was bound to the open chromatin in terms of transcription factors, etc.

A newer, improved method to look at open chromatin is called ATAC-Seq (http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2688.html), which takes advantage of an engineered transposase that carries the sequencing adapters necessary to form a library for next-generation sequencing, and that can insert them only in exposed DNA, in other words, areas of open chromatin. This method is faster, you can start with fewer cells, and it seems to be less noisy than DNase-Seq. We are using this method in my lab right now and having fun with it.

Of course, one can correlate ChIPSeq with DNAse or ATACSeq, as there are certain histone marks that correlate with actively transcribed or open chromatin.

twang15 commented 3 years ago

Chromatin: ChIP-Seq, DNase-Seq, FAIRE, ATAC-Seq, Nucleosome positioning

twang15 commented 3 years ago

Assembled chromosomes for hg19 and hg38 are chromosomes 1–22 (chr1–chr22), X (chrX), Y (chrY) and Mitochondrial (chrM)
Position frequency matrices (PFMs), describing transcription factor motifs
Motif logos: a graphical representation of TF binding affinity (i.e., of the position weight matrices, PWMs)
Hidden Markov Model
- Within transcription factor binding sites, there is a specific grammar of DNase I cleavage and histone mark patterns
- a multivariate Hidden Markov Model (HMM) to model this regulatory grammar by simultaneous analysis of DNase-seq and the ChIP-seq profiles of histone modifications on a genome-wide level
Configuration of Genomic Data
single nucleotide polymorphisms (SNPs)
Read alignments (BAM files), genomic profiles (wig/bigWig files) and genomic regions (bed, vcf files); RGT includes classes for handling genome annotations, such as transcripts and genes from standard formats (gtf files), and motif databases (transfac format)
Transcription factors
- proteins that help turn specific genes "on" or "off" by binding to nearby DNA.
- Transcription factors that are activators boost a gene's transcription. Repressors decrease transcription.
- Groups of transcription factor binding sites called enhancers and silencers can turn a gene on/off in specific parts of the body.
- Transcription factors allow cells to perform logic operations and combine different sources of information to "decide" whether to express a gene.
DNA footprinting
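To make the Hidden Markov Model item above concrete, here is a minimal sketch of Viterbi decoding over a cleavage signal. It is a deliberately simplified, two-state, single-track toy: the states, the "high"/"low" discretization, and every probability below are assumptions chosen for illustration, not the multivariate model or the parameters used by HINT.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for a sequence of discrete observations (log-space)."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        current, pointers = {}, {}
        for s in states:
            # Best previous state to arrive from when entering state s.
            prev = max(states, key=lambda r: scores[-1][r] + math.log(trans_p[r][s]))
            current[s] = scores[-1][prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][o])
            pointers[s] = prev
        scores.append(current)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


states = ["open", "footprint"]
start_p = {"open": 0.5, "footprint": 0.5}
trans_p = {
    "open": {"open": 0.9, "footprint": 0.1},
    "footprint": {"open": 0.1, "footprint": 0.9},
}
# Open chromatin is cut often; a protein-bound footprint is cut rarely.
emit_p = {
    "open": {"high": 0.8, "low": 0.2},
    "footprint": {"high": 0.1, "low": 0.9},
}

signal = ["high"] * 3 + ["low"] * 4 + ["high"] * 3
path = viterbi(signal, states, start_p, trans_p, emit_p)
print(path)
```

A real footprinting HMM would use continuous, multi-track emissions (cleavage counts plus histone-mark signal) and more states; the decoding step is the same idea.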

twang15 commented 3 years ago

HINT method description, 2014 paper
low-level analysis: read alignment with Bowtie2 and peak calling with MACS2
transcriptome/exome differential analysis
tag, tag count
footprint

twang15 commented 3 years ago

Preprocessing: (output: aligned, sorted, and indexed bam file)

Read alignment with Bowtie2 and peak calling with MACS2

Install the mm9 dataset

// install other genome reference

cd ~/rgtdata
python setupGenomicData.py --mm9

Call footprints for cDC1 and pDC cells (the output *.bed file is the footprint result)

rgt-hint footprinting --atac-seq --paired-end --organism=mm9 --output-location=./ --output-prefix=cDC1 cDC1.bam cDC1_peaks.narrowPeak

rgt-hint footprinting --atac-seq --paired-end --organism=mm9 --output-location=./ --output-prefix=pDC pDC.bam pDC_peaks.narrowPeak

HINT also outputs signals for visualization in a genome browser (output: *.bigWig. This bigWig file contains the number of ATAC-seq reads at each genomic position as estimated by HINT-ATAC after signal normalization and cleavage bias correction)

rgt-hint tracks --bc --bigWig --organism=mm9 cDC1.bam cDC1_peaks.narrowPeak --output-prefix=cDC1_BC

rgt-hint tracks --bc --bigWig --organism=mm9 pDC.bam pDC_peaks.narrowPeak --output-prefix=pDC_BC

Visualization in IGV
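The per-sample commands above repeat the same pattern, so they can be generated from a list of sample names. The sketch below only builds and prints the command lines; the flags are copied from the commands in this note, while the helper names `footprinting_cmd` and `tracks_cmd` are hypothetical and the file naming (`<sample>.bam`, `<sample>_peaks.narrowPeak`) is assumed to follow the cDC1/pDC example.

```python
import shlex


def footprinting_cmd(sample, organism="mm9", out_dir="./"):
    """Argument list for `rgt-hint footprinting`, mirroring the command in this note."""
    return [
        "rgt-hint", "footprinting", "--atac-seq", "--paired-end",
        f"--organism={organism}",
        f"--output-location={out_dir}",
        f"--output-prefix={sample}",
        f"{sample}.bam", f"{sample}_peaks.narrowPeak",
    ]


def tracks_cmd(sample, organism="mm9"):
    """Argument list for `rgt-hint tracks` with bias correction and bigWig output."""
    return [
        "rgt-hint", "tracks", "--bc", "--bigWig",
        f"--organism={organism}",
        f"{sample}.bam", f"{sample}_peaks.narrowPeak",
        f"--output-prefix={sample}_BC",
    ]


commands = []
for sample in ("cDC1", "pDC"):
    commands.append(footprinting_cmd(sample))
    commands.append(tracks_cmd(sample))

for cmd in commands:
    # Printed for inspection; subprocess.run(cmd, check=True) would execute it.
    print(shlex.join(cmd))
```

Keeping each command as a list avoids shell-quoting problems if a sample name ever contains spaces.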

twang15 commented 3 years ago

Heterochromatin and euchromatin are the two major categories of higher-order chromatin structure. Heterochromatin has a condensed chromatin structure and is inactive for transcription, while euchromatin has a loose chromatin structure and is active for transcription.
Cytobands/karyotype bands/chromosome bands/G-banding are old-fashioned data. They date back to before we had genome sequences, when chromosomes and chromosome regions were identified by staining metaphase chromosomes with Giemsa and viewing them under a microscope. Heterochromatic (closed chromatin, low gene density) regions stain darker than euchromatic (open chromatin, high gene density) regions. Bioinformatically, they're obsolete, as genome coordinates are far more accurate and meaningful; however, people still use them as shorthand for identifying genomic regions: it's much quicker, easier and more memorable to say 14q21.3 than 14:46695396-50395063. I wouldn't use them for looking at heterochromatin vs euchromatin these days either; I would instead look at actual gene density and publicly available DNase sensitivity data for open/closed chromatin.
5' UTR (five-prime UTR), 3' UTR (three-prime UTR), intron, exon
samtools tutorial
- samtools index sample.sorted.bam -o sample.sorted.bam.bai
Alignment score and mapping quality: the alignment score measures how well the read sequence matches the reference sequence. Mapping quality is the confidence that the read is mapped to the correct genomic coordinates. For example, a read may match several genomic locations almost perfectly. In that case, the alignment score will be high but the mapping quality will be low.
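The relation between the two quantities can be made concrete. In the SAM format, mapping quality is defined as -10·log10 of the probability that the mapping position is wrong, rounded to the nearest integer. The sketch below turns per-location alignment likelihoods into that number; the likelihood values, the cap of 60, and the function names are illustrative assumptions, and real aligners use their own heuristics.

```python
import math


def placement_posteriors(likelihoods):
    """Normalize per-location alignment likelihoods into posterior probabilities."""
    total = sum(likelihoods)
    return [x / total for x in likelihoods]


def mapq_from_error(p_wrong, cap=60):
    """Phred-scaled mapping quality: -10 * log10(P(position is wrong)), capped."""
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


# One location fits far better than the runner-up: high alignment score, high MAPQ.
unique_read = placement_posteriors([1.0, 1e-6])
# Three equally good locations (a repeat): same alignment score, low MAPQ.
repeat_read = placement_posteriors([1.0, 1.0, 1.0])

unique_mapq = mapq_from_error(1 - max(unique_read))
repeat_mapq = mapq_from_error(1 - max(repeat_read))
print(unique_mapq, repeat_mapq)
```

Both reads align equally well at their best location; only the existence of competing locations separates them, which is why downstream filters are usually applied to mapping quality and not to alignment score.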

twang15 commented 3 years ago

Output: bigWig and bed file

rgt-hint tracks --bc --bigWig --organism=mm9 cDC1.bam cDC1_peaks.narrowPeak --output-prefix=cDC1_BC

rgt-hint tracks --bc --bigWig --organism=mm9 pDC.bam pDC_peaks.narrowPeak --output-prefix=pDC_BC

Observation: Differential footprinting between cells

We observe that this gene has several open chromatin regions for these two cell types, but one particular region has cDC1 specific footprints.

Summary of Analysis Flow

ATAC-seq: differential analysis of open chromatin regions
Footprinting: identify protein-binding sites within open chromatin regions
Motif matching: identify the motifs that overlap the footprints
TF binding site prediction
Use HINT to generate average ATAC-seq profiles around binding sites of particular TF. This analysis allows us to inspect the chromatin accessibility for each particular TF.
Further experiment: ChIP-seq to validate TF binding?
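The "average ATAC-seq profile around binding sites" step in the flow above amounts to stacking fixed-width windows centered on each site and averaging them column by column. The sketch below shows that idea on a plain list of per-base signal values; it is independent of HINT's own implementation, and the function name, the toy signal, and the site positions are assumptions.

```python
def average_profile(signal, centers, flank):
    """Mean signal in a window of +/- flank bases around each site center.

    signal  : per-base values for one chromosome (list of numbers)
    centers : 0-based positions of predicted binding-site centers
    flank   : number of bases to include on each side of the center
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for c in centers:
        lo, hi = c - flank, c + flank + 1
        if lo < 0 or hi > len(signal):
            continue  # skip sites too close to the chromosome ends
        for i, value in enumerate(signal[lo:hi]):
            totals[i] += value
        used += 1
    if used == 0:
        return []
    return [t / used for t in totals]


# Toy signal with a dip (protected base) at positions 2 and 7.
toy_signal = [4, 6, 1, 6, 4, 6, 8, 3, 8, 6]
profile = average_profile(toy_signal, centers=[2, 7], flank=2)
print(profile)
```

A dip at the center of the averaged profile, flanked by higher signal, is the pattern expected when the TF is bound at many of its predicted sites.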

Tao: footprinting sounds like a unique signature of a single cell on open chromatin regions. Would it be interesting to identify

One of the main applications of footprinting is to find TFs associated with a particular cellular condition

Approach: We can do this by first finding motifs overlapping with predicted footprints
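Finding motifs that overlap predicted footprints is an interval-intersection problem. A minimal sketch follows, assuming BED-style 0-based, half-open coordinates; the coordinates and the TF labels (`TF_A`, `TF_B`, `TF_C`) are made-up placeholders, not results from the cDC1/pDC data.

```python
from collections import Counter


def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def motifs_in_footprints(motif_sites, footprints):
    """Keep motif sites (chrom, start, end, tf) that overlap any footprint."""
    hits = []
    for chrom, start, end, tf in motif_sites:
        if any(overlaps((chrom, start, end), fp) for fp in footprints):
            hits.append((chrom, start, end, tf))
    return hits


footprints = [("chr1", 100, 120), ("chr1", 400, 430)]
motif_sites = [
    ("chr1", 105, 115, "TF_A"),  # inside the first footprint
    ("chr1", 118, 128, "TF_B"),  # partial overlap with the first footprint
    ("chr1", 300, 310, "TF_A"),  # no footprint here
    ("chr2", 100, 110, "TF_C"),  # same coordinates, different chromosome
]

hits = motifs_in_footprints(motif_sites, footprints)
tf_counts = Counter(tf for _, _, _, tf in hits)
print(tf_counts)
```

Running this once per condition and comparing the per-TF counts gives a first rough view of which TFs are associated with each condition. For genome-scale inputs, sorted intervals or an interval tree would replace the nested scan.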

Motif and Motif Score

Sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function. Often they indicate sequence-specific binding sites for proteins such as nucleases and transcription factors (TFs). Others are involved in important processes at the RNA level, including ribosome binding, mRNA processing (splicing, editing, polyadenylation) and transcription termination.
Motif-predicted binding sites (MPBS)
TFBM: Transcription factor binding motifs (TFBMs) are genomic sequences that specifically bind to transcription factors. The consensus sequence of a TFBM is variable: some positions in the motif tolerate several possible bases, whereas other positions have a fixed base.
Motif score: because TFBMs are variable, each motif occurrence found in a genome is given a score out of one, indicating how strong the match is. The score is derived from the probability of each base occurring at each position in the motif.
TF binding site variants
Jittering to prevent overplotting in statistical graphics
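The motif score described above can be illustrated with a small position frequency matrix. The sketch converts counts to log-odds weights and rescales a site's score to the range 0 to 1. The matrix, the pseudocount, the uniform background, and the min-max rescaling are all assumptions for illustration; motif-matching tools differ in how they normalize scores.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds weights."""
    pwm = []
    for column in pfm:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return pwm


def raw_score(pwm, site):
    """Sum of the weights of the observed base at each motif position."""
    return sum(column[base] for column, base in zip(pwm, site))


def relative_score(pwm, site):
    """Rescale the raw score so the worst site is 0 and the best site is 1."""
    lo = sum(min(column.values()) for column in pwm)
    hi = sum(max(column.values()) for column in pwm)
    return (raw_score(pwm, site) - lo) / (hi - lo)


def best_site(pwm, sequence):
    """Highest-scoring window in a sequence, as (relative score, 0-based start)."""
    width = len(pwm)
    windows = range(len(sequence) - width + 1)
    return max((relative_score(pwm, sequence[i:i + width]), i) for i in windows)


# Toy 4-bp motif: T, G, then A or C (variable position), then A.
toy_pfm = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 5, "C": 5, "G": 0, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]
toy_pwm = pfm_to_pwm(toy_pfm)
print(relative_score(toy_pwm, "TGAA"), relative_score(toy_pwm, "TGAC"))
print(best_site(toy_pwm, "CCTGCAGG"))
```

The third position shows the "variable base" idea from the TFBM definition: A and C score equally there, while a mismatch at a fixed position lowers the score.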

twang15 / K562-Analysis