twang15 / K562-Analysis


Bioinformatics: DNA footprinting #2

Closed twang15 closed 3 years ago

twang15 commented 3 years ago

Meeting notes, Feb 26, 2021

  1. DNA footprinting pipeline(shared by Shannon): http://www.regulatory-genomics.org/hint/introduction/
  2. DNA footprinting paper (shared by Shannon): https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1642-2
  3. One of the best ENCODE datasets for DNA footprinting (shared by Annika): https://www.encodeproject.org/experiments/ENCSR868FGK/
  4. ENCODE BED and BAM files (shared by Annika): https://www.encodeproject.org/chip-seq/transcription_factor/#outputs
twang15 commented 3 years ago
  1. PCR (Polymerase Chain Reaction)
  2. A nuclease is an enzyme that hydrolyzes nucleic acids (DNA and RNA). Nucleases are further classified into endonucleases and exonucleases.
  3. If you have a DNA-binding protein and you know it binds a particular DNA molecule, DNA footprinting can help you determine where on that molecule the protein binds. DNA footprinting was used to determine which sequences the general TFs recognize.
  4. DNase I is an endonuclease. It cleaves DNA randomly or semi-randomly everywhere except where a protein is bound, so the bound protein protects the sequence it covers from cleavage.
    • Add just enough DNase I to cut each DNA molecule about once on average.
  5. Gel electrophoresis
  6. DNA ladder: the position of each band in the ladder tells us the fragment lengths in bp.
    • We know the DNA molecule is N bp long because we added an N bp DNA molecule to the experiment.
    • Let's say we also know its sequence.
    • We just don't know where the protein binds.
    • If the footprint spans positions 150 to 170, then [150, 170] is the protein-binding site, and we can read the sequence of the binding site from the known sequence.
    • Caveats:
      • DNase I might not digest completely randomly in these assays. Solution: run a control experiment (the same DNA digested without the protein).
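The interval reasoning in item 6, together with the control experiment from the caveat, can be sketched in a few lines of Python. This is a toy illustration, not a real gel analysis: the per-position cleavage counts, the `find_footprints` helper, and the `ratio` and `min_len` thresholds are all assumptions made up for the example.

```python
def find_footprints(treated, control, min_len=5, ratio=0.2):
    """Return 1-based inclusive (start, end) runs where cleavage in the
    protein-bound sample falls below `ratio` times the naked-DNA control."""
    footprints = []
    run_start = None
    for i, (t, c) in enumerate(zip(treated, control)):
        protected = c > 0 and t < ratio * c
        if protected and run_start is None:
            run_start = i                      # a protected run begins here
        elif not protected and run_start is not None:
            if i - run_start >= min_len:       # keep only runs that are long enough
                footprints.append((run_start + 1, i))
            run_start = None
    # Close a run that extends to the end of the fragment.
    if run_start is not None and len(treated) - run_start >= min_len:
        footprints.append((run_start + 1, len(treated)))
    return footprints


# Toy data: a 200 bp fragment cut uniformly in the control,
# with positions 150-170 protected in the protein-bound sample.
control = [10] * 200
treated = [10] * 200
for pos in range(150, 171):                    # 1-based positions 150..170
    treated[pos - 1] = 0

sequence = "ACGT" * 50                         # hypothetical known sequence
footprints = find_footprints(treated, control)
start, end = footprints[0]
site = sequence[start - 1:end]                 # sequence of the protected region
print(footprints, site)
```

Dividing by the control is what handles the caveat: a position that DNase I rarely cuts even on naked DNA is not reported as protected, because its treated count is not low relative to its own control count.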
twang15 commented 3 years ago
  1. ChIP experiments cannot discriminate between different TF isoforms (protein isoforms).
  2. Compare peaks for the same protein in different tissues to see how its binding differs between them.
  3. If we do not know the specific DNA sequence the protein binds, we can look for a motif shared by all of the peaks (motif analysis).
  4. Determine the functional role of the protein by looking at where it binds relative to genes (promoter, enhancer, etc.).
  1. Protein isoforms – proteins that are similar to each other and perform similar roles within cells – have played an important role in the generation of biological diversity throughout evolution. In some cases a single gene can encode two or more isoforms through a process called alternative splicing. In other cases two or more closely related genes are responsible for the isoforms. Related terms: 5' and 3' ends of a DNA sequence, ligation, exon, splice motif, copy-number variation.
twang15 commented 3 years ago

Transcription factors (TFs) are proteins that bind to regulatory regions in the genome. TFs are regulatory proteins; the DNA regions they bind are regulatory elements.

twang15 commented 3 years ago

What is the key difference between DNase-seq and ChIP-seq?

twang15 commented 3 years ago

ChIP-seq identifies the location in the genome bound by proteins

Advantage:

Disadvantage:

DNase-Seq is one of the several approaches in molecular biology useful to identify DNA response elements, or regulatory regions in general, through genome-wide sequencing of regions sensitive to cleavage by DNase I.

A brief outline of the technique is the following:

Pros

Cons

twang15 commented 3 years ago

What is the difference between ChIP-seq and DNase-seq?

In ChIP-seq, you first isolate chromatin and then use an antibody to immunoprecipitate a specific factor in the chromatin, for example a histone mark or a transcription factor. The DNA that was bound to the factor is then sequenced, which reveals the genomic regions bound by the factor at the moment of chromatin isolation.

DNase-seq is used to find areas of open chromatin, which are accessible to DNase I digestion, without necessarily knowing which transcription factors or other proteins were bound there.

A newer method for profiling open chromatin is ATAC-seq (http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2688.html). It uses an engineered transposase that carries the sequencing adapters needed to form a library for next-generation sequencing and can insert them only into exposed DNA, in other words, areas of open chromatin. This method is faster, works with fewer cells, and seems to be less noisy than DNase-seq. We are using this method in my lab right now and having fun with it.

Of course, one can correlate ChIP-seq with DNase-seq or ATAC-seq, as certain histone marks correlate with actively transcribed or open chromatin.

twang15 commented 3 years ago

Chromatin: ChIP-Seq, DNase-Seq, FAIRE, ATAC-Seq, Nucleosome positioning


twang15 commented 3 years ago
  1. Assembled chromosomes for hg19 and hg38 are chromosomes 1–22 (chr1–chr22), X (chrX), Y (chrY), and mitochondrial (chrM).

  2. Position frequency matrices (PFMs), which describe transcription factor motifs.

  3. Motif logos: a graphical representation of TF binding affinity (i.e., of the PWMs).

  4. Hidden Markov Model

    • Within transcription factor binding sites, there is a specific grammar of DNase I cleavage and histone mark patterns.
    • A multivariate hidden Markov model (HMM) can model this regulatory grammar by analyzing DNase-seq and the ChIP-seq profiles of histone modifications simultaneously, genome-wide.
  5. Configuration of Genomic Data

  6. single nucleotide polymorphisms (SNPs)

  7. Read alignments (BAM files), genomic profiles (wig/bigWig files), and genomic regions (BED and VCF files). RGT includes classes for handling genome annotations, such as transcripts and genes, from standard formats (GTF files) and motif databases (TRANSFAC format).

  8. Transcription factors

    • Transcription factors are proteins that help turn specific genes "on" or "off" by binding to nearby DNA.
    • Transcription factors that are activators boost a gene's transcription. Repressors decrease transcription.
    • Groups of transcription factor binding sites called enhancers and silencers can turn a gene on/off in specific parts of the body.
    • Transcription factors allow cells to perform logic operations and combine different sources of information to "decide" whether to express a gene.
  9. DNA footprinting
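Items 2 and 3 above (position frequency matrices and motif logos) are linked by a short calculation, sketched here in Python. The toy counts and the helper names `pfm_to_ppm` and `information_content` are assumptions for illustration; the per-position information content (2 bits minus the Shannon entropy) is the usual basis for the column heights in a logo, and the small-sample correction some logo tools apply is left out.

```python
import math

BASES = "ACGT"


def pfm_to_ppm(pfm, pseudocount=0.5):
    """Convert per-position base counts into probabilities.
    `pfm` is a list of dicts such as {"A": 8, "C": 0, "G": 1, "T": 1}."""
    ppm = []
    for counts in pfm:
        total = sum(counts[b] + pseudocount for b in BASES)
        ppm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return ppm


def information_content(column):
    """Bits of information at one position: 2 - Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy


# Toy PFM with three positions: fully conserved, two-way split, near-uniform.
pfm = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 5, "G": 5, "T": 0},
    {"A": 3, "C": 2, "G": 3, "T": 2},
]
ppm = pfm_to_ppm(pfm, pseudocount=0.0)
ic = [information_content(col) for col in ppm]
print([round(x, 3) for x in ic])
```

A fully conserved position carries 2 bits and a near-uniform one carries almost none, which is why conserved positions appear as tall letters in a logo. Within a column, each letter's height is its probability times the column's information content.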

twang15 commented 3 years ago

HINT method description (2014 paper); low-level analysis: read alignment with Bowtie2 and peak calling with MACS2; transcriptome/exome; differential analysis; tag, tag count; footprint

twang15 commented 3 years ago

Preprocessing (output: an aligned, sorted, and indexed BAM file)

  1. Read alignment with Bowtie2 and peak calling with MACS2

Install the mm9 genome data (other reference genomes are installed the same way):

cd ~/rgtdata

python setupGenomicData.py --mm9

Call footprints for cDC1 and pDC cells (the output .bed file is the footprint result):

rgt-hint footprinting --atac-seq --paired-end --organism=mm9 --output-location=./ --output-prefix=cDC1 cDC1.bam cDC1_peaks.narrowPeak

rgt-hint footprinting --atac-seq --paired-end --organism=mm9 --output-location=./ --output-prefix=pDC pDC.bam pDC_peaks.narrowPeak

HINT also outputs signals for visualization in a genome browser. The output *.bigWig file contains the number of ATAC-seq reads at each genomic position, as estimated by HINT-ATAC after signal normalization and cleavage bias correction.

rgt-hint tracks --bc --bigWig --organism=mm9 cDC1.bam cDC1_peaks.narrowPeak --output-prefix=cDC1_BC

rgt-hint tracks --bc --bigWig --organism=mm9 pDC.bam pDC_peaks.narrowPeak --output-prefix=pDC_BC

Visualization in IGV

twang15 commented 3 years ago
  1. Heterochromatin and euchromatin are the two major categories of higher-order chromatin structure. Heterochromatin has a condensed chromatin structure and is inactive for transcription, while euchromatin has a loose chromatin structure and is active for transcription.
  2. Cytobands (also called karyotype bands, chromosome bands, or G banding) are old-fashioned data. They date back to before we had genome sequences, when chromosomes and chromosome regions were identified by staining metaphase chromosomes with Giemsa and viewing them under a microscope. Heterochromatic (closed chromatin, low gene density) regions stain darker than euchromatic (open chromatin, high gene density) regions. Bioinformatically, they're obsolete, as genome coordinates are far more accurate and meaningful. However, people still use them as a shorthand for identifying genomic regions; it's much quicker, easier, and more memorable to say 14q21.3 than it is to say 14:46695396-50395063. I wouldn't use them for looking at heterochromatin vs euchromatin these days either; I would instead look at actual gene density and publicly available DNase sensitivity data for open/closed chromatin.
  3. 5-prime UTR, 3-prime UTR, intron, exon
  4. samtools tutorial
    • samtools index sample.sorted.bam -o sample.sorted.bam.bai
  5. Alignment score and mapping quality: the alignment score measures how well the read sequence matches the reference sequence. Mapping quality is the confidence that the read is mapped to the correct genomic coordinates. For example, a read may map to several genomic locations with an almost perfect match at each one. In that case, the alignment score will be high but the mapping quality will be low.
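The mapping quality in item 5 has a precise definition in the SAM format: MAPQ = -10 · log10(P), where P is the probability that the mapping position is wrong, rounded to the nearest integer. The sketch below converts between the two. The helper names and the cap of 60 are assumptions (each aligner chooses its own maximum), and real aligners estimate P heuristically, so the multi-mapping example is an idealized case rather than what Bowtie2 would report.

```python
import math


def mapq_to_error_prob(mapq):
    """Probability that the reported mapping position is wrong."""
    return 10 ** (-mapq / 10)


def error_prob_to_mapq(p_wrong, cap=60):
    """Phred-scaled mapping quality; `cap` is a hypothetical aligner maximum."""
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


# A read that matches two locations equally well: whichever one is
# reported, the chance it is the wrong one is about 0.5.
print(error_prob_to_mapq(0.5))     # 3: high alignment score, low mapping quality
print(mapq_to_error_prob(20))      # 0.01: one in a hundred such reads is misplaced
```

This is why filtering on MAPQ (for example, keeping reads with MAPQ of 20 or more) removes ambiguously placed reads even when they align perfectly.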
twang15 commented 3 years ago

Output: bigWig and bed file

rgt-hint tracks --bc --bigWig --organism=mm9 cDC1.bam cDC1_peaks.narrowPeak --output-prefix=cDC1_BC

rgt-hint tracks --bc --bigWig --organism=mm9 pDC.bam pDC_peaks.narrowPeak --output-prefix=pDC_BC

Observation: Differential footprinting between cells

We observe that this gene has several open chromatin regions for these two cell types, but one particular region has cDC1 specific footprints.

Summary of Analysis Flow

  1. ATAC-seq: differential analysis of open chromatin regions
  2. Footprinting: identify protein-binding sites within open chromatin regions
  3. Motif matching: identify the motifs that match each footprint
  4. TF binding site prediction
  5. Use HINT to generate average ATAC-seq profiles around the binding sites of a particular TF. This analysis lets us inspect chromatin accessibility for each TF.
  6. Further experiment: ChIP-seq to validate TF binding?
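The idea behind step 5 (an average profile around the binding sites of one TF) can be sketched in plain Python. This is not HINT's implementation: the `average_profile` helper, the toy per-base signal, and the site positions are assumptions for illustration, and a real analysis would read the signal from the bias-corrected bigWig and the sites from the motif-matching output.

```python
def average_profile(signal, sites, flank=3):
    """Average per-base signal in a window of +/- `flank` bp around each site.
    `signal` is a list of per-base values; `sites` are 0-based centre positions."""
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for centre in sites:
        lo, hi = centre - flank, centre + flank + 1
        if lo < 0 or hi > len(signal):
            continue                      # skip windows that run off the region
        for k, value in enumerate(signal[lo:hi]):
            totals[k] += value
        used += 1
    if used == 0:
        return [0.0] * width
    return [t / used for t in totals]


# Toy cleavage signal with two bound sites (dips centred at positions 4 and 11).
signal = [5, 5, 5, 1, 0, 1, 5, 5, 5, 5, 1, 0, 1, 5, 5, 5]
sites = [4, 11]
profile = average_profile(signal, sites, flank=3)
print(profile)
```

A dip in the middle of the averaged profile, flanked by higher signal, is the footprint shape expected for a TF that protects its binding site from cleavage.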

Tao: footprinting sounds like a unique signature of a single cell on open chromatin regions. Would it be interesting to identify

One of the main applications of footprinting is to find TFs associated with a particular cellular condition

Approach: We can do this by first finding motifs overlapping with predicted footprints
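As a minimal sketch of that overlap step, assuming BED-style half-open coordinates: the motif hits, footprints, and TF names below are made-up toy data, and `motifs_in_footprints` is a hypothetical helper, not an RGT function.

```python
from collections import Counter


def overlaps(a_start, a_end, b_start, b_end):
    """Half-open intervals [start, end) overlap if each starts before the other ends."""
    return a_start < b_end and b_start < a_end


def motifs_in_footprints(motif_hits, footprints):
    """Keep motif hits (chrom, start, end, name) that overlap any footprint."""
    supported = []
    for chrom, start, end, name in motif_hits:
        for f_chrom, f_start, f_end in footprints:
            if chrom == f_chrom and overlaps(start, end, f_start, f_end):
                supported.append((chrom, start, end, name))
                break                     # one overlapping footprint is enough
    return supported


# Hypothetical motif predicted binding sites and predicted footprints.
motif_hits = [
    ("chr1", 100, 112, "TF_A"),
    ("chr1", 500, 510, "TF_B"),
    ("chr2", 105, 115, "TF_A"),
]
footprints = [("chr1", 95, 120), ("chr2", 300, 320)]

supported = motifs_in_footprints(motif_hits, footprints)
tf_counts = Counter(name for _, _, _, name in supported)
print(supported, tf_counts)
```

Comparing `tf_counts` between two conditions gives a first, crude view of which TFs have more footprint-supported sites in one cell type. The nested loop is quadratic, so for genome-scale data a sorted-interval or interval-tree approach would be the practical choice.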

Motif and Motif Score

  1. Sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function. Often they indicate sequence-specific binding sites for proteins such as nucleases and transcription factors (TFs). Others are involved in important processes at the RNA level, including ribosome binding, mRNA processing (splicing, editing, polyadenylation), and transcription termination.
  2. motif predicted binding sites (MPBS)
  3. TFBM: Transcription factor binding motifs (TFBMs) are genomic sequences that specifically bind to transcription factors. The consensus sequence of a TFBM is variable, and there are a number of possible bases at certain positions in the motif, whereas other positions have a fixed base.
  4. Motif score: because TFBMs are variable, each motif occurrence found in a genome is given a score between 0 and 1 indicating how well it matches the motif. The score is derived from the probability of each base occurring at each position in the motif.
  5. TF binding site variants
  6. Jittering to prevent overplotting in statistical graphics
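One common way to obtain a motif score between 0 and 1, as described in item 4, is to sum log-odds values from a position weight matrix and rescale between the worst and best possible sums. The sketch below assumes that scheme; the toy probability matrix, the uniform background of 0.25, the 0.8 threshold, and the function names are all assumptions for illustration, and specific tools may normalize differently.

```python
import math

BASES = "ACGT"
BACKGROUND = 0.25                         # assumed uniform base frequencies


def log_odds_matrix(ppm):
    """Turn per-position base probabilities into log2-odds weights."""
    return [{b: math.log2(col[b] / BACKGROUND) for b in BASES} for col in ppm]


def normalized_score(pwm, site):
    """Rescale the summed log-odds of `site` to the range [0, 1]."""
    raw = sum(col[base] for col, base in zip(pwm, site))
    best = sum(max(col.values()) for col in pwm)
    worst = sum(min(col.values()) for col in pwm)
    return (raw - worst) / (best - worst)


def scan(pwm, sequence, threshold=0.8):
    """Slide the motif along `sequence` and report windows scoring >= threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        score = normalized_score(pwm, window)
        if score >= threshold:
            hits.append((i, window, round(score, 3)))
    return hits


# Toy motif preferring A, C, G, then A or T (no zero probabilities, so log2 is safe).
ppm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
]
pwm = log_odds_matrix(ppm)
hits = scan(pwm, "TTACGTAA")
print(hits)
```

Only the window `ACGT` at offset 2 passes the threshold here, with the maximum score of 1.0. The sketch assumes the sequence contains only A, C, G, and T; ambiguous bases such as N would need separate handling.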