Biology 1 - Githubissues

twang15 commented 3 years ago

Transcription factors:

Negative transcription factor
Positive transcription factor

Operon

RNA polymerase: an enzyme needed to start transcription
RNA polymerase needs a promoter: a sequence of DNA to which RNA polymerase can bind.
RNA polymerase must bind to the promoter DNA to make mRNA.

Operator: (also sequences of DNA)

a sequence of DNA to which a repressor can bind
Repressor (a kind of protein): when bound to the operator, it blocks RNA polymerase from binding to the DNA and making mRNA

Repressor <-> operator

There are also genes for producing the repressor. This gene also has its own promoter
Lactose (a sugar) can bind to the repressor and change its conformation, which prevents the repressor from binding to the operator.

RNA polymerase <-> promoter
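The operon logic above can be sketched as a tiny boolean model. This is only an illustration of these notes, not a biological simulation: the function names are made up here, and other regulators (for example, glucose-dependent activation) are ignored.

```python
def repressor_binds_operator(lactose_present: bool) -> bool:
    """The repressor sits on the operator only when no lactose is bound to it."""
    return not lactose_present


def transcription_occurs(polymerase_at_promoter: bool, lactose_present: bool) -> bool:
    """mRNA is made only if RNA polymerase is on the promoter and the operator is free."""
    operator_blocked = repressor_binds_operator(lactose_present)
    return polymerase_at_promoter and not operator_blocked


if __name__ == "__main__":
    for lactose in (False, True):
        result = transcription_occurs(True, lactose)
        print(f"lactose present: {lactose} -> transcription: {result}")
```

With RNA polymerase available, the model transcribes only when lactose is present, which matches the repressor description in the notes.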

twang15 commented 3 years ago

The TSS (transcription start site) is where RNA polymerase begins transcribing the DNA. It is also the beginning of the 5' UTR (untranslated region), assuming that the gene has one, which is typically the case for human genes.
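As a rough sketch of how a TSS coordinate follows from a gene annotation: for a forward-strand gene the TSS is the gene's start coordinate, and for a reverse-strand gene it is the end coordinate. The function name and the example coordinates below are hypothetical, and 1-based inclusive coordinates are assumed.

```python
def tss_position(start: int, end: int, strand: str) -> int:
    """Return the TSS of a gene given 1-based inclusive coordinates and a strand."""
    if strand == "+":
        return start  # transcription runs left-to-right from the start
    if strand == "-":
        return end  # transcription runs right-to-left from the end
    raise ValueError(f"unknown strand: {strand!r}")


if __name__ == "__main__":
    # Hypothetical gene coordinates, for illustration only.
    print(tss_position(1000, 5000, "+"))
    print(tss_position(1000, 5000, "-"))
```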

[Figure: chip_workflow_june2017_step5]

twang15 commented 3 years ago

ChIPseeker can be seen as a newer alternative to the ChIPpeakAnno workflow. It also offers additional functionality, especially for visualising ChIP profiles and comparing functional annotations.
- annotates ChIP peaks
- visualizes ChIP peak coverage over chromosomes
- visualizes profiles of peaks binding to TSS regions
- compares ChIP peak profiles and annotations
- evaluates significant overlap among ChIP-seq datasets

Coverage:

In high-throughput sequencing, coverage is the number of reads overlapping each base. In other words, it associates a number (the number of reads) to every base in the genome.
Coverage data is often stored in the Wig format or its modern successor, bigWig.
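The definition of coverage can be made concrete with a small sketch. It assumes reads are given as 0-based, half-open (start, end) pairs that fit inside the genome; the function name and the toy reads are illustrative, not taken from any library.

```python
def coverage(reads, genome_length):
    """Per-base read depth for 0-based, half-open (start, end) reads."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1

    # A running sum of the difference array gives the depth at each base.
    depth = []
    running = 0
    for delta in diff[:genome_length]:
        running += delta
        depth.append(running)
    return depth


if __name__ == "__main__":
    toy_reads = [(0, 4), (2, 6), (3, 8)]
    print(coverage(toy_reads, 10))
```

For the three toy reads, base 3 is covered by all of them, so its depth is 3, while the last two bases are covered by none.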

File Formats:

Wig: The wiggle (WIG) format is an older format for display of dense, continuous data such as GC percent, probability scores, and transcriptome data. Wiggle data elements must be equally sized.
bedGraph: The bedGraph format is also an older format used to display sparse data or data that contains elements of varying size.
bigWig: The bigWig format is the recommended format for almost all graphing track needs
Caution: For speed and efficiency, wiggle data is compressed and stored internally in 128 unique bins. This compression means that there is a minor loss of precision when data is exported from a wiggle track (i.e., with output format "data points" or "bed format" within the Table Browser). The bedGraph format should be used if it is important to retain exact data when exporting.
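To show why bedGraph suits elements of varying size, here is a minimal sketch that collapses a per-base depth list into bedGraph-style lines (chromosome, 0-based start, half-open end, value). The function name and the toy depth values are assumptions made for illustration, and zero-depth runs are simply skipped.

```python
from itertools import groupby


def to_bedgraph(chrom, depth):
    """Collapse per-base depth into bedGraph lines, one line per run of equal values."""
    lines = []
    position = 0
    for value, run in groupby(depth):
        length = len(list(run))
        if value != 0:  # sparse output: omit uncovered stretches
            lines.append(f"{chrom}\t{position}\t{position + length}\t{value}")
        position += length
    return lines


if __name__ == "__main__":
    toy_depth = [1, 1, 2, 3, 2, 2, 1, 1, 0, 0]
    for line in to_bedgraph("chr1", toy_depth):
        print(line)
```

Each output line covers a run of identical values, so the intervals differ in length, which a fixed-step wiggle track cannot express directly.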

twang15 commented 3 years ago

DNA strand

Sense strand, also called the coding strand, plus strand, or non-template strand
- The sense strand has the same sequence as the mRNA, except that T in DNA is replaced by U in RNA.
- The sense strand contains the codons (with T in place of U).
- tRNA carries the anticodon, which base-pairs with the matching codon on the mRNA.
- Each tRNA delivers an amino acid to the ribosome, where the amino acids are joined into a protein.
Antisense strand, also called the non-coding strand, minus strand, or template strand
- The antisense strand acts as the template for the synthesis of mRNA.
- The antisense strand is complementary to the sense strand and to the mRNA (where U replaces T).
- The antisense strand is complementary to the codons; strictly speaking, the term "anticodon" refers to the triplet on the tRNA.
- Codon: a 3-nucleotide mRNA sequence that specifies an amino acid. Anticodon: the complementary 3-nucleotide sequence on the tRNA.
More details
- DNA is double-stranded. By convention, for a reference chromosome, one whole strand is designated the "forward strand" and the other the "reverse strand". This designation is arbitrary. Sometimes the terms "plus strand" and "minus strand" are used instead.
- Visually (I'm not talking about the transcription machinery yet), you would typically read the sequence of a strand in the 5'-to-3' direction. For the forward strand, this means reading left-to-right, and for the reverse strand it means right-to-left.
- A gene can live on a DNA strand in one of two orientations. The gene is said to have a coding strand (also known as its sense strand), and a template strand (also known as its antisense strand). For roughly half of all genes, the coding strand corresponds to the chromosome's forward strand, and for the other half it corresponds to the reverse strand.
- The mRNA (and protein) sequence of a gene corresponds to the DNA sequence as read (again, visually) from the gene's coding strand. So the mRNA sequence always corresponds to the 5'-to-3' coding sequence of a gene.
- Now, the RNA polymerase machinery moves along the DNA in the 5'-to-3' orientation of the coding strand (e.g. left-to-right for a forward-strand gene). It reads the bases from the template strand (so it is reading in the 3'-to-5' direction from the point of view of the template strand), and builds the mRNA as it goes. This means that the mRNA matches the coding sequence of the gene, not the template sequence.
- Annotations such as Ensembl and UCSC are concerned with the coding sequences of genes, so when they say a gene is on the forward strand, it means the gene's coding sequence is on the forward strand. To follow through again, that means that during transcription of this forward-strand gene, the gene's template sequence is read from the reverse strand, producing an mRNA that matches the sequence on the forward strand.
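The relationship between coding strand, template strand, and mRNA can be checked with a short sketch. All sequences are written 5'-to-3', the helper names are made up here, and only the four unambiguous DNA letters are handled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, also written 5'-to-3'."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(template: str) -> str:
    """Build the mRNA (5'-to-3') from a template strand given 5'-to-3'."""
    coding = reverse_complement(template)  # the mRNA matches the coding strand
    return coding.replace("T", "U")


if __name__ == "__main__":
    coding_strand = "ATGGCC"  # toy sequence, for illustration only
    template_strand = reverse_complement(coding_strand)
    print("template:", template_strand)
    print("mRNA:    ", transcribe(template_strand))
```

The mRNA printed at the end equals the coding strand with U in place of T, even though it was built from the template strand.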

twang15 commented 3 years ago

5-prime (5') and 3-prime (3')

5-prime: A term that identifies one end of a single-stranded nucleic acid molecule. The 5' end is that end of the molecule which terminates in a 5' phosphate group. The 5' direction is the direction toward the 5' end. Nucleic acid sequences are written with the 5' end to the left and the 3' end to the right, in reference to the direction of DNA synthesis during replication (from 5' to 3'), RNA synthesis during transcription (from 5' to 3'), and the reading of mRNA sequence (from 5' to 3') during translation.
3-prime: A term that identifies one end of a single-stranded nucleic acid molecule. The 3' end is that end of the molecule which terminates in a 3' hydroxyl group. The 3' direction is the direction toward the 3' end. Nucleic acid sequences are written with the 5' end to the left and the 3' end to the right, in reference to the direction of DNA synthesis during replication (from 5' to 3'), RNA synthesis during transcription (from 5' to 3'), and the reading of mRNA sequence (from 5' to 3') during translation.

twang15 commented 3 years ago

Visualizing genes

R/Bioconductor packages:
- Gviz (works with GRanges objects)
- epivizr
Standalone tools:
- Website: UCSC Genome Browser
- IGV (web app or desktop GUI)

twang15 commented 3 years ago

ChIPseeker readPeakFile: https://rdrr.io/bioc/ChIPseeker/man/readPeakFile.html

Data structures

GRanges, GRangesList: https://biodatascience.github.io/compbio/bioc/GRL.html
IRanges, NormalIRanges
Rle, Views, RangedData

File format

bed, bigBed, narrowPeak
wig, bigWig, bedGraph

Visualization

covplot
- After peak calling, we would like to know where the peaks fall across the whole genome. The covplot function calculates the coverage of peak regions over chromosomes and generates a figure to visualize it.
- It accepts both a bed file and a GRangesList; a GRangesList can be used to compare the coverage of multiple bed files.

twang15 commented 3 years ago

IRanges operations

Finding overlaps: Finding (pairwise) overlaps between two IRanges
- as(ov, "matrix") converts the overlaps object (ov) returned by findOverlaps() into a matrix
- queryHits(), subjectHits() (often used with unique())
- countOverlaps(ir1, ir2)
- intersect
Finding nearest IRanges: nearest(), precede(), follow()
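To make the overlap operations less abstract, here is a plain-Python sketch of what finding and counting overlaps means. It is not the IRanges implementation: ranges are modelled as closed (start, end) integer pairs, hits are reported as 0-based index pairs (R reports 1-based indices), and the function names are only loosely modelled on the R ones.

```python
def find_overlaps(query, subject):
    """Return (query_index, subject_index) pairs for every overlapping pair."""
    hits = []
    for qi, (q_start, q_end) in enumerate(query):
        for si, (s_start, s_end) in enumerate(subject):
            # Closed intervals overlap when each starts no later than the other ends.
            if q_start <= s_end and s_start <= q_end:
                hits.append((qi, si))
    return hits


def count_overlaps(query, subject):
    """For each query range, count how many subject ranges it overlaps."""
    counts = [0] * len(query)
    for qi, _ in find_overlaps(query, subject):
        counts[qi] += 1
    return counts


if __name__ == "__main__":
    ir1 = [(1, 5), (10, 15), (20, 25)]  # toy ranges, for illustration only
    ir2 = [(4, 11), (30, 35)]
    print(find_overlaps(ir1, ir2))
    print(count_overlaps(ir1, ir2))
```

The first element of each hit plays the role of queryHits() and the second that of subjectHits(). The nested loop is quadratic and only meant to show the definition.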

twang15 commented 3 years ago

Indexing VCF

A .fai index cannot be created from a VCF file. VCF indexing produces .idx files (or .tbi files when a bgzip-compressed VCF is indexed with tabix), whereas FASTA indexing produces .fai files.

hg19 to hg38 for SNP VCF

CrossMap.py vcf hg19ToHg38.over.chain ENCFF752OAX.vcf ~/rgtdata/hg38/genome_hg38.fa ENCFF752OAX_hg38.vcf
@ 2021-04-15 12:51:36: Lifting over ...
@ 2021-04-15 12:53:07: Total entries: 3781183
@ 2021-04-15 12:53:07: Failed to map: 26485

CrossMap.py vcf hg19ToHg38.over.chain ENCFF752OAX.vcf ~/rgtdata/hg38/genome_hg38.fa ENCFF752OAX_hg38_no_comp_allele.vcf --no-comp-allele
@ 2021-04-15 13:03:24: Lifting over ...
@ 2021-04-15 13:04:51: Total entries: 3781183
@ 2021-04-15 13:04:51: Failed to map: 6312
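The two runs are easier to compare as failure rates. The sketch below only redoes the arithmetic on the counts from the log lines above; the function name is made up for illustration.

```python
def failure_rate_percent(failed: int, total: int) -> float:
    """Share of entries that failed to map, as a percentage."""
    return 100.0 * failed / total


if __name__ == "__main__":
    total_entries = 3781183  # "Total entries" in both runs
    default_run = failure_rate_percent(26485, total_entries)
    no_comp_allele_run = failure_rate_percent(6312, total_entries)
    print(f"default:          {default_run:.2f}% failed")
    print(f"--no-comp-allele: {no_comp_allele_run:.2f}% failed")
```

This works out to roughly 0.70% unmapped entries for the default run and roughly 0.17% with --no-comp-allele.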

VCF to bed

Install bedops: https://github.com/bedops/bedops
vcf2bed --keep-header < ENCFF752OAX_hg38.vcf > ENCFF752OAX_hg38.bed
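The key step in going from VCF to BED is the coordinate shift: VCF positions are 1-based, while BED intervals are 0-based and half-open. The sketch below shows that conversion for a single record. It is an illustration of the convention, not a reimplementation of vcf2bed, whose exact output columns and handling of multi-base alleles may differ; the function name and the example record are hypothetical.

```python
def vcf_record_to_bed(line: str):
    """Convert one tab-separated VCF data line to (chrom, start, end, id)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref = fields[0], int(fields[1]), fields[2], fields[3]
    start = pos - 1  # 1-based VCF position -> 0-based BED start
    end = start + len(ref)  # half-open end; equals POS for a single-base SNP
    return chrom, start, end, variant_id


if __name__ == "__main__":
    # Hypothetical SNP record, for illustration only.
    record = "chr1\t100\trs0\tA\tG\t.\tPASS\t."
    print(vcf_record_to_bed(record))
```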

twang15 commented 3 years ago

Bed tools tutorial

Objective and tasks: https://github.com/twang15/K562-Analysis/issues/5

twang15 commented 3 years ago

Why intersect?

By far, the most common question asked of two sets of genomic features is whether or not any of the features in the two sets “overlap” with one another.
bedtools intersect accepts BED, GFF, VCF, and BAM files as input.
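What "overlap" means for two feature sets can be sketched in a few lines. This is a toy illustration and not how bedtools is implemented: features are (chrom, start, end) tuples in BED-style 0-based, half-open coordinates, the function name is made up, and the nested loop is quadratic.

```python
def intersect(features_a, features_b):
    """Return the overlapping portion of every pair of features on the same chromosome."""
    overlaps = []
    for chrom_a, start_a, end_a in features_a:
        for chrom_b, start_b, end_b in features_b:
            if chrom_a != chrom_b:
                continue
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            if start < end:  # half-open intervals that merely touch do not overlap
                overlaps.append((chrom_a, start, end))
    return overlaps


if __name__ == "__main__":
    set_a = [("chr1", 100, 200), ("chr1", 300, 400)]  # toy features
    set_b = [("chr1", 150, 350), ("chr2", 100, 200)]
    for feature in intersect(set_a, set_b):
        print(*feature, sep="\t")
```

Only the shared portion of each overlapping pair is reported, and features on different chromosomes never overlap.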

twang15 / K562-Analysis

Biology 1 #6
