twang15 / Long-read-RNA

Bioinformatics-1 #4

Closed twang15 closed 2 years ago

twang15 commented 3 years ago

Sequencing Read

In DNA sequencing, a read is an inferred sequence of base pairs (or base pair probabilities) corresponding to all or part of a single DNA fragment. A typical sequencing experiment involves fragmentation of the genome into millions of molecules, which are size-selected and ligated to adapters.

Sequencing depth

Sequencing depth (also known as read depth) describes the number of times that a given nucleotide in the genome has been read in an experiment. ... These are individually read and then bioinformatically overlapped or “tiled” to generate longer contiguous sequences making up the meaningful end data.

Sequencing Coverage

Coverage refers to the number of times the sequencing machine will sequence your genome. Because your genome has 6 billion letters, even a sequencing machine that reads each letter with 99.99% accuracy still has a 0.01% error rate, which means a single pass over your genome may contain 600,000 errors!
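The 600,000 figure is just genome length times per-letter error rate. A minimal sketch of that arithmetic, using the numbers quoted above (the function name is made up for illustration):

```python
def expected_errors(genome_length: int, per_base_accuracy: float) -> float:
    """Expected number of miscalled letters in a single pass over the genome."""
    return genome_length * (1.0 - per_base_accuracy)


# 6 billion letters read at 99.99% accuracy, as in the text above.
errors = expected_errors(6_000_000_000, 0.9999)
print(round(errors))
```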

To significantly reduce the potential for errors, most sequencing services will sequence your genome many times! Each additional time further reduces the error rate, which means that the more times your genome is sequenced (ie the higher the coverage), the more accurate the data will be.
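One way to see why repeated sequencing helps is a simple majority-vote model. This is only a sketch under strong assumptions that are not from the text above: errors in different reads are independent, every read has the same error rate, all wrong reads agree on the same wrong letter (a worst case), and the consensus is a plain majority vote with ties counted as correct. Real variant callers also use base qualities and mapping qualities, so treat the numbers as illustrative only.

```python
from math import comb


def consensus_error(n_reads: int, per_read_error: float) -> float:
    """Probability that more than half of n_reads are wrong at one position.

    Assumes independent errors with identical rate (a simplification).
    """
    p = per_read_error
    return sum(
        comb(n_reads, k) * p**k * (1 - p) ** (n_reads - k)
        for k in range(n_reads // 2 + 1, n_reads + 1)
    )


# Error probability at one position for increasing numbers of overlapping reads.
for n in (1, 3, 5, 30):
    print(n, consensus_error(n, 1e-4))
```

With a single read the result equals the per-read error rate, and it shrinks quickly as more reads cover the same position.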

When you see genome sequencing being sold, ‘x’ marks the spot to look for! The number before the ‘x’ is the coverage (the average number of times your genome will be sequenced).

30x WGS

For example, when you get 30x WGS, the ‘30x’ means that your entire genome will be sequenced an average of 30 times. If genome sequencing is only 0.4x, this means that the entire genome is sequenced less than a single time. This means that theoretically, genome sequencing data from 0.4x WGS may contain a lot of gaps in the data.
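To make the 30x versus 0.4x difference concrete, here is a rough sketch. Average coverage is the total number of sequenced letters divided by the genome length. The gap estimate assumes reads land uniformly at random, so that per-position depth is roughly Poisson distributed and the fraction of positions never read is about exp(-coverage). That model, the read count, and the read length are assumptions for illustration, not values from the text.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_length: int) -> float:
    """Average coverage: total sequenced letters / genome length."""
    return n_reads * read_length / genome_length


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of positions with zero reads (Poisson assumption)."""
    return math.exp(-coverage)


# Hypothetical run: 1.2 billion reads of 150 letters over the
# 6-billion-letter figure quoted in this thread.
print(mean_coverage(1_200_000_000, 150, 6_000_000_000))

for cov in (0.4, 30):
    print(f"{cov}x: about {fraction_uncovered(cov):.2%} of positions never read")
```

Under these assumptions roughly two thirds of positions are never read at 0.4x, while at 30x the unread fraction is negligible.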

30x versus 0.4x Sequencing

WGS

WGS is a laboratory technique in which the entire genome is analyzed, both the coding regions (exons) and the non-coding regions (such as introns and the stretches between genes). It provides a comprehensive map of a person’s entire genetic makeup, which consists of nearly 6 billion letters. That’s 6 billion data points for each person’s genome!

Genotyping test

Genotyping tests, also known as DNA chips and DNA microarrays, are very affordable (usually less than $100 per test) while also being very accurate, which is why they are excellent alternatives to more expensive whole genome sequencing. Genotyping is the type of DNA testing used most often, especially when DNA tests are sold online. For example, genotyping is the technology we use for our Ultimate DNA Test and it’s also the same technology used by 23andMe and AncestryDNA. Genotyping tests provide a tremendous amount of useful data for almost everything you’ll want to know about your genes (ancestry, health, medication reactions, wellness, nutrition, fitness, sleep, etc). This type of testing may also be used to create an extensive personalized health plan.

Value of WGS

WGS is our most powerful tool for testing for genetic disorders such as mutations that cause rare diseases as well as some forms of cancer. It’s now also being used for tracking infectious disease outbreaks.

twang15 commented 3 years ago

A haplotype is a group of alleles in an organism that are inherited together from a single parent.

twang15 commented 3 years ago

Get the DNA sequence for a specific position from UCSC

wget -q -O - "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr1:100000,100010" | grep -v "<"

TogoWS

wget http://togows.org/api/ucsc/hg38/chr1:100000-100010   # sequence only

wget http://togows.org/api/ucsc/hg38/chr1:100000-100010.fasta   # FASTA format

With a .fa (FASTA) file and a .bed file

$ cat test.fa
>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

$ cat test.bed
chr1 5 10

$ bedtools getfasta -fi test.fa -bed test.bed -fo test.fa.out

$ cat test.fa.out
>chr1:5-10
AAACC

Two Python solutions: Extract DNA sequence from FASTA file using Bed file
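For reference, a dependency-free sketch of the same idea (this is not either of the linked solutions, and the function names are made up here). It assumes the FASTA fits in memory and that BED intervals are 0-based and half-open, which is how bedtools getfasta treats them in the test.fa / test.bed example above.

```python
def read_fasta(lines):
    """Parse FASTA lines into a {sequence_name: sequence} dict."""
    chunks = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the name.
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {k: "".join(v) for k, v in chunks.items()}


def extract_intervals(seqs, bed_lines):
    """Return [(label, subsequence)] for each BED interval (0-based, half-open)."""
    out = []
    for line in bed_lines:
        fields = line.split()
        # Skip blank lines, comments, and track/browser header lines.
        if len(fields) < 3 or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        out.append((f"{chrom}:{start}-{end}", seqs[chrom][start:end]))
    return out


fasta = [">chr1", "AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG"]
bed = ["chr1 5 10"]
for label, seq in extract_intervals(read_fasta(fasta), bed):
    print(f">{label}\n{seq}")
```

For real files, pass open file handles instead of the in-memory lists; for large genomes an indexed reader would be a better fit than loading everything into a dict.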

twang15 commented 3 years ago

Our previous idea to modify bcftools and print out READ_NAME does not work because multiple READ_NAMEs may map to the same “CHROM POS” in output2.

Modifying samtools to print out REF is easy, but it is difficult to get ALT. I don't know how ALT is determined by bcftools.

twang15 commented 3 years ago

Read name uniqueness: 129736 read names appear twice, and 62564 read names appear only once.

62564 1
129736 2
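A count like this can be reproduced with a short script. The sketch below is not the command that was actually run; it only assumes the standard SAM layout, where header lines start with "@" and the read name (QNAME) is the first tab-separated field. The example records are invented.

```python
from collections import Counter


def read_name_multiplicity(sam_lines):
    """Return {times_seen: number_of_read_names} for SAM alignment lines."""
    names = Counter(
        line.split("\t", 1)[0]
        for line in sam_lines
        if line.strip() and not line.startswith("@")
    )
    # Second pass: how many names were seen once, twice, and so on.
    return dict(Counter(names.values()))


# Invented records, only to show the shape of the result.
example = [
    "@HD\tVN:1.6",
    "readA\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "readA\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "readB\t0\tchr1\t300\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
print(read_name_multiplicity(example))
```

As a consistency check on the numbers above, 62564 × 1 + 129736 × 2 = 322036, which matches the total number of barcode occurrences reported for the SAM file.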

There are 64 unique barcodes, occurring 322036 times in the SAM file.

1161 1-NPC12 BC001
975 1-NPC12 BC002
143 1-NPC12 BC003
12010 1-NPC12 BC004
1402 1-NPC12 BC005
1817 1-NPC12 BC006
1158 1-NPC12 BC007
423 1-NPC12 BC008
250 1-NPC12 BC009
3497 1-NPC12 BC010
1207 1-NPC12 BC011
957 1-NPC12 BC012
116 1-NPC12 BC013
1085 1-NPC12 BC014
1521 1-NPC12 BC015
949 1-NPC12 BC016
135 1-NPC12 BC097
5899 1-NPC12 BC098
2563 1-NPC12 BC099
3804 1-NPC12 BC100
1533 1-NPC12 BC101
125 1-NPC12 BC102
3477 1-NPC12 BC103
586 1-NPC12 BC104
22 1-NPC12 BC105
218 1-NPC12 BC106
224 1-NPC12 BC107
898 1-NPC12 BC108
1560 1-NPC12 BC109
2588 1-NPC12 BC110
1646 1-NPC12 BC111
41280 1-NPC12 BC112
22 1-NPC12 BC193
47 1-NPC12 BC194
13 1-NPC12 BC195
29 1-NPC12 BC196
50 1-NPC12 BC197
17 1-NPC12 BC198
75 1-NPC12 BC199
22 1-NPC12 BC200
34006 1-NPC12 BC201
45030 1-NPC12 BC202
857 1-NPC12 BC203
4327 1-NPC12 BC204
786 1-NPC12 BC205
592 1-NPC12 BC206
266 1-NPC12 BC207
7693 1-NPC12 BC208
346 1-NPC12 BC289
13175 1-NPC12 BC290
566 1-NPC12 BC291
694 1-NPC12 BC292
278 1-NPC12 BC293
34749 1-NPC12 BC294
410 1-NPC12 BC295
996 1-NPC12 BC296
69020 1-NPC12 BC297
5549 1-NPC12 BC298
196 1-NPC12 BC299
1749 1-NPC12 BC300
1613 1-NPC12 BC301
1442 1-NPC12 BC302
985 1-NPC12 BC303
1207 1-NPC12 BC304

twang15 commented 3 years ago

narrowPeak

  1. narrowPeak format: the peak column
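
For context, in the ENCODE narrowPeak format (BED6+4) the tenth column, peak, is the 0-based offset of the point-source (summit) from chromStart, or -1 when no summit was called. A minimal sketch of turning that into a genomic coordinate follows; the function name is hypothetical and the record is an example line, not data from this project.

```python
def narrowpeak_summit(line):
    """Return (chrom, summit_position) for one narrowPeak line.

    Returns None when the peak column is -1 (no point-source called).
    The summit position is 0-based, like chromStart.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("narrowPeak lines need 10 tab-separated columns")
    chrom = fields[0]
    chrom_start = int(fields[1])
    peak_offset = int(fields[9])  # 10th column: offset from chromStart
    if peak_offset == -1:
        return None
    return chrom, chrom_start + peak_offset


# Example record (illustrative values only).
record = "chr1\t9356548\t9356648\t.\t0\t.\t182\t5.0945\t-1\t50"
print(narrowpeak_summit(record))
```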

twang15 commented 3 years ago

From Fereshteh:

  1. Linked Reads Genomics - 10X Genomics
  2. Unique Molecule Identifier
  3. UMI-count modeling and differential expression analysis for single-cell RNA sequencing
  4. UMI reveal a novel sequencing artefact with implications for RNA-seq based gene expression analysis