rivas-lab / nanopore

Tools for analysis of long-read sequencing data.
0 stars 0 forks source link

count match/mismatch #12

Closed yk-tanigawa closed 7 years ago

yk-tanigawa commented 7 years ago
  1. How many FAST5 files do we have?
  2. Take several sample files, and convert them into FASTQ file
  3. Map to the reference genome (hg19)
  4. Bring in prior data (UKBB 500k individuals)
  5. Take platinum whole genome VCF file to compute $\theta$
yk-tanigawa commented 7 years ago

Step1, 2, 3 done

yk-tanigawa commented 7 years ago

For step5, I am downloading NA12878 data from Genome in a bottle consortia release v 3.3

yk-tanigawa commented 7 years ago

Count match/mismatch:

yk-tanigawa commented 7 years ago

Location of results file from LAST mapping

[ytanigaw@sherlock-ln01 login_node ~]$ ll -h /share/PI/mrivas/data/nanopore/*.maf
-rw-r--r-- 1 ytanigaw mrivas 88M Oct 25 22:47 /share/PI/mrivas/data/nanopore/20161006_minion_human_cDNA.maf
-rw-r--r-- 1 ytanigaw mrivas 86M Oct 25 23:56 /share/PI/mrivas/data/nanopore/20161008_wgs_caucasian_48hr.maf
yk-tanigawa commented 7 years ago

parameter for LAST aligner

yk-tanigawa commented 7 years ago

bwa alignment with MAPQ >= 60 && length >= 20 on chromosome 11

seq_id aligned insertion deletion soft clip H N P match mismatch
1 14563 247 1940 15350 0 0 0 0 0
2 1134 46 129 46745 0 0 0 0 0
3 1732 40 173 31619 0 0 0 0 0
4 19430 641 2078 15 0 0 0 0 0
5 4290 157 485 21985 0 0 0 0 0
6 24919 1687 1856 68 0 0 0 0 0
7 19965 560 2113 51 0 0 0 0 0

Need to discriminate match/mismatch

yk-tanigawa commented 7 years ago

BLAST

yk-tanigawa commented 7 years ago

manual inspection of alignment with IGV genome browser

seq_id pos len gene intron?
1 9162869 30160 DENND5A intron + exons
2 3729772 47925 NUP98 intron
3 70496710 33391 SHANK2 intron
4 124630341 20086 ESAM exons (coding + 5' UTL)
5 105992571 26432 - no annotated genes
6 88688000 26674 GRM5 intron
7 4484408 20576 - no annotated genes

Many deletions.

yk-tanigawa commented 7 years ago

count match mismatches

id = (match) X (mismatch) I (insertion) D (deletion) N S (soft clip) H P
1 13810 753 247 1940 0 15350 0 0
2 303 831 46 129 0 46745 0 0
3 470 1262 40 173 0 31619 0 0
4 5104 14326 641 2078 0 15 0 0
5 3794 496 157 485 0 21985 0 0
6 6641 18278 1687 1856 0 68 0 0
7 5288 14677 560 2113 0 51 0 0
yk-tanigawa commented 7 years ago

match/mismatches with Q-value

seq_id = (match) X (mismatch) I (insertion) D (deletion) N S (soft clip) H P
>=20 <20 >=20 <20 >=20 <20 >=20 <20 >=20 <20 >=20 <20 >=20 <20
1 1479 12331 13 740 13 234 1940 0 0 0 1280 14070 0 0 0 0
2 21 282 49 782 1 45 129 0 0 0 695 46050 0 0 0 0
3 57 413 107 1155 2 38 173 0 0 0 834 30785 0 0 0 0
4 481 4623 1195 13131 27 614 2078 0 0 0 0 15 0 0 0 0
5 312 3482 4 492 5 152 485 0 0 0 203 21782 0 0 0 0
6 456 6185 1186 17092 70 1617 1856 0 0 0 5 63 0 0 0 0
7 445 4843 1075 13602 16 544 2113 0 0 0 2 49 0 0 0 0
yk-tanigawa commented 7 years ago

mismatches with quality value >= 20 on seq. 1

pos (on chr11) ref read dbSNP 1.3.7
9163018 c A -
9168118 a G rs17256375
9168422 G A -
9171535 G A -
9171610 C A -
9172722 T A -
9174128 C A -
9174312 a T -
9177091 G A -
9177843 C A -
9178051 g A -
9179148 a T -
9179368 a T -

(Found a bug in my snippet. fixed and updated on 30 Nov at 13:10.)

yk-tanigawa commented 7 years ago

count.py under src/ can do this task.