yk-tanigawa commented 7 years ago

How many FAST5 files do we have?
Take several sample files, and convert them into FASTQ file
Map to the reference genome (hg19)
Bring in prior data (UKBB 500k individuals)
Take platinum whole genome VCF file to compute $\theta$

yk-tanigawa commented 7 years ago

Step1, 2, 3 done

yk-tanigawa commented 7 years ago

For step5, I am downloading NA12878 data from Genome in a bottle consortia release v 3.3

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest

yk-tanigawa commented 7 years ago

Count match/mismatch:

gap letter is - not _ in maf file generated by LAST aligner
consider upper/lower letter in the output file

yk-tanigawa commented 7 years ago

Location of results file from LAST mapping

[ytanigaw@sherlock-ln01 login_node ~]$ ll -h /share/PI/mrivas/data/nanopore/*.maf
-rw-r--r-- 1 ytanigaw mrivas 88M Oct 25 22:47 /share/PI/mrivas/data/nanopore/20161006_minion_human_cDNA.maf
-rw-r--r-- 1 ytanigaw mrivas 86M Oct 25 23:56 /share/PI/mrivas/data/nanopore/20161008_wgs_caucasian_48hr.maf

script file for mapping:
- https://github.com/rivas-lab/nanopore/blob/20161025_last_align_cDNA_success/scripts/20161025/last_cDNA.sbatch

yk-tanigawa commented 7 years ago

parameter for LAST aligner

http://www.nature.com/nmeth/journal/v12/n4/full/nmeth.3290.html#supplementary-information
- http://www.nature.com/nmeth/journal/v12/n4/extref/nmeth.3290-S1.pdf
- Sup. table 2 -s 2 -T 0 -Q 0 -r 1 -a 1 -b 1 -q 1
- http://gigascience.biomedcentral.com/articles/10.1186/2047-217X-3-22
- http://gigascience.biomedcentral.com/articles/10.1186/s13742-015-0043-z

yk-tanigawa commented 7 years ago

bwa alignment with MAPQ >= 60 && length >= 20 on chromosome 11

seq_id	aligned	insertion	deletion	soft clip
1	14563	247	1940	15350
2	1134	46	129	46745
3	1732	40	173	31619
4	19430	641	2078	15
5	4290	157	485	21985
6	24919	1687	1856	68
7	19965	560	2113	51

https://samtools.github.io/hts-specs/SAMv1.pdf
- match and mismatch are grouped as aligned
- aligned := " alignment match (can be a sequence match or mismatch)"
- soft clip := " soft clipping (clipped sequences present in SEQ)"

Need to discriminate match/mismatch

yk-tanigawa commented 7 years ago

BLAST

yk-tanigawa commented 7 years ago

manual inspection of alignment with IGV genome browser

seq_id	pos	len	gene	intron?
1	9162869	30160	DENND5A	intron + exons
2	3729772	47925	NUP98	intron
3	70496710	33391	SHANK2	intron
4	124630341	20086	ESAM	exons (coding + 5' UTL)
5	105992571	26432	-	no annotated genes
6	88688000	26674	GRM5	intron
7	4484408	20576	-	no annotated genes

Many deletions.

yk-tanigawa commented 7 years ago

count match mismatches

I have counted match/mismatches with original script (https://github.com/rivas-lab/nanopore/blob/201611281930count/notes/20161128_count.ipynb) and obtained the following statistics

id	= (match)	X (mismatch)	I (insertion)	D (deletion)	S (soft clip)
1	13810	753	247	1940	15350
2	303	831	46	129	46745
3	470	1262	40	173	31619
4	5104	14326	641	2078	15
5	3794	496	157	485	21985
6	6641	18278	1687	1856	68
7	5288	14677	560	2113	51

yk-tanigawa commented 7 years ago

match/mismatches with Q-value

seq_id	=	(match)	X	(mismatch)	I	(insertion)	D	(deletion)	N		S	(soft clip)	H		P
	>=20	<20	>=20	<20	>=20	<20			>=20	<20	>=20	<20	>=20	<20	>=20	<20
1	1479	12331	13	740	13	234	1940	0	0	0	1280	14070	0	0	0	0
2	21	282	49	782	1	45	129	0	0	0	695	46050	0	0	0	0
3	57	413	107	1155	2	38	173	0	0	0	834	30785	0	0	0	0
4	481	4623	1195	13131	27	614	2078	0	0	0	0	15	0	0	0	0
5	312	3482	4	492	5	152	485	0	0	0	203	21782	0	0	0	0
6	456	6185	1186	17092	70	1617	1856	0	0	0	5	63	0	0	0	0
7	445	4843	1075	13602	16	544	2113	0	0	0	2	49	0	0	0	0

yk-tanigawa commented 7 years ago

mismatches with quality value >= 20 on seq. 1

pos (on chr11)	ref	read	dbSNP 1.3.7
9163018	c	A	-
9168118	a	G	rs17256375
9168422	G	A	-
9171535	G	A	-
9171610	C	A	-
9172722	T	A	-
9174128	C	A	-
9174312	a	T	-
9177091	G	A	-
9177843	C	A	-
9178051	g	A	-
9179148	a	T	-
9179368	a	T	-

(Found a bug in my snippet. fixed and updated on 30 Nov at 13:10.)

yk-tanigawa commented 7 years ago

count.py under src/ can do this task.

rivas-lab / nanopore

count match/mismatch #12

parameter for LAST aligner

bwa alignment with MAPQ >= 60 && length >= 20 on chromosome 11

BLAST

manual inspection of alignment with IGV genome browser

count match mismatches

mismatches with quality value >= 20 on seq. 1