build a variation graph from a collection of yeast genomes

ekg commented 8 years ago

If you have FASTA sequences, it should be straightforward enough to name them, concatenate them into one FASTA per genome, and run vg msga as such:

time vg msga -f yeasts.fa -B 1024 -k 22 -K 11 -X 1 -E 4 -Q 22 -D >yeasts.vg

Or, more verbosely:

time vg msga -f yeasts.fa \
    --band-width 1024 \
    --map-kmer-size 22 \
    --idx-kmer-size 11 \
    --idx-doublings 1 \
    --idx-edge-max 4 \
    --idx-prune-subs 22 \
    --debug >yeasts.vg

ekg commented 8 years ago

One alternative is to use vg construct to go from VCF files and the reference to a graph. You can also use this as the base graph in msga, to which it will align the other sequences. The Saccharomyces Genome Resequencing Project has a download page that includes VCF files for a number of yeast strains:

➜  Downloads  zcat SGRP2-cerevisiae-freebayes-snps-Q30-GQ30.vcf.gz | vcfsamplenames 
BC187
DBVPG1106
DBVPG1373
DBVPG1788
DBVPG6044
DBVPG6765
L1374
L1528
SK1
UWOPS03-461.4
UWOPS83-787.3
UWOPS87-2421
W303
Y12
Y55
YJM975
YJM978
YJM981
YPS128

ekg commented 8 years ago

A paper describes the work: A High-Definition View of Functional Genetic Variation from Natural Yeast Genomes. Also, I'm waiting to hear back from Jared.

ekg commented 8 years ago

ftp://ftp.sanger.ac.uk/pub/users/dmc/yeast/SGRP2/

ekg commented 8 years ago

Also, this may be what we're after: http://www.tgac.ac.uk/news/232/15/Yeast-treasure-trove-goes-live/

nerdstrike commented 8 years ago

@markmcdowall might be able to help you with pombe data, help you find VCFs and such.

markmcdowall commented 8 years ago

For S. pombe there was a study of 57 of the lab strains of S. pombe that are most commonly used.

The paper you want is: http://europepmc.org/abstract/MED/25665008

The matching VCFs are in the EVA at: http://www.ebi.ac.uk/eva/?eva-study=PRJEB2733

willgdjones commented 8 years ago

Moving a previous conversation between me @ekg and @edawson over to this thread. Interestingly even with verbose output on, it doesn't seem clear why my reads aren't aligning.

The backstory, I'm comparing the process of aligning reads from http://opendata.ifr.ac.uk/NCYC/ (NCYC10 specifically) to the variation graph I've built and the SGD_2010 reference. I'm able to map to the variation graph but not to the linear reference.

bwa mem -N10 -v 4 data/SGD_2010.fasta data/NCYC10_Run1/100NCYC10reads.fastq > NCYCaln-T10.sam [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 100 sequences (10100 bp)... [M::mem_process_seqs] Processed 100 reads in 0.012 CPU sec, 0.332 real sec [main] Version: 0.7.13-r1126 [main] CMD: bwa mem -N10 -v 4 data/SGD_2010.fasta data/NCYC10_Run1/100NCYC10reads.fastq [main] Real time: 6.612 sec; CPU: 0.084 sec

Also for the pipeline aln + samse:

!bwa aln -n 100 data/SGD_2010.fasta data/NCYC10_Run1/100NCYC10reads.fastq > NCYCaln-T10.sai [bwa_aln_core] calculate SA coordinate... 0.01 sec [bwa_aln_core] write to the disk... 0.00 sec [bwa_aln_core] 100 sequences have been processed. [main] Version: 0.7.13-r1126 [main] CMD: bwa aln -n 100 data/SGD_2010.fasta data/NCYC10_Run1/100NCYC10reads.fastq [main] Real time: 10.371 sec; CPU: 0.028 sec

!bwa samse data/SGD_2010.fasta NCYCaln-T10.sai data/NCYC10_Run1/100NCYC10reads.fastq > NCYCaln-T10.sam [bwa_aln_core] convert to sequence coordinate... 0.03 sec [bwa_aln_core] refine gapped alignments... 0.00 sec [bwa_aln_core] print alignments... 0.00 sec [bwa_aln_core] 100 sequences have been processed. [main] Version: 0.7.13-r1126 [main] CMD: bwa samse data/SGD_2010.fasta NCYCaln-T10.sai data/NCYC10_Run1/100NCYC10reads.fastq [main] Real time: 8.700 sec; CPU: 0.040 sec

!samtools flagstat NCYCaln-T10.sam 100 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 0 + 0 mapped (0.00% : N/A) 0 + 0 paired in sequencing 0 + 0 read1 0 + 0 read2 0 + 0 properly paired (N/A : N/A) 0 + 0 with itself and mate mapped 0 + 0 singletons (N/A : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)

It does just seem like the reads aren't mapping. I'll try a different yeast strain which I know to be S. Cerevisiae.

ekg commented 8 years ago

Try aligning 100 reads with vg map with verbose debugging on (-D). What does it say?

On Mon, Apr 18, 2016 at 11:22 AM William Jones notifications@github.com wrote:

Moving a previous conversation between me @ekg https://github.com/ekg and @edawson https://github.com/edawson over to this thread. Interestingly even with verbose output on, it doesn't seem clear why my reads aren't aligning.

The backstory, I'm comparing the process of aligning reads from http://opendata.ifr.ac.uk/NCYC/ (NCYC10 specifically) to the variation graph I've built and the SGD_2010 reference. I'm able to map to the variation graph but not to the linear reference.

bwa mem -N10 -v 4 data/SGD_2010.fasta data/NCYC10_Run1/100NCYC10reads.fastq > NCYCaln-T10.sam [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 100 sequences (10100 bp)... [M::mem_process_seqs] Processed 100 reads in 0.012 CPU sec, 0.332 real sec [main] Version: 0.7.13-r1126 [main] CMD: bwa mem -N10 -v 4 data/SGD_2010.fasta data/NCYC10_Run1/100NCYC10reads.fastq [main] Real time: 6.612 sec; CPU: 0.084 sec

Also when doing it for

http://opendata.ifr.ac.uk/NCYC/

http://opendata.ifr.ac.uk/NCYC/

!bwa aln -n 100 data/SGD_2010.fasta data/NCYC10_Run1/100NCYC10reads.fastq

NCYCaln-T10.sai [bwa_aln_core] calculate SA coordinate... 0.01 sec [bwa_aln_core] write to the disk... 0.00 sec [bwa_aln_core] 100 sequences have been processed. [main] Version: 0.7.13-r1126 [main] CMD: bwa aln -n 100 data/SGD_2010.fasta data/NCYC10_Run1/100NCYC10reads.fastq [main] Real time: 10.371 sec; CPU: 0.028 sec

!bwa samse data/SGD_2010.fasta NCYCaln-T10.sai data/NCYC10_Run1/100NCYC10reads.fastq > NCYCaln-T10.sam [bwa_aln_core] convert to sequence coordinate... 0.03 sec [bwa_aln_core] refine gapped alignments... 0.00 sec [bwa_aln_core] print alignments... 0.00 sec [bwa_aln_core] 100 sequences have been processed. [main] Version: 0.7.13-r1126 [main] CMD: bwa samse data/SGD_2010.fasta NCYCaln-T10.sai data/NCYC10_Run1/100NCYC10reads.fastq [main] Real time: 8.700 sec; CPU: 0.040 sec

!samtools flagstat NCYCaln-T10.sam 100 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 0 + 0 mapped (0.00% : N/A) 0 + 0 paired in sequencing 0 + 0 read1 0 + 0 read2 0 + 0 properly paired (N/A : N/A) 0 + 0 with itself and mate mapped 0 + 0 singletons (N/A : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/189#issuecomment-211292255

willgdjones commented 8 years ago

Here's what it looks like for the first read. This read maps to the variation graph but not to the linear reference. It seems to be finding mems effectively

!vg-81b5a2cb map -D -G -X 0.9 -f data/NCYC10_Run1/1NCYC10reads.fastq -x output/yeastgraph-m100.xg -g output/yeastgraph-m100-p4.gcsa > output/1NCYC10reads.gam

Loading xg index output/yeastgraph-m100.xg... Loading GCSA2 index output/yeastgraph-m100-p4.gcsa... Loading LCP index output/yeastgraph-m100-p4.gcsa... aligning {"quality": "IiIiJSAkJSUnJycmJycnKCkoJicmKCkpJicnIScnJyEnJycoKCQkJyYmKCcmKCkpKSIoKCgjJycoJygpEiUmJSEfISIlJScmKCcnJyQZIx8kHRYeHiIkIhohIyEeISAiIiIfIh8=", "sequence": "TATTTTGCTCTTTTCTGGCTTTGCTATAAAGTTCATAAAACATTATTCTATTAGAGGACTAGCTCAATTTTCATTCGATAGCATCTTTATCGTTATTATGA", "name": "HWI-ST319:276:D1BR0ACXX:7:1101:1116:2188 1:N:0:ATCACG"} mems before filling [["TATTTTGCTCTTTT",[]],["TTTTGCTCTTTTCTG",[]],["TGCTCTTTTCTGG",[]],["CTCTTTTCTGGC",[]],["TCTTTTCTGGCTT",[]],["TCTTTTCTGGCTTT",[]],["CTTTTCTGGCTTTG",[]],["TCTGGCTTTGCT",[]],["CTGGCTTTGCTA",[]],["GCTTTGCTAT",[]],["GCTTTGCTATA",[]],["GCTTTGCTATAA",[]],["CTTTGCTATAAA",[]],["TTTGCTATAAAGT",[]],["TTGCTATAAAGTT",[]],["GCTATAAAGTTC",[]],["CTATAAAGTTCA",[]],["CTATAAAGTTCAT",[]],["ATAAAGTTCATA",[]],["TAAAGTTCATAAA",[]],["AAAGTTCATAAAA",[]],["AAAGTTCATAAAACA",[]],["TTCATAAAACAT",[]],["TTCATAAAACATTAT",[]],["ATAAAACATTATT",[]],["TAAAACATTATTC",[]],["TAAAACATTATTCT",[]],["AAACATTATTCTA",[]],["ACATTATTCTAT",[]],["ACATTATTCTATT",[]],["CATTATTCTATTA",[]],["ATTATTCTATTAG",[]],["ATTCTATTAGA",[]],["ATTCTATTAGAG",[]],["TTCTATTAGAGGA",[]],["TATTAGAGGAC",[]],["TATTAGAGGACT",[]],["TTAGAGGACTA",[]],["TTAGAGGACTAG",[]],["TTAGAGGACTAGC",[]],["AGGACTAGCT",[]],["AGGACTAGCTCA",[]],["GACTAGCTCAAT",[]],["ACTAGCTCAATT",[]],["CTAGCTCAATTT",[]],["TAGCTCAATTTT",[]],["GCTCAATTTTC",[]],["GCTCAATTTTCA",[]],["CTCAATTTTCAT",[]],["CTCAATTTTCATT",[]],["TCAATTTTCATTC",[]],["AATTTTCATTCG",[]],["ATTTTCATTCGA",[]],["ATTTTCATTCGAT",[]],["TTTTCATTCGATAG",[]],["TCATTCGATAGC",[]],["TCATTCGATAGCA",[]],["ATTCGATAGCAT",[]],["TCGATAGCATC",[]],["TCGATAGCATCTT",[]],["CGATAGCATCTTTAT",[]],["TAGCATCTTTATCGT",[]],["CATCTTTATCGTTA",[]],["CATCTTTATCGTTATTAT",[]],["TATCGTTATTATG",[]],["TCGTTATTATGA",[]]] mems after filtering [["TATTTTGCTCTTTT",["458849:8"]],["TTTTGCTCTTTTCTG",["587307:32"]],["TGCTCTTTTCTGG",["754248:60"]],["CTCTTTTCTGGC",["161838:0"]],["TCTTTTCTGGCTT",["471056:20","620182:48"]],["TCTTTTCTGGCTTT",["471056:20"]],["CTTTTCTGGCTTTG",["420687:15"]],["TCTGGCTTTGCT",["691583:31"]],["CTGGCTTTGCTA",["143855:44","435460:66","634670:40"]],["GCTTTGCTAT",["44789:27","48958:3","54711:34","93361:70","386543:3","499676:19","502626:15","516985:26","647377:13","710274:36","717703:36","732156:23"]],["GCTTTGCTATA",["48958:3","54711:34","516985:26","717703:36","732156:23"]],["GCTTTGCTATAA",["54711:34"]],["CTTTGCTATAAA",["584676:36","638922:58","781922:35"]],["TTTGCTATAAAGT",["565437:90"]],["TTGCTATAAAGTT",["371940:63"]],["GCTATAAAGTTC",["88879:38","147055:25"]],["CTATAAAGTTCA",["148634:58","325868:16","404564:3","766676:6"]],["CTATAAAGTTCAT",["325868:16","766676:6"]],["ATAAAGTTCATA",["403193:44","628917:2"]],["TAAAGTTCATAAA",["276594:40"]],["AAAGTTCATAAAA",["126125:25","406841:25","517677:1","530178:41"]],["AAAGTTCATAAAACA",["406841:25","530178:41"]],["TTCATAAAACAT",["202193:21","318916:8","350442:61","571914:90"]],["TTCATAAAACATTAT",["318916:8"]],["ATAAAACATTATT",["37628:31","451365:15"]],["TAAAACATTATTC",["677054:3","745519:28"]],["TAAAACATTATTCT",["677054:3"]],["AAACATTATTCTA",["78831:30"]],["ACATTATTCTAT",["150957:54","183332:1","416058:38","425753:2","511228:34","511554:43","735341:7"]],["ACATTATTCTATT",["150957:54"]],["CATTATTCTATTA",["126151:23"]],["ATTATTCTATTAG",["387265:55"]],["ATTCTATTAGA",["107769:90","113002:17","338064:74","434236:7","480929:1","606976:13","759424:23"]],["ATTCTATTAGAG",["759424:23"]],["TTCTATTAGAGGA",["65330:95","765283:50"]],["TATTAGAGGAC",["713373:34","733531:1"]],["TATTAGAGGACT",["733531:1"]],["TTAGAGGACTA",["10601:65","593492:40","652337:49"]],["TTAGAGGACTAG",["593492:40","652337:49"]],["TTAGAGGACTAGC",["593492:40"]],["AGGACTAGCT",["52661:86","125077:37","227711:2","309070:70","744433:30"]],["AGGACTAGCTCA",["744433:30"]],["GACTAGCTCAAT",["713745:1"]],["ACTAGCTCAATT",["67818:37","564635:6"]],["CTAGCTCAATTT",["430488:38"]],["TAGCTCAATTTT",["506480:6"]],["GCTCAATTTTC",["401556:46","409100:2","512557:41","559702:80","669687:68","705882:12"]],["GCTCAATTTTCA",["401556:46","409100:2","559702:80","705882:12"]],["CTCAATTTTCAT",["12793:22","34924:0","669308:14","714469:33"]],["CTCAATTTTCATT",["714469:33"]],["TCAATTTTCATTC",["159190:12"]],["AATTTTCATTCG",["71309:10"]],["ATTTTCATTCGA",["363885:5","448792:34","567072:1","765258:66"]],["ATTTTCATTCGAT",["363885:5","567072:1"]],["TTTTCATTCGATAG",["263775:39"]],["TCATTCGATAGC",["406615:24","661628:33"]],["TCATTCGATAGCA",["406615:24"]],["ATTCGATAGCAT",["397453:46"]],["TCGATAGCATC",["415166:6","609877:9"]],["TCGATAGCATCTT",["415166:6"]],["CGATAGCATCTTTAT",["452838:38"]],["TAGCATCTTTATCGT",["588856:29"]],["CATCTTTATCGTTA",["267163:0","580741:71"]],["CATCTTTATCGTTATTAT",["580741:71"]],["TATCGTTATTATG",["575072:21"]],["TCGTTATTATGA",["181082:62","453155:16"]]] aligning to 112 clusters 18:1 580741-580741 15:1 452838-452838 15:1 530178-530178 15:1 406841-406841 15:1 587307-587307 15:1 588856-588856 15:1 318916-318916 14:1 267163-267163 14:1 471056-471056 14:1 458849-458849 14:1 420687-420687 14:1 263775-263775 14:1 677054-677054 13:1 406615-406615 13:1 593492-593492 13:1 325868-325868 13:1 65330-65330 13:1 714469-714469 13:1 371940-371940 13:1 387265-387265 13:1 745519-745519 13:1 575072-575072 13:1 567072-567072 13:1 363885-363885 13:1 754248-754248 13:1 565437-565437 13:1 415166-415166 13:1 517677-517677 13:1 451365-451365 13:1 37628-37628 13:1 765283-765283 13:1 766676-766676 13:1 276594-276594 13:1 78831-78831 13:1 620182-620182 13:1 150957-150957 13:1 126125-126125 13:1 159190-159190 13:1 126151-126151 12:1 511228-511228 12:1 759424-759424 12:1 559702-559702 12:1 765258-765258 12:1 511554-511554 12:1 661628-661628 12:1 691583-691583 12:1 506480-506480 12:1 705882-705882 12:1 713745-713745 12:1 564635-564635 12:1 733531-733531 12:1 669308-669308 12:1 571914-571914 12:1 652337-652337 12:1 638922-638922 12:1 634670-634670 12:1 584676-584676 12:1 628917-628917 12:1 744433-744433 12:1 781922-781922 12:1 735341-735341 12:1 397453-397453 12:1 143855-143855 12:1 147055-147055 12:1 148634-148634 12:1 161838-161838 12:1 181082-181082 12:1 183332-183332 12:1 202193-202193 12:1 88879-88879 12:1 71309-71309 12:1 350442-350442 12:1 67818-67818 12:1 401556-401556 12:1 403193-403193 12:1 409100-409100 12:1 404564-404564 12:1 416058-416058 12:1 425753-425753 12:1 430488-430488 12:1 54711-54711 12:1 12793-12793 12:1 453155-453155 12:1 448792-448792 12:1 34924-34924 12:1 435460-435460 11:1 480929-480929 11:1 113002-113002 11:1 713373-713373 11:1 48958-48958 11:1 107769-107769 11:1 717703-717703 11:1 732156-732156 11:1 10601-10601 11:1 669687-669687 11:1 512557-512557 11:1 516985-516985 11:1 434236-434236 11:1 338064-338064 11:1 606976-606976 11:1 609877-609877 10:1 93361-93361 10:1 309070-309070 10:1 227711-227711 10:1 386543-386543 10:1 647377-647377 10:1 52661-52661 10:1 710274-710274 10:1 44789-44789 10:1 125077-125077 10:1 502626-502626 10:1 499676-499676 attempt 1 on cluster 580741-580741 attempt 1 on subgraph 580734-580747 got 2 forward and 0 reverse mems softclip before 40 2 average node size 32 softclip after 40 2 attempt 2 on cluster 452838-452838 attempt 2 on subgraph 452830-452845 got 1 forward and 0 reverse mems softclip before 30 4 average node size 14 softclip after 30 4 attempt 3 on cluster 530178-530178 attempt 3 on subgraph 530170-530186 got 2 forward and 0 reverse mems softclip before 23 59 average node size 7 softclip after 23 21 softclip before 23 21 average node size 6 softclip after 23 21 attempt 4 on cluster 406841-406841 attempt 4 on subgraph 406833-406849 got 2 forward and 0 reverse mems softclip before 23 64 average node size 2 softclip after 23 3 softclip before 23 3 average node size 18 softclip after 23 3 attempt 5 on cluster 587307-587307 attempt 5 on subgraph 587300-587316 got 1 forward and 0 reverse mems softclip before 2 11 average node size 12 softclip after 2 6 softclip before 2 6 average node size 15 softclip after 2 6 attempt 6 on cluster 588856-588856 attempt 6 on subgraph 588848-588863 got 1 forward and 0 reverse mems softclip before 0 8 average node size 22 softclip after 0 8 attempt 7 on cluster 318916-318916 attempt 7 on subgraph 318908-318924 got 2 forward and 0 reverse mems softclip before 31 3 average node size 10 softclip after 31 3

ekg commented 8 years ago

This is saying that it is aligning with 40 soft clips. Is this the whole log?

We should check that it is passing the identity filter. I don't know if that is output to the log. It would be an easy and educational change to add debugging output for your investigation :)

On Mon, Apr 18, 2016, 12:24 William Jones notifications@github.com wrote:

Here's what it looks like for the first read. This read maps to the variation graph but not to the linear reference. It seems to be finding mems effectively

`!vg-81b5a2cb map -D -G -X 0.9 -f data/NCYC10_Run1/1NCYC10reads.fastq -x output/yeastgraph-m100.xg -g output/yeastgraph-m100-p4.gcsa > output/1NCYC10reads.gam

Loading xg index output/yeastgraph-m100.xg... Loading GCSA2 index output/yeastgraph-m100-p4.gcsa... Loading LCP index output/yeastgraph-m100-p4.gcsa... aligning {"quality": "IiIiJSAkJSUnJycmJycnKCkoJicmKCkpJicnIScnJyEnJycoKCQkJyYmKCcmKCkpKSIoKCgjJycoJygpEiUmJSEfISIlJScmKCcnJyQZIx8kHRYeHiIkIhohIyEeISAiIiIfIh8=", "sequence": "TATTTTGCTCTTTTCTGGCTTTGCTATAAAGTTCATAAAACATTATTCTATTAGAGGACTAGCTCAATTTTCATTCGATAGCATCTTTATCGTTATTATGA", "name": "HWI-ST319:276:D1BR0ACXX:7:1101:1116:2188 1:N:0:ATCACG"} mems before filling [["TATTTTGCTCTTTT",[]],["TTTTGCTCTTTTCTG",[]],["TGCTCTTTTCTGG",[]],["CTCTTTTCTGGC",[]],["TCTTTTCTGGCTT",[]],["TCTTTTCTGGCTTT",[]],["CTTTTCTGGCTTTG",[]],["TCTGGCTTTGCT",[]],["CTGGCTTTGCTA",[]],["GCTTTGCTAT",[]],["GCTTTGCTATA",[]],["GCTTTGCTATAA",[]],["CTTTGCTATAAA",[]],["TTTGCTATAAAGT",[]],["TTGCTATAAAGTT",[]],["GCTATAAAGTTC",[]],["CTATAAAGTTCA",[]],["CTATAAAGTTCAT",[]],["ATAAAGTTCATA",[]],["TAAAGTTCATAAA",[]],["AAAGTTCATAAAA",[]],["AAAGTTCATAAAACA",[]],["TTCATAAAACAT",[]],["TTCATAAAACATTAT",[]],["ATAAAACATTATT",[]],["TAAAACATTATTC",[]],["TAAAACATTATTCT",[]],["AAACATTATTCTA",[]],["ACATTATTCTAT",[]],["ACATTATTCTATT",[]],["CATTATTCTATTA",[]],["ATTATTCTATTAG",[]],["ATTCTATTAGA",[]],["ATTCTATTAGAG",[]],["TTCTATTAGAGGA",[]],["TATTAGAGGAC",[]],["TATTAGAGGACT",[]],["TTAGAGGACTA",[]],["TTAGAGGACTAG",[]],["TTAGAGGACTAGC",[]],["AGGACTAGCT",[]],["AGGACTAGCTCA",[]],["GACTAGCTCAAT",[]],["ACTAGCTCAATT",[]],["CTAGCTCAATTT",[]],["TAGCTCAATTTT",[]],["GCTCAATTTTC",[]],["GCTCAATTTTCA ",[]],["CTCAATTTTCAT",[]],["CTCAATTTTCATT",[]],["TCAATTTTCATTC",[]],["AATTTTCATTCG",[]],["ATTTTCATTCGA",[]],["ATTTTCATTCGAT",[]],["TTTTCATTCGATAG",[]],["TCATTCGATAGC",[]],["TCATTCGATAGCA",[]],["ATTCGATAGCAT",[]],["TCGATAGCATC",[]],["TCGATAGCATCTT",[]],["CGATAGCATCTTTAT",[]],["TAGCATCTTTATCGT",[]],["CATCTTTATCGTTA",[]],["CATCTTTATCGTTATTAT",[]],["TATCGTTATTATG",[]],["TCGTTATTATGA",[]]] mems after filtering [["TATTTTGCTCTTTT",["458849:8"]],["TTTTGCTCTTTTCTG",["587307:32"]],["TGCTCTTTTCTGG",["754248:60"]],["CTCTTTTCTGGC",["161838:0"]],["TCTTTTCTGGCTT",["471056:20","620182:48"]],["TCTTTTCTGGCTTT",["471056:20"]],["CTTTTCTGGCTTTG",["420687:15"]],["TCTGGCTTTGCT",["691583:31"]],["CTGGCTTTGCTA",["143855:44","435460:66","634670:40"]],["GCTTTGCTAT",["44789:27","48958:3","54711:34","93361:70","386543:3","499676:19","502626:15","516985:26","647377:13","710274:36","717703:36","732156:23"]],["GCTTTGCTATA",["48958:3","54711:34","516985:26","717703:36","732156:23"]],["GCTTTGCTATAA",["54711:34"]],["CTTTGCTATAAA",["584676:36","638922:58","781922:35"]],["TTTGCTATAAAGT",["565437:90"]],["TTGCTATAAAGTT",["371940:63"]],["GCTATAAAGTTC",["88879:38","147055:25"]],["CTATAAAGTTCA",["148634:58","325868:16","404564:3","766676:6"]],["CTATAAAGTTCAT",["325868:16","766676:6"]],["ATAAAGTTCATA",["403193:44","628917:2"]],["TAAAGTTCATAAA",["276594:40"]],["AAAGTTCATAAAA",["126125:25","406841:25","517677 :1","530178:41"]],["AAAGTTCATAAAACA",["406841:25","530178:41"]],["TTCATAAAACAT",["202193:21","318916:8","350442:61","571914:90"]],["TTCATAAAACATTAT",["318916:8"]],["ATAAAACATTATT",["37628:31","451365:15"]],["TAAAACATTATTC",["677054:3","745519:28"]],["TAAAACATTATTCT",["677054:3"]],["AAACATTATTCTA",["78831:30"]],["ACATTATTCTAT",["150957:54","183332:1","416058:38","425753:2","511228:34","511554:43","735341:7"]],["ACATTATTCTATT",["150957:54"]],["CATTATTCTATTA",["126151:23"]],["ATTATTCTATTAG",["387265:55"]],["ATTCTATTAGA",["107769:90","113002:17","338064:74","434236:7","480929:1","606976:13","759424:23"]],["ATTCTATTAGAG",["759424:23"]],["TTCTATTAGAGGA",["65330:95","765283:50"]],["TATTAGAGGAC",["713373:34","733531:1"]],["TATTAGAGGACT",["733531:1"]],["TTAGAGGACTA",["10601:65","593492:40","652337:49"]],["TTAGAGGACTAG",["593492:40","652337:49"]],["TTAGAGGACTAGC",["593492:40"]],["AGGACTAGCT",["52661:86","125077:37","227711:2","309070:70","744433:30"]],["AGGACTAGCTCA",["744433:30"]],["GACTAGCT CAAT",["713745:1"]],["ACTAGCTCAATT",["67818:37","564635:6"]],["CTAGCTCAATTT",["430488:38"]],["TAGCTCAATTTT",["506480:6"]],["GCTCAATTTTC",["401556:46","409100:2","512557:41","559702:80","669687:68","705882:12"]],["GCTCAATTTTCA",["401556:46","409100:2","559702:80","705882:12"]],["CTCAATTTTCAT",["12793:22","34924:0","669308:14","714469:33"]],["CTCAATTTTCATT",["714469:33"]],["TCAATTTTCATTC",["159190:12"]],["AATTTTCATTCG",["71309:10"]],["ATTTTCATTCGA",["363885:5","448792:34","567072:1","765258:66"]],["ATTTTCATTCGAT",["363885:5","567072:1"]],["TTTTCATTCGATAG",["263775:39"]],["TCATTCGATAGC",["406615:24","661628:33"]],["TCATTCGATAGCA",["406615:24"]],["ATTCGATAGCAT",["397453:46"]],["TCGATAGCATC",["415166:6","609877:9"]],["TCGATAGCATCTT",["415166:6"]],["CGATAGCATCTTTAT",["452838:38"]],["TAGCATCTTTATCGT",["588856:29"]],["CATCTTTATCGTTA",["267163:0","580741:71"]],["CATCTTTATCGTTATTAT",["580741:71"]],["TATCGTTATTATG",["575072:21"]],["TCGTTATTATGA",["181082:62","453155:16"]]] aligning to 112 clusters 18:1 580741-580741 15:1 452838-452838 15:1 530178-530178 15:1 406841-406841 15:1 587307-587307 15:1 588856-588856 15:1 318916-318916 14:1 267163-267163 14:1 471056-471056 14:1 458849-458849 14:1 420687-420687 14:1 263775-263775 14:1 677054-677054 13:1 406615-406615 13:1 593492-593492 13:1 325868-325868 13:1 65330-65330 13:1 714469-714469 13:1 371940-371940 13:1 387265-387265 13:1 745519-745519 13:1 575072-575072 13:1 567072-567072 13:1 363885-363885 13:1 754248-754248 13:1 565437-565437 13:1 415166-415166 13:1 517677-517677 13:1 451365-451365 13:1 37628-37628 13:1 765283-765283 13:1 766676-766676 13:1 276594-276594 13:1 78831-78831 13:1 620182-620182 13:1 150957-150957 13:1 126125-126125 13:1 159190-159190 13:1 126151-126151 12:1 511228-511228 12:1 759424-759424 12:1 559702-559702 12:1 765258-765258 12:1 511554-511554 12:1 661628-661628 12:1 691583-691583 12:1 506480-506480 12:1 705882-705882 12:1 713745-713745 12:1 564635-564635 12:1 733531-733531 12:1 669308-669308 12:1 571914-571914 12:1 652337-652337 12:1 638922-638922 12:1 634670-634670 12:1 584676-584676 12:1 628917-628917 12:1 744433-744433 12:1 781922-781922 12:1 735341-735341 12:1 397453-397453 12:1 143855-143855 12:1 147055-147055 12:1 148634-148634 12:1 161838-161838 12:1 181082-181082 12:1 183332-183332 12:1 202193-202193 12:1 88879-88879 12:1 71309-71309 12:1 350442-350442 12:1 67818-67818 12:1 401556-401556 12:1 403193-403193 12:1 409100-409100 12:1 404564-404564 12:1 416058-416058 12:1 425753-425753 12:1 430488-430488 12:1 54711-54711 12:1 12793-12793 12:1 453155-453155 12:1 448792-448792 12:1 34924-34924 12:1 435460-435460 11:1 480929-480929 11:1 113002-113002 11:1 713373-713373 11:1 48958-48958 11:1 107769-107769 11:1 717703-717703 11:1 732156-732156 11:1 10601-10601 11:1 669687-669687 11:1 512557-512557 11:1 516985-516985 11:1 434236-434236 11:1 338064-338064 11:1 606976-606976 11:1 609877-609877 10:1 93361-93361 10:1 309070-309070 10:1 227711-227711 10:1 386543-386543 10:1 647377-647377 10:1 52661-52661 10:1 710274-710274 10:1 44789-44789 10:1 125077-125077 10:1 502626-502626 10:1 499676-499676 attempt 1 on cluster 580741-580741 attempt 1 on subgraph 580734-580747 got 2 forward and 0 reverse mems softclip before 40 2 average node size 32 softclip after 40 2 attempt 2 on cluster 452838-452838 attempt 2 on subgraph 452830-452845 got 1 forward and 0 reverse mems softclip before 30 4 average node size 14 softclip after 30 4 attempt 3 on cluster 530178-530178 attempt 3 on subgraph 530170-530186 got 2 forward and 0 reverse mems softclip before 23 59 average node size 7 softclip after 23 21 softclip before 23 21 average node size 6 softclip after 23 21 attempt 4 on cluster 406841-406841 attempt 4 on subgraph 406833-406849 got 2 forward and 0 reverse mems softclip before 23 64 average node size 2 softclip after 23 3 softclip before 23 3 average node size 18 softclip after 23 3 attempt 5 on cluster 587307-587307 attempt 5 on subgraph 587300-587316 got 1 forward and 0 reverse mems softclip before 2 11 average node size 12 softclip after 2 6 softclip before 2 6 average node size 15 softclip after 2 6 attempt 6 on cluster 588856-588856 attempt 6 on subgraph 588848-588863 got 1 forward and 0 reverse mems softclip before 0 8 average node size 22 softclip after 0 8 attempt 7 on cluster 318916-318916 attempt 7 on subgraph 318908-318924 got 2 forward and 0 reverse mems softclip before 31 3 average node size 10 softclip after 31 3`

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/189#issuecomment-211316056

willgdjones commented 8 years ago

Yep it's the whole log from what is returned - and yes that would be a great exercise :D

willgdjones commented 8 years ago

So the reads from a different strain NCYC88 map just fine - I think it must be do to this.

!samtools flagstat NCYC88aln.sam

100 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 93 + 0 mapped (93.00% : N/A) 0 + 0 paired in sequencing 0 + 0 read1 0 + 0 read2 0 + 0 properly paired (N/A : N/A) 0 + 0 with itself and mate mapped 0 + 0 singletons (N/A : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)

willgdjones commented 8 years ago

@ekg @edawson Is there a way I can quickly build recent versions of VG up on the farm? Development seems to be moving quite quickly and I'm on an old version (vg-81b5a2cb)

ekg commented 8 years ago

Yes, I build on the farm. You will need to have a new version of g++ in your path and also bison. There might be something else I'm forgetting. It is awkward to build there because we don't have control over the package manager.

On Mon, Apr 18, 2016, 15:07 William Jones notifications@github.com wrote:

@ekg https://github.com/ekg @edawson https://github.com/edawson Is there a way I can quickly build recent versions of VG up on the farm? Development seems to be moving quite quickly and I'm on an old version.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/189#issuecomment-211372372

ekg commented 8 years ago

Try adding /nfs/users/nfs_e/eg10/bin and /lustre/scratch113/projects/graphs/bin to your PATH. Put them in front so the executables there take precedence. Now try to build vg.

On Mon, Apr 18, 2016, 15:23 Erik Garrison erik.garrison@gmail.com wrote:

Yes, I build on the farm. You will need to have a new version of g++ in your path and also bison. There might be something else I'm forgetting. It is awkward to build there because we don't have control over the package manager.

On Mon, Apr 18, 2016, 15:07 William Jones notifications@github.com wrote:

@ekg https://github.com/ekg @edawson https://github.com/edawson Is there a way I can quickly build recent versions of VG up on the farm? Development seems to be moving quite quickly and I'm on an old version.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/189#issuecomment-211372372

willgdjones commented 8 years ago

I'm redownloading everything on the farm - is there a better way to do this?

willgdjones commented 8 years ago

@ekg All goes well except jansson isn't is available!

ekg commented 8 years ago

I guess you will need to install it in your home directory. Build with configure --prefix=$HOME.

On Mon, Apr 18, 2016, 16:48 William Jones notifications@github.com wrote:

@ekg https://github.com/ekg All goes well except jansson isn't is available!

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/189#issuecomment-211411754

willgdjones commented 8 years ago

Hey @ekg @edawson noticing some strange behaviour in the recent version of vg map. I'm simulating perfect reads from the graph but the alignment scores vary quite a bit.

Simulate 10 perfect reads

!vg sim -l 101 -n 10 -e 0.00 -i 0.00 -x data/graphs/yeastgraph-m100.xg > data/raw/reads/sim10perf.reads

Map these reads

!time vg map -r data/raw/reads/sim10perf.reads -x data/graphs/yeastgraph-m100.xg -g data/graphs/yeastgraph-m100-p4.gcsa > data/GAM/sim10perf.gam

real 0m0.905s user 0m0.444s sys 0m0.260s

Look at scores

!vg view -a data/GAM/sim10perf.gam | jq '.score'

16 14 101 17 15 15 16 16 101 17

willgdjones commented 8 years ago

Along this theme: it seems that vg map doesn't quite work as well as BWA mem. Here's a histogram of alignment scores from vg map.

download

Many of them are near zero, whereas bwa mem maps a very high percentage of them.

1001 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 1 + 0 supplementary 0 + 0 duplicates 964 + 0 mapped (96.30%:-nan%) 0 + 0 paired in sequencing 0 + 0 read1 0 + 0 read2 0 + 0 properly paired (-nan%:-nan%) 0 + 0 with itself and mate mapped 0 + 0 singletons (-nan%:-nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)

Possible reasons: Vg map is not mapping the reverse strand, around half of my reads mapping to the reverse strand in bwa mem.

iqbal-lab commented 8 years ago

Hi @willgdjones - why do you expect there to be less variation in the mapping qualities? Even with perfect reads, mapping quality will depend on whether the read comes from a repeat. Does bwa_mem give a very different histo with them all more confident?

ekg commented 8 years ago

Do you know for sure that exactly the same reads are present in each histogram? I ask because it looks like there are only 425 reads in the biggest bin of the bwa set, but 2x this number in the vg one.

On Thu, Apr 21, 2016 at 10:40 AM William Jones notifications@github.com wrote:

Yeah I just checked it out, it does [image: download 1] https://cloud.githubusercontent.com/assets/1719848/14704604/6a745cd2-07ad-11e6-8e76-19eaf7340f35.png

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/189#issuecomment-212832618

willgdjones commented 8 years ago

@ekg Yep I just noticed this! Quick spot..

ekg commented 8 years ago

Among those reads that are in both sets, do a scatter plot where X=bwa mem mapping quality and Y=vg alignment score.

On Thu, Apr 21, 2016 at 10:42 AM William Jones notifications@github.com wrote:

@ekg https://github.com/ekg Yep I just noticed this! Quick spot..

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/189#issuecomment-212833307

willgdjones commented 8 years ago

So this is the revised bwa mem version. The 0 quality scores correspond with the reads that weren't mapped at all.

download 2

ekg commented 8 years ago

How was the index generated and how is the mapping working?

Note that the alignment score is not equivalent to the mapping quality. They may be correlated depending on how they are derived, but they are no the same thing.

What do you get when you plot X=bwa mappping quality and Y=vg alignment quality?

willgdjones commented 8 years ago

The reads in the sam output are in the different order to the input fastq order. Just fiddling and joining on the reads now to get them to match up!

willgdjones commented 8 years ago

Chatting with Richard it seems likely that the reason for half of the reads mapping unsuccessfully is that they need reverse complementing and that this is not done automatically within vg map.

ekg commented 8 years ago

This is done automatically within vg map.

On Thu, Apr 21, 2016 at 1:58 PM William Jones notifications@github.com wrote:

Chatting with Richard it seems likely that the reason for half of the reads mapping unsuccessfully is that they need reverse complementing and that this is not done automatically within vg map.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/189#issuecomment-212905803

ekg commented 8 years ago

I see, you've probably generated the gcsa2 index without reverse kmers. What documentation suggests to do that? We should correct it. There is no need to generate kmers only from the reverse strand.

You disable this behavior by removing -F from the vg kmers call in the indexing process.

On Thu, Apr 21, 2016 at 2:14 PM Erik Garrison erik.garrison@gmail.com wrote:

This is done automatically within vg map.

On Thu, Apr 21, 2016 at 1:58 PM William Jones notifications@github.com wrote:

Chatting with Richard it seems likely that the reason for half of the reads mapping unsuccessfully is that they need reverse complementing and that this is not done automatically within vg map.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/189#issuecomment-212905803

willgdjones commented 8 years ago

I see! That makes a lot of sense. Somehow this came from where we were discussing on the gist, can't immediately find where the -F came from prior to that.

https://gist.github.com/willgdjones/8a3b2ac59a645d4c033a4f0abb22728a

ekg commented 8 years ago

You want this part: https://github.com/vgteam/vg/wiki/working-with-a-whole-genome-variation-graph#indexing-with-xg-and-gcsa2

Let's work from that page on the wiki to make sure we've got the right docs up and are on the same page. Can you edit it?

On Thu, Apr 21, 2016 at 2:41 PM William Jones notifications@github.com wrote:

I see, some this came from where we were discussing on the gist https://gist.github.com/willgdjones/8a3b2ac59a645d4c033a4f0abb22728a

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/189#issuecomment-212923967

willgdjones commented 8 years ago

Yep I can edit this - it's good as it is though for now though. I can also put my work notebooks up as extra documentation.

ekg commented 6 years ago

Since this thread was last updated we've made this kind of resequencing analysis routine.

Thanks to everyone who participated, in particular @willgdjones who's focus on the yeast data set really helped to kickstart serious improvement in vg map through rigorous testing.

ekg commented 6 years ago

I should leave this here. It implements indexing of the SGRP2, including pruning with GBWT path filling into the gaps to build an GCSA2 index with order 256:

vg construct -a -r SGD_2010.fasta -v SGRP2-cerevisiae-freebayes-snps-Q30-GQ30.vcf.gz -p >SGRP2-cerevisiae.vg
vg index -p -x SGRP2-cerevisiae.haps.xg -v SGRP2-cerevisiae-freebayes-snps-Q30-GQ30.vcf.gz -G SGRP2-cerevisiae.gbwt SGRP2-cerevisiae.vg
vg prune -p -u -g SGRP2-cerevisiae.gbwt -m SGRP2-cerevisiae.unfold.mapping SGRP2-cerevisiae.vg >SGRP2-cerevisiae.prune.vg
mkdir -p work
vg index -b work/ -p -g SGRP2-cerevisiae.gcsa -f SGRP2-cerevisiae.unfold.mapping -k 16 SGRP2-cerevisiae.prune.vg

vgteam / vg

build a variation graph from a collection of yeast genomes #189