vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.11k stars 194 forks source link

vg map produces invalid gams on longer reads (duplicate ranks, inserts longer than read length) #1417

Closed wheaton5 closed 6 years ago

wheaton5 commented 6 years ago

was running something like vg augment vg.vg gam.gam -q 15 > augmented.vg

vg.vg had been created from a linear reference + vcfs and was chromosome chunked but put in the same id space then combined with cat vg1.vg vg2.vg... > vg.vg

Also gam.gam was created from multiple gam files with cat. The gam files were fermikit unitigs aligned to the graph reference. Let me know if any more information would be useful.

Error determining rank of mapping 1 in path : {"position": {"node_id": 7967739, "offset": 249, "is_reverse": true}, "edit": [{"from_length": 50, "to_length": 50}, {"to_length": 15, "sequence": "AAGCGGTAGTGGTGG"}, {"from_length": 3, "to_length": 3}, {"from_length": 1, "to_length": 1, "sequence": "C"}, {"from_length": 22, "to_length": 22}, {"from_length": 1, "to_length": 1, "sequence": "C"}, {"from_length": 6, "to_length": 6}], "rank": 1}
Error determining rank of mapping 2 in path : {"position": {"node_id": 7967739, "offset": 332, "is_reverse": true}, "edit": [{"from_length": 20, "to_length": 20}, {"from_length": 1, "to_length": 1, "sequence": "A"}, {"from_length": 21, "to_length": 21}, {"from_length": 1, "to_length": 1, "sequence": "G"}, {"from_length": 5, "to_length": 5}, {"from_length": 1, "to_length": 1, "sequence": "C"}, {"from_length": 10, "to_length": 10}, {"from_length": 1, "to_length": 1, "sequence": "A"}, {"from_length": 76, "to_length": 76}], "rank": 1}
vg: src/pileup.cpp:256: void vg::Pileups::compute_from_alignment(vg::Alignment&): Assertion `alignment.sequence().empty() || alignment.path().mapping_size() == 0 || read_offset == alignment.sequence().length()' failed.
Got signal 6
Manual stack trace:
Stack trace from backtrace() for signal 6:
[0x83cb07]
=================
glennhickey commented 6 years ago

Just looking at the code... It wants Alignment Paths in the GAM to either:

From your output, it looks like you have a Path where the first two Mappings have rank 1, which it doesn't know how to deal with. It's also invalid and may be a warning sign that something went wrong in the GAM creation.

Fixing/removing the ranks should make these errors go away. Changing the code to ignore ranks altogether is another option.

On Fri, Feb 2, 2018 at 8:40 AM, Haynes Heaton notifications@github.com wrote:

was running something like vg augment vg.vg gam.gam -q 15 > augmented.vg

vg.vg had been created from a linear reference + vcfs and was chromosome chunked but put in the same id space then combined with cat vg1.vg vg2.vg... > vg.vg

Also gam.gam was created from multiple gam files with cat. The gam files were fermikit unitigs aligned to the graph reference. Let me know if any more information would be useful.

error im getting is Error determining rank of mapping 1 in path : {"position": {"node_id": 7967739, "offset": 249, "is_reverse": true}, "edit": [{"from_length": 50, "to_length": 50}, {"to_length": 15, "sequence": "AAGCGGTAGTGGTGG"}, {"from_length": 3, "to_length": 3}, {"from_length": 1, "to_length": 1, "sequence": "C"}, {"from_length": 22, "to_length": 22}, {"from_length": 1, "to_length": 1, "sequence": "C"}, {"from_length": 6, "to_length": 6}], "rank": 1} Error determining rank of mapping 2 in path : {"position": {"node_id": 7967739, "offset": 332, "is_reverse": true}, "edit": [{"from_length": 20, "to_length": 20}, {"from_length": 1, "to_length": 1, "sequence": "A"}, {"from_length": 21, "to_length": 21}, {"from_length": 1, "to_length": 1, "sequence": "G"}, {"from_length": 5, "to_length": 5}, {"from_length": 1, "to_length": 1, "sequence": "C"}, {"from_length": 10, "to_length": 10}, {"from_length": 1, "to_length": 1, "sequence": "A"}, {"from_length": 76, "to_length": 76}], "rank": 1} vg: src/pileup.cpp:256: void vg::Pileups::compute_from_alignment(vg::Alignment&): Assertion `alignment.sequence().empty() || alignment.path().mapping_size() == 0 || read_offset == alignment.sequence().length()' failed. Got signal 6 Manual stack trace: Stack trace from backtrace() for signal 6: [0x83cb07]

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1417, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7nX0b0JFy7Y0_4tKKnfIg5XNOrYNks5tQxBYgaJpZM4R3MkZ .

wheaton5 commented 6 years ago

So how do i make it have no ranks in mapping? I'm using default options for my vg map call. These are assembly unitigs from fermikit. so im just using the required -f -x and -g for fastq, xg, and gcsa

glennhickey commented 6 years ago

Are you able to share the read and index that you're using? Or, if not, could you post the full GAM alignment of one of the reads in question. For the first, you should be able to get it from something like (or search for the whole mapping if your're better at grepping special characters than me):

vg view -a gam.gam | grep 7967739 | grep 249 | grep AAGCGGTAGTGGTGG

On Tue, Feb 6, 2018 at 4:31 PM, Haynes Heaton notifications@github.com wrote:

So how do i make it have no ranks in mapping? I'm using default options for my vg map call. These are assembly unitigs from fermikit. so im just using the required -f -x and -g for fastq, xg, and gcsa

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1417#issuecomment-363571250, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7q4S0zRXj7PA0uyLCP8Hveof8nlPks5tSMSigaJpZM4R3MkZ .

wheaton5 commented 6 years ago

Edit: much smaller example now available here https://drive.google.com/open?id=1DoFu3yRs9Zeuyh1pw3Z1dfLoGPkuW0eV each file at most a few 10s of mb

The xg is 1gb and there are 7 vg files each a couple tens of mb. But i subseted the gam to the first 1k reads and it still fails. So I uploaded everything necessary as well as the gcsa and shortened fastq used to generate the gam to my google drive at https://drive.google.com/open?id=1Vz96DCxWtnhP4ESGITXSI55hVuzanT78

So that first mapping in full is the following

{"sequence": "CAACCAATGTAGCACAGATTACCGTTTGATCGTGTTTGGCTTGCTGGTGGAAGCGGTAGTGGTGGTTTCAGGTCCTTGAGTGCTTTGTACCCACTCTCGTTACTTTTATCGAGCGTTTACACTCGAACGAGGAGTGGAGCGAAAGACAAGTAAACTAACTGAAAAAAGCATCCAACTTTTCCAACTGTTGCTTATCGATGCTTTTTACTTACTTTCGATTGTACCCAGAACACAGGGAGAGCCT", "path": {"mapping": [{"position": {"node_id": 7967739, "offset": 101, "is_reverse": true}, "edit": [{"from_length": 7, "to_length": 7}, {"from_length": 1, "to_length": 1, "sequence": "C"}, {"from_length": 97, "to_length": 97}, {"from_length": 1, "to_length": 1, "sequence": "C"}, {"from_length": 2, "to_length": 2}, {"from_length": 1}, {"from_length": 2, "to_length": 2}, {"from_length": 1, "to_length": 1, "sequence": "A"}, {"from_length": 2, "to_length": 2}, {"from_length": 1, "to_length": 1, "sequence": "A"}, {"from_length": 3, "to_length": 3}, {"from_length": 1, "to_length": 1, "sequence": "G"}, {"from_length": 29, "to_length": 29}], "rank": 1}, {"position": {"node_id": 7967739, "offset": 249, "is_reverse": true}, "edit": [{"from_length": 50, "to_length": 50}, {"to_length": 15, "sequence": "AAGCGGTAGTGGTGG"}, {"from_length": 3, "to_length": 3}, {"from_length": 1, "to_length": 1, "sequence": "C"}, {"from_length": 22, "to_length": 22}, {"from_length": 1, "to_length": 1, "sequence": "C"}, {"from_length": 6, "to_length": 6}], "rank": 1}, {"position": {"node_id": 7967739, "offset": 332, "is_reverse": true}, "edit": [{"from_length": 20, "to_length": 20}, {"from_length": 1, "to_length": 1, "sequence": "A"}, {"from_length": 21, "to_length": 21}, {"from_length": 1, "to_length": 1, "sequence": "G"}, {"from_length": 5, "to_length": 5}, {"from_length": 1, "to_length": 1, "sequence": "C"}, {"from_length": 10, "to_length": 10}, {"from_length": 1, "to_length": 1, "sequence": "A"}, {"from_length": 76, "to_length": 76}], "rank": 1}, {"position": {"node_id": 7967737, "is_reverse": true}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 2}, {"position": {"node_id": 7967736, "is_reverse": true}, "edit": [{"from_length": 9, "to_length": 9}], "rank": 3}]}, "name": "45165544:29134291\t62\t.\t80228819,56;121709999,77;", "quality": "BgYGBgYHBwcHBwcHCAgICAkJCQoLCwsLCwsLCwsLCwwMDAwNDQ4ODg4PEBARERESEhITFBQUFBUVFRUVFhYWFhYWFhYVFRUVFRUVFRUUExQVFRUVFRYWFhYXFxcXFxgYGBgXFxYWFhYWFRUVFRYWFhUVFRUVFRUVFBQUFBQUFBQUFBQTFBQVFBQUFBQUExISEREREBAQDw8PEBEQEBERERAQEBAQEBARERAQEREREREREREQDw8PDw8ODg4ODg8PDw8ODg4ODg4ODxAQEBAQEBAPDw8PEBAQDw8PDg4ODg8PDw8QEBAQEA8QEBAQDw8QEBAQEBAQEREREREQEA8ODw8ODw8PEBEREREREBAQEA8PDw8PDw8PEBARERERERERERAPDw8QEBAQERESEhEQEBERERERERISEhEREREREREREREQEBAQDw8PDw8PDg0NDQ0NDAwMDAwMDAsLCwsLCwsLCwoKCgkJCQgHBwcHBwcHBwcHBwcHBwcHBwYGBQUFBQUFBQUFBQ==", "mapping_quality": 60, "score": 175, "identity": 0.93350383631713552}

but that warning is not what it is failing on. It appears to be failing on the following mapping

{"sequence": "TTGAATGAAAATAGCTGAGAGAGAGAGAGAGAAAAAGAGAAGTAAAACAAACCTATAAATAGACACACAGTTACACAATAAACAACAATAAACGTTATTGTATGGAACATATGTGTTGTTGTGATTCTTTCTTACGTGTATGTAATTGCCCCGAAATGGTTGCTACTGAGGTGTTGGATAGCGAATCGATTGCCTTGTGATTGAAATAC", "path": {"mapping": [{"position": {"node_id": 1254943, "offset": 7, "is_reverse": true}, "edit": [{"from_length": 18, "to_length": 18}], "rank": 1}, {"position": {"node_id": 1254942, "is_reverse": true}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 2}, {"position": {"node_id": 1254940, "is_reverse": true}, "edit": [{"from_length": 74, "to_length": 74}], "rank": 3}, {"position": {"node_id": 1254939, "is_reverse": true}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 4}, {"position": {"node_id": 1254937, "is_reverse": true}, "edit": [{"from_length": 12, "to_length": 12}], "rank": 5}, {"position": {"node_id": 1254935, "is_reverse": true}, "edit": [{"from_length": 19, "to_length": 19}], "rank": 6}, {"position": {"node_id": 1254935, "offset": 19, "is_reverse": true}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 4}, {"position": {"node_id": 1254933, "is_reverse": true}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 5}, {"position": {"node_id": 1254932, "is_reverse": true}, "edit": [{"from_length": 9, "to_length": 9}], "rank": 6}, {"position": {"node_id": 1254930, "is_reverse": true}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 7}, {"position": {"node_id": 1254929, "is_reverse": true}, "edit": [{"from_length": 44, "to_length": 44}], "rank": 8}, {"position": {"node_id": 1254929, "offset": 44, "is_reverse": true}, "edit": [{"from_length": 28, "to_length": 28}], "rank": 1}, {"position": {"node_id": 1254927, "is_reverse": true}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 2}, {"position": {"node_id": 1254925, "is_reverse": true}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 3}, {"position": {"node_id": 1254924, "is_reverse": true}, "edit": [{"from_length": 8, "to_length": 8}], "rank": 4}, {"position": {"node_id": 1254923, "is_reverse": true}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 5}, {"position": {"node_id": 1254921, "is_reverse": true}, "edit": [{"from_length": 86, "to_length": 86}], "rank": 6}]}, "name": "85221274:114987190\t33\t137049119,89;\t40979492,86;126705465,77;", "quality": "AQEBAQEBAgICAwMEBAQEBAQEBQUFBQYHBwcHCAgICAgICAgICQkJCQoKCgoKCgoKCgoKCgoKCgoKCgoKCgoLCwsLCwsLCwsLCwsMDA0ODg4ODg4PDw8PDw8PDw8PDw8PDxAQEA8PDw8QEA8PEA8PDg4ODg4ODg0ODg4NDAwMDAwMDAwMDA0NDQwMDAwLCwsLCwsLCwsLCwwMDQ0ODg4ODg4ODQ0ODw8PDw8QEREREBEQDw8PDw8PDg4ODg4ODg4PDw8PDw8ODg4ODg4ODQ0NDQ0NDQ0NDQ0NDg4ODQ0NDQ0NDQ0MDAwMDAwLDAwMDAwMDAwMDAwMDAwMDAwLCwoKCQkJCQkJCQkJCAcHBwcHBgUFBQUEBAQEBAQEBAQEBAQEBAQEAwMDAwMDAwMDAwMDAwMDAwMCAgICAgICAgEBAQEBAQEBAQEBAQEBAQEBAQ==", "mapping_quality": 60, "score": 178, "identity": 1.0}

And if I remove the assert it gets further but then fails with the following error

terminate called after throwing an instance of 'std::runtime_error'
  what():  Found pileup for nonexistent edge {"from": 9227995, "to": 9228001}
Got signal 6
Manual stack trace:
Stack trace from backtrace() for signal 6:
vg(vg::stacktrace_with_backtrace_and_exit(int)+0x27) [0x9e1697]
=================
vg(vg::emit_stacktrace(int, siginfo*, void*)+0x81) [0x9e19e1]
=================
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x2af634157cb0]
=================
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x2af636376035]
=================
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x2af63637979b]
=================
/software/gcc-6.2.0/lib64/libstdc++.so.6(__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x2af635900b5d]
=================
/software/gcc-6.2.0/lib64/libstdc++.so.6(+0x8eb16) [0x2af6358feb16]
=================
/software/gcc-6.2.0/lib64/libstdc++.so.6(+0x8eb61) [0x2af6358feb61]
=================
/software/gcc-6.2.0/lib64/libstdc++.so.6(+0x8ed78) [0x2af6358fed78]
=================
vg() [0x8f30c9]
=================
vg(vg::Pileups::for_each_edge_pileup(std::function<void (vg::EdgePileup&)> const&)+0x91) [0xb02aa1]
=================
vg(main_augment(int, char**)+0x19f5) [0x8f6915]
=================
vg(vg::subcommand::Subcommand::operator()(int, char**) const+0x28) [0x98b2e8]
=================
vg(main+0xa7) [0x5ae457]
=================
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x2af6363617ed]
=================
vg() [0x65a671]
=================
glennhickey commented 6 years ago

Thanks!

The full alignments you posted definitely contain duplicate ranks. I'm seeing rank 1 three times in the the first one and and various ranks duplicated in the second one.

@ekg Any ideas why map would be producing ranks like this?

On Wed, Feb 7, 2018 at 7:02 AM, Haynes Heaton notifications@github.com wrote:

The xg is 1gb and there are 7 vg files each a couple tens of mb. But i subseted the gam to the first 1k reads and it still fails. So I uploaded everything necessary as well as the gcsa and shortened fastq used to generate the gam to my google drive at https://drive.google.com/open?id= 1Vz96DCxWtnhP4ESGITXSI55hVuzanT78 xg and gcsa still uploading as of this post, will edit this away when complete

So that first read in full is the following

{"sequence": "CAACCAATGTAGCACAGATTACCGTTTGATCGTGTTTGGCTTGCTGGTGGAAGCGGTAGTGGTGGTTTCAGGTCCTTGAGTGCTTTGTACCCACTCTCGTTACTTTTATCGAGCGTTTACACTCGAACGAGGAGTGGAGCGAAAGACAAGTAAACTAACTGAAAAAAGCATCCAACTTTTCCAACTGTTGCTTATCGATGCTTTTTACTTACTTTCGATTGTACCCAGAACACAGGGAGAGCCT", "path": {"mapping": [{"position": {"node_id": 7967739, "offset": 101, "is_reverse": true}, "edit": [{"from_length": 7, "to_length": 7}, {"from_length": 1, "to_length": 1, "sequence": "C"}, {"from_length": 97, "to_length": 97}, {"from_length": 1, "to_length": 1, "sequence": "C"}, {"from_length": 2, "to_length": 2}, {"from_length": 1}, {"from_length": 2, "to_length": 2}, {"from_length": 1, "to_length": 1, "sequence": "A"}, {"from_length": 2, "to_length": 2}, {"from_length": 1, "to_length": 1, "sequence": "A"}, {"from_length": 3, "to_length": 3}, {"from_length": 1, "to_length": 1, "sequence": "G"}, {"from_length": 29, "to_length": 29}], "rank": 1}, {"position": {"node_id": 7967739, "offset": 249, "is_reverse": true}, "edit": [{"from_length": 50, "to_length": 50}, {"to_length": 15, "sequence": "AAGCGGTAGTGGTGG"}, {"from_length": 3, "to_length": 3}, {"from_length": 1, "to_length": 1, "sequence": "C"}, {"from_length": 22, "to_length": 22}, {"from_length": 1, "to_length": 1, "sequence": "C"}, {"from_length": 6, "to_length": 6}], "rank": 1}, {"position": {"node_id": 7967739, "offset": 332, "is_reverse": true}, "edit": [{"from_length": 20, "to_length": 20}, {"from_length": 1, "to_length": 1, "sequence": "A"}, {"from_length": 21, "to_length": 21}, {"from_length": 1, "to_length": 1, "sequence": "G"}, {"from_length": 5, "to_length": 5}, {"from_length": 1, "to_length": 1, "sequence": "C"}, {"from_length": 10, "to_length": 10}, {"from_length": 1, "to_length": 1, "sequence": "A"}, {"from_length": 76, "to_length": 76}], "rank": 1}, {"position": {"node_id": 7967737, "is_reverse": true}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 2}, {"position": {"node_id": 7967736, "is_reverse": true}, "edit": [{"from_length": 9, "to_length": 9}], "rank": 3}]}, "name": "45165544:29134291\t62\t.\t80228819,56;121709999,77;", "quality": "BgYGBgYHBwcHBwcHCAgICAkJCQoLCwsLCwsLCwsLCwwMDAwNDQ4ODg4PEBARERESEhITFBQUFBUVFRUVFhYWFhYWFhYVFRUVFRUVFRUUExQVFRUVFRYWFhYXFxcXFxgYGBgXFxYWFhYWFRUVFRYWFhUVFRUVFRUVFBQUFBQUFBQUFBQTFBQVFBQUFBQUExISEREREBAQDw8PEBEQEBERERAQEBAQEBARERAQEREREREREREQDw8PDw8ODg4ODg8PDw8ODg4ODg4ODxAQEBAQEBAPDw8PEBAQDw8PDg4ODg8PDw8QEBAQEA8QEBAQDw8QEBAQEBAQEREREREQEA8ODw8ODw8PEBEREREREBAQEA8PDw8PDw8PEBARERERERERERAPDw8QEBAQERESEhEQEBERERERERISEhEREREREREREREQEBAQDw8PDw8PDg0NDQ0NDAwMDAwMDAsLCwsLCwsLCwoKCgkJCQgHBwcHBwcHBwcHBwcHBwcHBwYGBQUFBQUFBQUFBQ==", "mapping_quality": 60, "score": 175, "identity": 0.93350383631713552}

but that warning is not what it is failing on. It appears to be failing on the following read

{"sequence": "TTGAATGAAAATAGCTGAGAGAGAGAGAGAGAAAAAGAGAAGTAAAACAAACCTATAAATAGACACACAGTTACACAATAAACAACAATAAACGTTATTGTATGGAACATATGTGTTGTTGTGATTCTTTCTTACGTGTATGTAATTGCCCCGAAATGGTTGCTACTGAGGTGTTGGATAGCGAATCGATTGCCTTGTGATTGAAATAC", "path": {"mapping": [{"position": {"node_id": 1254943, "offset": 7, "is_reverse": true}, "edit": [{"from_length": 18, "to_length": 18}], "rank": 1}, {"position": {"node_id": 1254942, "is_reverse": true}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 2}, {"position": {"node_id": 1254940, "is_reverse": true}, "edit": [{"from_length": 74, "to_length": 74}], "rank": 3}, {"position": {"node_id": 1254939, "is_reverse": true}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 4}, {"position": {"node_id": 1254937, "is_reverse": true}, "edit": [{"from_length": 12, "to_length": 12}], "rank": 5}, {"position": {"node_id": 1254935, "is_reverse": true}, "edit": [{"from_length": 19, "to_length": 19}], "rank": 6}, {"position": {"node_id": 1254935, "offset": 19, "is_reverse": true}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 4}, {"position": {"node_id": 1254933, "is_reverse": true}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 5}, {"position": {"node_id": 1254932, "is_reverse": true}, "edit": [{"from_length": 9, "to_length": 9}], "rank": 6}, {"position": {"node_id": 1254930, "is_reverse": true}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 7}, {"position": {"node_id": 1254929, "is_reverse": true}, "edit": [{"from_length": 44, "to_length": 44}], "rank": 8}, {"position": {"node_id": 1254929, "offset": 44, "is_reverse": true}, "edit": [{"from_length": 28, "to_length": 28}], "rank": 1}, {"position": {"node_id": 1254927, "is_reverse": true}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 2}, {"position": {"node_id": 1254925, "is_reverse": true}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 3}, {"position": {"node_id": 1254924, "is_reverse": true}, "edit": [{"from_length": 8, "to_length": 8}], "rank": 4}, {"position": {"node_id": 1254923, "is_reverse": true}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 5}, {"position": {"node_id": 1254921, "is_reverse": true}, "edit": [{"from_length": 86, "to_length": 86}], "rank": 6}]}, "name": "85221274:114987190\t33\t137049119,89;\t40979492,86;126705465,77;", "quality": "AQEBAQEBAgICAwMEBAQEBAQEBQUFBQYHBwcHCAgICAgICAgICQkJCQoKCgoKCgoKCgoKCgoKCgoKCgoKCgoLCwsLCwsLCwsLCwsMDA0ODg4ODg4PDw8PDw8PDw8PDw8PDxAQEA8PDw8QEA8PEA8PDg4ODg4ODg0ODg4NDAwMDAwMDAwMDA0NDQwMDAwLCwsLCwsLCwsLCwwMDQ0ODg4ODg4ODQ0ODw8PDw8QEREREBEQDw8PDw8PDg4ODg4ODg4PDw8PDw8ODg4ODg4ODQ0NDQ0NDQ0NDQ0NDg4ODQ0NDQ0NDQ0MDAwMDAwLDAwMDAwMDAwMDAwMDAwMDAwLCwoKCQkJCQkJCQkJCAcHBwcHBgUFBQUEBAQEBAQEBAQEBAQEBAQEAwMDAwMDAwMDAwMDAwMDAwMCAgICAgICAgEBAQEBAQEBAQEBAQEBAQEBAQ==", "mapping_quality": 60, "score": 178, "identity": 1.0}

And if I remove the assert it gets further but then fails with the following error

terminate called after throwing an instance of 'std::runtime_error' what(): Found pileup for nonexistent edge {"from": 9227995, "to": 9228001} Got signal 6 Manual stack trace: Stack trace from backtrace() for signal 6: vg(vg::stacktrace_with_backtrace_and_exit(int)+0x27) [0x9e1697]

vg(vg::emit_stacktrace(int, siginfo, void)+0x81) [0x9e19e1]

/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x2af634157cb0]

/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x2af636376035]

/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x2af63637979b]

/software/gcc-6.2.0/lib64/libstdc++.so.6(__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x2af635900b5d]

/software/gcc-6.2.0/lib64/libstdc++.so.6(+0x8eb16) [0x2af6358feb16]

/software/gcc-6.2.0/lib64/libstdc++.so.6(+0x8eb61) [0x2af6358feb61]

/software/gcc-6.2.0/lib64/libstdc++.so.6(+0x8ed78) [0x2af6358fed78]

vg() [0x8f30c9]

vg(vg::Pileups::for_each_edge_pileup(std::function<void (vg::EdgePileup&)> const&)+0x91) [0xb02aa1]

vg(main_augment(int, char**)+0x19f5) [0x8f6915]

vg(vg::subcommand::Subcommand::operator()(int, char**) const+0x28) [0x98b2e8]

vg(main+0xa7) [0x5ae457]

/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x2af6363617ed]

vg() [0x65a671]

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1417#issuecomment-363748341, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7pcGiIFD9IMuWmvErhnuIvj97LxQks5tSZC6gaJpZM4R3MkZ .

wheaton5 commented 6 years ago

I'm happy to do some work around if there is something simple. Can I remove the ranks from the json and convert back to gam?

glennhickey commented 6 years ago

I just PR'd a patch to ignore the rank check, which will hopefully fix things for you. Will try to look at the underlying issue in map with the data you sent soon.

On Fri, Feb 9, 2018 at 9:05 AM, Haynes Heaton notifications@github.com wrote:

I'm happy to do some work around if there is something simple. Can I remove the ranks from the json and convert back to gam?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1417#issuecomment-364442407, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7ib40Qq03gITnyTxC6wfd2H1NsiAks5tTFCvgaJpZM4R3MkZ .

wheaton5 commented 6 years ago

Thanks, I tried that and the warnings go away, but still fails on the assert.

vg: src/pileup.cpp:256: void vg::Pileups::compute_from_alignment(vg::Alignment&): Assertion `alignment.sequence().empty() || alignment.path().mapping_size() == 0 || read_offset == alignment.sequence().length()' failed.

glennhickey commented 6 years ago

I see. Thanks for your patience, I've finally got your smaller example open in the debugger and am reproducing that error.

In this case it's on a read with length 297 that's getting aligned as a 475bp insertion. There's something pretty wacky about that gam.

{"sequence": "ACCAACTTTTGTACTGCGACAAATTTTCTGCGAGCTCTTTGTTCTAAAACAACCTCTCAGTTTTAGAACAAAATCAGAGGCAACCGCGAGTGTCGCGGATTTGTGAATTTTTCACAGAAAGGTAATGCTCGCGTTTCGCGACAGTACGATCGAAGTAACCCAAGTGACATCTTCGAGTAGATTTTTTCGTTTTTCGAGCTGCTGGAAGCTTAAAGAGCAGGCTTAAATTTAGCTTAAAGAGCTGGCGATATCATAGACTTTAAGCTTGTTATAACAGTTAAAACGTCAGAGTGAAGC", "path": {"mapping": [{"position": {"node_id": 110931, "offset": 85}, "edit": [{"to_length": 178, "sequence": "TAAAGCGGAAAGGACCCCATTTCTAGCTTTTAAGCGGTTTCAATTTGTTAGCAAACATGCAATTTATACCTAACGTGAAGCAAATTGTACTGATATGACATTTAACCTGCCAAATTTAGAAAACCGCGTATCTCCGAATCCGCGTAAGTCGAGAACCGTGTAACACTGCTTCTAGACG"}, {"to_length": 120, "sequence": "ACCAACTTTTGTACTGCGACAAATTTTCTGCGAGCTCTTTGTTCTAAAACAACCTCTCAGTTTTAGAACAAAATCAGAGGCAACCGCGAGTGTCGCGGATTTGTGAATTTTTCACAGAAA"}, {"to_length": 177, "sequence": "GGTAATGCTCGCGTTTCGCGACAGTACGATCGAAGTAACCCAAGTGACATCTTCGAGTAGATTTTTTCGTTTTTCGAGCTGCTGGAAGCTTAAAGAGCAGGCTTAAATTTAGCTTAAAGAGCTGGCGATATCATAGACTTTAAGCTTGTTATAACAGTTAAAACGTCAGAGTGAAGC"}]}]}, "name": "123753575:102513458\t77\t147686703,57;\t80046144,89;123718764,63;", "quality": "AQICAgICAgICAwMDAwMDAwMDAwMDAwMDAwMDAwMEBAQFBQYHBwcHBwcHCAkJCQkJCgsLDAwMDAwMDA0ODg8PDw8PEBAQEBESEhMUFBUVFhYWFhYWFhYWFxcXGBgYGBgYGBgYGBcWFhcYGBgYGRgYGBkaGhoaGhobGxwcHR0dHR4fHh4eHh4dHBwcHR0dHRwcHBwcHBwbHBscHBwcHBwbGhoZGRkZGhkZGRkYFxcXFhYVFRQUFBQUFRUVFRQUFBMTFBQUFBQUFBQUFBQTExMTFBQUFRUVFBUVFRUWFRUUFBMTExMTEhISEhITExMUFBQUFBQUExMTExQUFRQUExQVFhYWFhYWFhYWFhUVFRUVFRUVFBQUFBQVFRYXFxYWFhYWFhYWFhUVFhYWFhYWFhYWFhYWFhUUFBMTEhIRERESEREREREREREREBAQEBAPDw8PDg4NDQ4ODg4ODg4NDAsLDA0NDAsLCwwMDA0NDQ0NDQ0ODg4ODg4ODw8PDg4NDAwMDAwMDAwMDAwMDAsLCwsLCwsLCwsKCgoKCgoKCgoKCgoKCQkJCQkJCQkJCQkJCQkJCQgICAgICAgHBwcHBwcHBwcHBwYFBAQEBAQDAwMCAgICAgICAQEBAQEBAQ==", "mapping_quality": 60}

I've re-indexed and re-mapped and get the same type of errors. I also tried filtering out score 0 alignments (like above), but the same type of problems come up.

I'm trying mpmap -S (an alternative mapper implementation) and it seems to be taking forever, but will be curious what it returns.

Conclusion: there's something up with your graph/reads that's breaking our mappers. @jeizenga @ekg there's data links upthread to reproduce (it'd help to add the .lcp to smaller_sample/ though).

On Fri, Feb 9, 2018 at 9:35 AM, Haynes Heaton notifications@github.com wrote:

Thanks, I tried that and the warnings go away, but still fails on the assert.

vg: src/pileup.cpp:256: void vg::Pileups::compute_from_alignment(vg::Alignment&): Assertion `alignment.sequence().empty() || alignment.path().mapping_size() == 0 || read_offset == alignment.sequence().length()' failed.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1417#issuecomment-364450310, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7oNfk2Mv_q_KwdFnJtmZTZB6PnSmks5tTFeagaJpZM4R3MkZ .

glennhickey commented 6 years ago

I've tried aligning the same reads to one of our test graphs and get the same type of error. So it's definitely something to do with the fastq file.

I'm wondering if it's the variable read length that's at issue?

glennhickey commented 6 years ago

It seems to be the read length. Reproduce on any index with something like

vg map -s CCCGGTGCACTCGCTGCCGCCATGAACGAGTTCGCGATCCGGGACATCGATCTCACCCGTATCGAGTCCCGCCCCACCCGGACCGGGCTGGGCACGTATCGCTTTTTCCTGGACTGTGTCGGGCATATCGACGATGTCGCGGTCGGGGAGGCACTCAAGGGACTGCATCGTCGTTGTGAGGATGTGCGGTATTTGGGATCGTGGCCACGAGAGACGACGGCACCCGACGGCGCCAACCCACCGGTGTTGGACGAGGCGTCGGGATGGCTGGCCGAAACGCGGGAAGGGAGGCTGCGATGAGTGGGCGCCTTGTCCTGGTCCGGCATGGACAGTCTTATGGAAACGTCGAGCGCCGACTGGACACCAAACCGCCCGGTGCGGCGCTGACCGAACTCGGATTGCAACAGGCGCGCCTTTTCGCCAAGCAGTATGAAGAGCTGGCGCCAGCGGTCCTGGTGCATTCTGTGGCGCTGCGCGCGGTCCAGACCGCCTCCGAGATCGCGGGACATCTTGGTGTGGAAGCTCTTCAGCTCGACGGTTTGCACGAGGTACAGGCCGGGGAGTTGGAGGACCACACCGGCGTGGAATCGTTTGAGGTTTTCGACCGCACGTATGAGCGTTGGCATTTTGGTGACTTCGGGACCCGTTTGCCTGGTGGCGAATCCGCGCAGGACGTGTTCGACAGGTACCTGCCCGTTGTCGCCGAACTCCGGTTGCGGCACTTGGAAAACGACGCATCAACCGGCGATGTGGTGGTGGTGAGCCACGGCGCGGCGATTCGCCTGGTGGCGGCCACACTCGCCGGAGTGGATCCGGGTTTCGCGGTGAACCATCACCTGCGTAATGCGGAAGCGGTTGTGCTGGCGCCGATTACCGATGGGCGATGGAGCTGCGTGCACTGGGGTGATGCGGTGGCGCCATTCCCGCATCAGGAGGCTGCCAGCGCAGAGGAACTAGCGCAGTCGGCCGACCCGATGGGCTAACCGGCTAGTTGTATCTCGCCTTCGCAATCACATTGTGCGGCAATGCAATCGATAACGAGGCTGTGTCGAAGTAGATAGAGTCCGTCGCAATCGCGATCTGTGCACTCGGCGTGGCGAGTGGAATGCACGATGAGAGTGCCGTGACAGTGATCGATGTCCTGGATGCAGTCACGGCATGCCACACCGCAGCTGATGGCGTGGCGCCACTCCGGTGATTTCTTCATACCGGTGTTGTAGCACCGTATGCCGACAAGTCGTCGTAACCGAGCGGGCGGAAGAGCTCTTTTCTATATTTATATTGCAATGCAATCTGAATATCGATACTATTCCAACTGGTTGGAACCAAGCACGTCACACTGCGATGCAAGGGAGTTGGACTATGGGCAGCCTGCGAGACGACCCGAGGTACGCGGGAACTGTTGAGGAGCGTCGCAGGCGCGACCGGCGGTACGACTTCCTGTTGAAGCTGCACGGGAAGAACGGCAAGAAGGCATTGCAGATCGCCGAGGCGCCGCGCCAGGACGACGGCTACTTCGGGCTCGACTCGCTCAGCTTCAAGGTCTACGAAAACGTCATGATCGCCGGGATGGGTGCACTCTCAGGGTTGTTCATCTCCACCATCGACCCCGATGGCGCTTACGGCGTCGGCCAGCACACGACGTACTACTACGACACGGTGGGCAGGATCCGCCGGAGTCTGTTGTTTTTCGCCGGCGCTGTCGCGGGAGACACCGACGCGGCGGTCAAGGTAGGGCGAGATCTTTTCCGTAAGCACTCGCACGTCAATGGTGATGTGCCATCGACCGGCGAGTCGTATCGTGCGAATCATATTGAGACGCTGAAGTTTACGTATGTGGTCGGGTGGCCGCACCTCTGGCGTGCCTACAAGAGGTTTGGCGACCCGAACGCGACGTATGAGGACGAGTGCGTGTTCTACAAGGAGCAGGTGCGCGTGTGCGAGTTGATGGGCATGCCGGCGGGCGAGCTTCCCGA -x index.xg -g index.gcsa

vg mpmap seems to work though. But it's super slow.

glennhickey commented 6 years ago

Hopefully @ekg will chime in, but one work around would be to filter all reads that are more than, say, 150bp our of your fastq file. You could then align the remaining reads with vg map, and then try putting the long ones in a fasta file to align with vg msga.

wheaton5 commented 6 years ago

erik told me to try vg mod -i instead of vg augment. But that also seems to be very slow. I have not tried vg msga. Probably I dont really care about the unitigs that are 150bp or less. The reason I'm doing this is that there are 2-5kb stretches that are completely diverged in some samples and i would like to include those in the graph if i have some unitigs that span them.

glennhickey commented 6 years ago

You can indeed augment the graph without crashing using vg augment -a direct. This is basically a faster version of vg mod -i. But given the GAM itself is completely corrupt, there's no reason the output of augment won't be nonsense even if it doesn't have an assert to get tripped up on. You can even make a VCF with genotype (which uses this augmenter)

vg genotype reference.vg -G gam.gam -r 3L -v > vcf.vcf

but again, I wouldn't put much stock in the results.

wheaton5 commented 6 years ago

Right. Ok. I guess I'll try msga.

wheaton5 commented 6 years ago

tried making a fasta file with just unitigs > 500bp that map to the linear reference in a 1mb region. Then had a vg graph of just that region with some variants from a vcf added and got the following error

command was vg msga -g reference.vg -f unitigs.fa > msga.vg


[vg msga] failed to include alignment, retrying
expected CAGGCTATCCGGCAGCAGCGGCTCACCCACAAGCAACGCATCAAGGAGTATCTGCGCCGCAGCACGTCCCAGTTTTTCGGCGTCGACCAGCTGCACGATGACTACGAGCAGCAGAAGTGGGAGGACCGGCAGAAGCGGTTCGCCATCCGGCGCTTCGGCT
CGCTCAAGGACGAGCTGCCACCGAACGGGCGCGGACCGTCCGAACCGAGCGCGGACGTCATGCACGACCACCTGACGCACAGCGACCGGCCGGACGTGCTGCCGGCCCAGAGCCAGGACGAGCCGGAAACGGAGCAGAACCGCCGGCAGCGCTTCAACCCGTACTACCG
GCACGAGGGCGGCGAAGAGTTTCTGGTCGAGCGGAAACCGTCCGTGTCGCGCATGATGCTCAGCGGGCTGGTGTTTGTGGTGCAGAGCCTGCGGAACCGGGTGCCGCGCCGGCAGAAGCAGTGGTCCCGCAGCTTCGCCCCGGCGCACGTCGCCCTCAACAATGACGAT
AACGATATCTACGATGGGCTAACGCCGGTGCAGGAGGATGAGATGTTTTTCGATGTGCCGGCGGGTGGTGGTCAGAATGGTGCGGGCGGTAGTGTCCCGCCGACCCAGCAGGAGCAAGATCTGGCCGGTGCGCCAGTGCCGGACAATTCGCACCAGATCTACATGCTCG
AGACGGACCGGGCGCTCGTGAGTAGCGGGTGGCGGACGCGCACGGCCGAGGAGCAGCTGCATCTGCAGGACGTTGCGGCCGGGCGCAACAACGCCATGTTTCAGTTCTTTTACGGCCAGCGCATACCGGGCACGATACTGGAGAGCGTGCTGGACAACTCGAGACGTCC
GCCGCGCCACTGCATCAAGCTGCTGCGGCCGAACGCGCTGGACGACCGGTACGACTACCGGCCGTACTTTACATACTGGAT
got      CAGGCTATCCGGCAGCAGCGGCTCACCCACAAGCAACGCATCAAGGAGTATCTGCGCCGCAGCACGTCCCAGTTTTTCGGCGTCGACCAGCTGCACGATGACTACGAGCAGCAGAAGTGGGAGGACCGGCAGAAGCGGTTCGCCATCCGGCGCTTCGGCT
CGCTCAAGGACGAGCTGCCACCGAACGGGCGCGGACCGTCCGAACCGAGCGCGGACGTCATGCACGACCACCTGACGCACAGCGACCGGCCGGACGTGCTGCCGGCCCAGAGCCAGGACGAGCCGGAAACGGAGCAGAACCGCCGGCAGCGCTTCAACCCGTACTACCG
GCACGAGGGCGGCGAAGAGTTTCTGGTCGAGCGGAAACCGTCCGTGTCGCGCATGATGCTCAGCGGGCTGGTGTTTGTGGTGCAGAGCCTGCGGAACCGGGTGCCGCGCCGGCAGAAGCAGTGGTCCCGCAGCTTCGCCCCGGCGCACGTCGCCCTCAACAATGACGAT
AACGATATCTACGATGGGCTAACGCCGGTGCAGGAGGATGAGATGTTTTTCGATGTGCCGGCGGGTGGTGGTCAGAATGGTGCGGGCGGTAGTGTCCCGCCGACCCAGCAGGAGCAAGATCTGGCCGGTGCGCCAGTGCCGGACAATTCGCACCAGATCTACATGCTCG
AGACGGACCGGGCGCTCGTGAGTAGCGGGTGGCGGACGCGCACGGCCGAGGAGCAGCTGCATCTGCAGGACGTTGCGGCCGGGCGCAACAACGCCATGTTTCAGTTCTTTTACGGCCAGCGCATACCGGGCACGATACTGGAGAGCGTGCTGGACAACTCGAGACGTCC
GCCGCGCCACTGCATCAAGCTGCTGCGGCCGAACGCGCTGGACGACCGGTACGACTACCGGCCGTACTTTACATACTGGATCAGGCTATCCGGCAGCAGCGGCTCACCCACAAGCAACGCATCAAGGAGTATCTGCGCCGCAGCACGTCCCAGTTTTTCGGCGTCGACC
AGCTGCACGATGACTACGAGCAGCAGAAGTGGGAGGACCGGCAGAAGCGGTTCGCCATCCGGCGCTTCGGCTCGCTCAAGGACGAGCTGCCACCGAACGGGCGCGGACCGTCCGAACCGAGCGCGGACGTCATGCACGACCACCTGACGCACAGCGACCGGCCGGACGT
GCTGCCGGCCCAGAGCCAGGACGAGCCGGAAACGGAGCAGAACCGCCGGCAGCGCTTCAACCCGTACTACCGGCACGAGGGCGGCGAAGAGTTTCTGGTCGAGCGGAAACCGTCCGTGTCGCGCATGATGCTCAGCGGGCTGGTGTTTGTGGTGCAGAGCCTGCGGAAC
CGGGTGCCGCGCCGGCAGAAGCAGTGGTCCCGCAGCTTCGCCCCGGCGCACGTCGCCCTCAACAATGACGATAACGATATCTACGATGGGCTAACGCCGGTGCAGGAGGATGAGATGTTTTTCGATGTGCCGGCGGGTGGTGGTCAGAATGGTGCGGGCGGTAGTGTCC
CGCCGACCCAGCAGGAGCAAGATCTGGCCGGTGCGCCAGTGCCGGACAATTCGCACCAGATCTACATGCTCGAGACGGACCGGGCGCTCGTGAGTAGCGGGTGGCGGACGCGCACGGCCGAGGAGCAGCTGCATCTGCAGGACGTTGCGGCCGGGCGCAACAACGCCAT
GTTTCAGTTCTTTTACGGCCAGCGCATACCGGGCACGATACTGGAGAGCGTGCTGGACAACTCGAGACGTCCGCCGCGCCACTGCATCAAGCTGCTGCGGCCGAACGCGCTGGACGACCGGTACGACTACCGGCCGTACTTTACATACTGGAT
{"mapping": [{"position": {"node_id": 43183}, "edit": [{"from_length": 5, "to_length": 5}], "rank": 1}, {"position": {"node_id": 43184}, "edit": [{"from_length": 1, "to_
length": 1}], "rank": 2}, {"position": {"node_id": 43186}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 3}, {"position": {"node_id": 43187}, "edit": [{"from_l
ength": 29, "to_length": 29}], "rank": 4}, {"position": {"node_id": 43188}, "edit": [{"from_length": 18, "to_length": 18}], "rank": 5}, {"position": {"node_id": 43188, "
offset": 18}, "edit": [{"from_length": 11, "to_length": 11}], "rank": 2}, {"position": {"node_id": 43189}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 3}, {"
position": {"node_id": 43190}, "edit": [{"from_length": 18, "to_length": 18}], "rank": 4}, {"position": {"node_id": 43190, "offset": 18}, "edit": [{"from_length": 11, "t
o_length": 11}], "rank": 2}, {"position": {"node_id": 43191}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 3}, {"position": {"node_id": 43193}, "edit": [{"fro
m_length": 1, "to_length": 1}], "rank": 4}, {"position": {"node_id": 43194}, "edit": [{"from_length": 17, "to_length": 17}], "rank": 5}, {"position": {"node_id": 43194,"offset": 17}, "edit": [{"from_length": 14, "to_length": 14}], "rank": 3}, {"position": {"node_id": 43195}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 4}, {"position": {"node_id": 43196}, "edit": [{"from_length": 13, "to_length": 13}], "rank": 5}, {"position": {"node_id": 43196, "offset": 13}, "edit": [{"from_length": 18, "to_length": 18}], "rank": 2}, {"position": {"node_id": 43197}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 3}, {"position": {"node_id": 43198}, "edit": [{"from_length": 9, "to_length": 9}], "rank": 4}, {"position": {"node_id": 43198, "offset": 9}, "edit": [{"from_length": 22, "to_length": 22}], "rank": 2}, {"position": {"node_id": 43199}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 3}, {"position": {"node_id": 43200}, "edit": [{"from_length": 5, "to_length": 5}], "rank": 4}, {"position": {"node_id": 43200, "offset": 5}, "edit": [{"from_length": 26, "to_length": 26}], "rank": 2}, {"position": {"node_id": 43201}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 3}, {"position": {"node_id": 43202}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 4}, {"position": {"node_id": 43202, "offset": 1}, "edit": [{"from_length": 30, "to_length": 30}], "rank": 2}, {"position": {"node_id": 43203}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 3}, {"position": {"node_id": 43205}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 4}, {"position": {"node_id": 43206}, "edit": [{"from_length": 26, "to_length": 26}], "rank": 5}, {"position": {"node_id": 43206, "offset": 26}, "edit": [{"from_length": 6, "to_length": 6}], "rank": 4}, {"position": {"node_id": 43208}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 5}, {"position": {"node_id": 43209}, "edit": [{"from_length": 11, "to_length": 11}], "rank": 6}, {"position": {"node_id": 43210}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 7}, {"position": {"node_id": 43212}, "edit": [{"from_length": 27, "to_length": 27}], "rank": 8}, {"position": {"node_id": 43213}, "edit": [{"from_length": 12, "to_length": 12}], "rank": 9}, {"position": {"node_id": 43213, "offset": 12}, "edit": [{"from_length": 15, "to_length": 15}], "rank": 2}, {"position": {"node_id": 43214}, "edit": [{"from_length": 27, "to_length": 27}], "rank": 3}, {"position": {"node_id": 43216}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 4}, {"position": {"node_id": 43217}, "edit": [{"from_length": 15, "to_length": 15}], "rank": 5}, {"position": {"node_id": 43217, "offset": 15}, "edit": [{"from_length": 10, "to_length": 10}], "rank": 3}, {"position": {"node_id": 43218}, "edit": [{"from_length": 25, "to_length": 25}], "rank": 4}, {"position": {"node_id": 43219}, "edit": [{"from_length": 23, "to_length": 23}], "rank": 5}, {"position": {"node_id": 43219, "offset": 23}, "edit": [{"from_length": 2, "to_length": 2}], "rank": 2}, {"position": {"node_id": 43220}, "edit": [{"from_length": 25, "to_length": 25}], "rank": 3}, {"position": {"node_id": 43222}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 4}, {"position": {"node_id": 43223}, "edit": [{"from_length": 26, "to_length": 26}], "rank": 5}, {"position": {"node_id": 43225}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 6}, {"position": {"node_id": 43226}, "edit": [{"from_length": 3, "to_length": 3}], "rank": 7}, {"position": {"node_id": 43226, "offset": 3}, "edit": [{"from_length": 26, "to_length": 26}], "rank": 3}, {"position": {"node_id": 43227}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 4}, {"position": {"node_id": 43228}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 5}, {"position": {"node_id": 43230}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 6}, {"position": {"node_id": 43231}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 7}, {"position": {"node_id": 43231, "offset": 1}, "edit": [{"from_length": 26, "to_length": 26}], "rank": 4}, {"position": {"node_id": 43232}, "edit": [{"from_length": 27, "to_length": 27}], "rank": 5}, {"position": {"node_id": 43233}, "edit": [{"from_length": 5, "to_length": 5}], "rank": 6}, {"position": {"node_id": 43233, "offset": 5}, "edit": [{"from_length": 22, "to_length": 22}], "rank": 3}, {"position": {"node_id": 43235}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 4}, {"position": {"node_id": 43236}, "edit": [{"from_length": 17, "to_length": 17}], "rank": 5}, {"position": {"node_id": 43237}, "edit": [{"from_length": 17, "to_length": 17}], "rank": 6}, {"position": {"node_id": 43239}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 7}, {"position": {"node_id": 43240}, "edit": [{"from_length": 14, "to_length": 14}], "rank": 8}, {"position": {"node_id": 43242}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 9}, {"position": {"node_id": 43243}, "edit": [{"from_length": 8, "to_length": 8}], "rank": 10}]}
{"mapping": [{"position": {"node_id": 43183}, "edit": [{"from_length": 5, "to_length": 5}], "rank": 1}, {"position": {"node_id": 43184}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 2}, {"position": {"node_id": 43186}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 3}, {"position": {"node_id": 43187}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 4}, {"position": {"node_id": 43188}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 5}, {"position": {"node_id": 43189}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 6}, {"position": {"node_id": 43190}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 7}, {"position": {"node_id": 43191}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 8}, {"position": {"node_id": 43193}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 9}, {"position": {"node_id": 43194}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 10}, {"position": {"node_id": 43195}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 11}, {"position": {"node_id": 43196}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 12}, {"position": {"node_id": 43197}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 13}, {"position": {"node_id": 43198}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 14}, {"position": {"node_id": 43199}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 15}, {"position": {"node_id": 43200}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 16}, {"position": {"node_id": 43201}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 17}, {"position": {"node_id": 43202}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 18}, {"position": {"node_id": 43203}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 19}, {"position": {"node_id": 43205}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 20}, {"position": {"node_id": 43206}, "edit": [{"from_length": 32, "to_length": 32}], "rank": 21}, {"position": {"node_id": 43208}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 22}, {"position": {"node_id": 43209}, "edit": [{"from_length": 11, "to_length": 11}], "rank": 23}, {"position": {"node_id": 43210}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 24}, {"position": {"node_id": 43212}, "edit": [{"from_length": 27, "to_length": 27}], "rank": 25}, {"position": {"node_id": 43213}, "edit": [{"from_length": 27, "to_length": 27}], "rank": 26}, {"position": {"node_id": 43214}, "edit": [{"from_length": 27, "to_length": 27}], "rank": 27}, {"position": {"node_id": 43216}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 28}, {"position": {"node_id": 43217}, "edit": [{"from_length": 25, "to_length": 25}], "rank": 29}, {"position": {"node_id": 43218}, "edit": [{"from_length": 25, "to_length": 25}], "rank": 30}, {"position": {"node_id": 43219}, "edit": [{"from_length": 25, "to_length": 25}], "rank": 31}, {"position": {"node_id": 43220}, "edit": [{"from_length": 25, "to_length": 25}], "rank": 32}, {"position": {"node_id": 43222}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 33}, {"position": {"node_id": 43223}, "edit": [{"from_length": 26, "to_length": 26}], "rank": 34}, {"position": {"node_id": 43225}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 35}, {"position": {"node_id": 43226}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 36}, {"position": {"node_id": 43227}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 37}, {"position": {"node_id": 43228}, "edit": [{"from_length": 1, "to_length": 1}], "rank":38}, {"position": {"node_id": 43230}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 39}, {"position": {"node_id": 43231}, "edit": [{"from_length": 27, "to_length": 27}], "rank": 40}, {"position": {"node_id": 43232}, "edit": [{"from_length": 27, "to_length": 27}], "rank": 41}, {"position": {"node_id": 43233}, "edit": [{"from_length": 27, "to_length": 27}], "rank": 42}, {"position": {"node_id": 43235}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 43}, {"position": {"node_id": 43236}, "edit": [{"from_length": 17, "to_length": 17}], "rank": 44}, {"position": {"node_id": 43237}, "edit": [{"from_length": 17, "to_length": 17}], "rank": 45}, {"position": {"node_id": 43239}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 46}, {"position": {"node_id": 43240}, "edit": [{"from_length": 14, "to_length": 14}], "rank": 47}, {"position": {"node_id": 43242}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 48}, {"position": {"node_id": 43243}, "edit": [{"from_length": 8, "to_length": 8}], "rank": 49}, {"position": {"node_id": 43183}, "edit": [{"from_length": 5, "to_length": 5}], "rank": 50}, {"position": {"node_id": 43184}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 51}, {"position": {"node_id": 43186}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 52}, {"position": {"node_id": 43187}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 53}, {"position": {"node_id": 43188}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 54}, {"position": {"node_id": 43189}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 55}, {"position": {"node_id": 43190}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 56}, {"position": {"node_id": 43191}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 57}, {"position": {"node_id": 43193}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 58}, {"position": {"node_id": 43194}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 59}, {"position": {"node_id": 43195}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 60}, {"position": {"node_id": 43196}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 61}, {"position": {"node_id": 43197}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 62}, {"position": {"node_id": 43198}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 63}, {"position": {"node_id": 43199}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 64}, {"position": {"node_id": 43200}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 65}, {"position": {"node_id": 43201}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 66}, {"position": {"node_id": 43202}, "edit": [{"from_length": 31, "to_length": 31}], "rank": 67}, {"position": {"node_id": 43203}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 68}, {"position": {"node_id": 43205}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 69}, {"position": {"node_id": 43206}, "edit": [{"from_length": 32, "to_length": 32}], "rank": 70}, {"position": {"node_id": 43208}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 71}, {"position": {"node_id": 43209}, "edit": [{"from_length": 11, "to_length": 11}], "rank": 72}, {"position": {"node_id": 43210}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 73}, {"position": {"node_id": 43212}, "edit": [{"from_length": 27, "to_length": 27}], "rank": 74}, {"position": {"node_id": 43213}, "edit": [{"from_length": 27, "to_length": 27}], "rank": 75}, {"position": {"node_id": 43214}, "edit": [{"from_length": 27, "to_length": 27}], "rank": 76}, {"position": {"node_id": 43216}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 77}, {"position": {"node_id": 43217}, "edit": [{"from_length": 25, "to_length": 25}], "rank": 78}, {"position": {"node_id": 43218}, "edit": [{"from_length": 25, "to_length": 25}], "rank": 79}, {"position": {"node_id": 43219}, "edit": [{"from_length": 25, "to_length": 25}], "rank": 80}, {"position": {"node_id": 43220}, "edit": [{"from_length": 25, "to_length": 25}], "rank": 81}, {"position": {"node_id": 43222}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 82}, {"position": {"node_id": 43223}, "edit": [{"from_length": 26, "to_length": 26}], "rank": 83}, {"position": {"node_id": 43225}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 84}, {"position": {"node_id": 43226}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 85}, {"position": {"node_id": 43227}, "edit": [{"from_length": 29, "to_length": 29}], "rank": 86}, {"position": {"node_id": 43228}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 87}, {"position": {"node_id": 43230}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 88}, {"position": {"node_id": 43231}, "edit": [{"from_length": 27, "to_length": 27}], "rank": 89}, {"position": {"node_id": 43232}, "edit": [{"from_length": 27, "to_length": 27}], "rank": 90}, {"position": {"node_id": 43233}, "edit": [{"from_length": 27, "to_length": 27}], "rank": 91}, {"position": {"node_id": 43235}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 92}, {"position": {"node_id": 43236}, "edit": [{"from_length": 17, "to_length": 17}], "rank": 93}, {"position": {"node_id": 43237}, "edit": [{"from_length": 17, "to_length": 17}], "rank": 94}, {"position": {"node_id": 43239}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 95}, {"position": {"node_id": 43240}, "edit": [{"from_length": 14, "to_length": 14}], "rank": 96}, {"position": {"node_id": 43242}, "edit": [{"from_length": 1, "to_length": 1}], "rank": 97}, {"position": {"node_id": 43243}, "edit": [{"from_length": 8, "to_length": 8}], "rank": 98}]}
[vg msga] Error: failed to include path
wheaton5 commented 6 years ago

sorry maybe i should put that in a new issue

wheaton5 commented 6 years ago

when you say vg mpmap is slow, exactly how slow? If I just wanted to incorporate a few tens of thousands of long unitigs that map to the linear reference in areas of interest?

glennhickey commented 6 years ago

It's probably worth trying. Will depend on the lengths of your longest unitigs more than anything else. I extrapolated about 100 CPU hours judging from the first few thousand entries from your first_1M_unitigs.mag.gz file. This isn't so bad if you have a lot of cores. But there could be some longer sequences I didn't get to that make it much worse.

On Fri, Feb 9, 2018 at 12:04 PM, Haynes Heaton notifications@github.com wrote:

when you say vg mpmap is slow, exactly how slow? If I just wanted to incorporate a few tens of thousands of long unitigs that map to the linear reference in areas of interest?

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1417#issuecomment-364494514, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7gReMR7x7gb06-XHcJIpqbS041yaks5tTHqmgaJpZM4R3MkZ .

glennhickey commented 6 years ago

Note: the interface is the same for mpmap as map but you need to specify -S to get GAM output.

On Fri, Feb 9, 2018 at 12:11 PM, Glenn Hickey glenn.hickey@gmail.com wrote:

It's probably worth trying. Will depend on the lengths of your longest unitigs more than anything else. I extrapolated about 100 CPU hours judging from the first few thousand entries from your first_1M_unitigs.mag.gz file. This isn't so bad if you have a lot of cores. But there could be some longer sequences I didn't get to that make it much worse.

On Fri, Feb 9, 2018 at 12:04 PM, Haynes Heaton notifications@github.com wrote:

when you say vg mpmap is slow, exactly how slow? If I just wanted to incorporate a few tens of thousands of long unitigs that map to the linear reference in areas of interest?

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1417#issuecomment-364494514, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7gReMR7x7gb06-XHcJIpqbS041yaks5tTHqmgaJpZM4R3MkZ .

jeizenga commented 6 years ago

I did some investigation on why the reads were so slow using mpmap. It looks to me that the long unitigs in that file are completely unrelated to the sequence in the graph. As in, there is no subsequence in them that can be matched to the graph above the length I would expect by chance. That tends to make the optimizations work pretty poorly, especially with very long sequences.

Admittedly, I haven't spent much time making mpmap work well on long sequences. However, I wonder if this might be an artifact of the subsetting you did to make the files smaller? If this is only part of the graph, that would explain a lot.

ekg commented 6 years ago

Thanks to everyone for digging into this, and apologies for missing the thread until now. I need to sort out some kind of notification filtering that my email provider has implemented. That said I've only been able to put time into resolving the problems here in the last few days.

The long read alignment code had gone a bit stale, and needed to be re-evaluated and updated. Unfortunately none of the problems found here were caught otherwise by the tests. The general issue in this thread is that banding (or more correctly, chunked alignment as @jeizenga suggests calling it) was introducing various artifacts into the reads. I've resolved most of these on the "mappering" branch, in this PR #1456, but I have mixed together some testing related to Jordan's MEM finding code from mpmap so it's not ready to merge as I still need to debug that.

The simple solution to the problems @wheaton5 ran into wasn't immediately obvious. It is enough to increase the "band" width in vg map above the length of the unitigs and we won't see these problems anymore, so vg map -w 10000 would probably have fixed most of them.

There were also problems with read naming causing failures with vg msga. This would have been an issue with vg mod -i augmentation, but I can't really see how that would be the case here because the naming problem only arose in #1448.

The main issue was a bug in the merge code that was run after running alignment bands. This would result in the sequence reported in the GAM being too short (the first band's sequence was dropped).

Having resolved this I found that I could re-enable the alignment patching process, in which the unaligned portions of the read were force-aligned into the graph between aligned segments. This improved the overall alignment quality substantially.

Finally, I explored increasing the bandwidth, to something like 2kb. I find this works well on the set that @wheaton5 provided.

I can confirm what @jeizenga finds, that the 1M set mostly contains unitigs that don't match the reference.vg graph. I only see 4% aligning with >70% identity, and most look to be spurious matches. However, in the reduced set the mean read identity with the graph is >90%.

Here is a process that can be used to take the restricted unutig set and the graph they map to to build a combined graph. (I'm basing this off of https://drive.google.com/open?id=1jtMG9kWA9FYKxGJU_GhiKdMaAP9whkV1, which was generated for #1448. If you want to follow along, at present you'll need to pull from the "mappering" branch.)

vg index -x reference.xg reference.vg
vg mod -pl 16 -e 3 reference.vg | vg kmers -k 16 -gB - >reference.kmers
vg index -i reference.kmers -g reference.gcsa -p
cat unitigs.fa | sed 's/> />/' >unitigs.fixed.fa # fix the FASTA format naming error
vg map -d reference -f unitigs.fixed.fa -w 2048 >unitigs.gam
vg mod -i unitigs.gam reference.vg | vg mod -X 32 - | vg mod -cC - >reference+unitigs.vg

I then checked to see if I could remap the unitigs into the graph and find them:

vg index -x reference+unitigs.xg reference+unitigs.vg
vg mod -pl 16 -e 3 reference+unitigs.vg | vg kmers -k 16 -gB - >reference+unitigs.kmers
vg index -i reference+unitigs.kmers -g reference+unitigs.gcsa
vg map -g reference+unitigs.gcsa -x reference+unitigs.xg -f unitigs.fixed.fa -j -t 1 -w 2048 | jq .identity | histogram.py

When using the same bandwidth, I get perfect mapping of every read, as expected. If I have a different bandwidth (say 256, which is default), I see that a few out of every ~10k bases is not mapping exactly to the graph (mean identity is 0.9998). To me this suggests that the process has worked correctly, but maybe there is another metric we'd like to consider.

It's also possible to look at the unitig threadings through the graph to try to debug what's been generated, but it can't easily be done for the whole graph at once. Instead we can examine the graph with respect to each unitig:

vg paths -X reference+unitigs.xg | parallel 'vg find -p {} -x reference+unitigs.xg -c 5 | vg view -dp - | dot -Tpdf -o plots/{}.pdf; echo {}'

Here is a sampling of plots to give a sense of what we can see: w2048.plots.pdf

We might want to remove spurious mappings, although there are only a handful of them. We could filter by identity using something like jq -c 'select(.identity > 0.5)' on the JSON-GAM of the unitig alignments.

In general the graph doesn't look too crazy to me. But any collection of true SVs is going to add strange complexity into it. We are basically executing a kind of assembly step. This whole process has made me wonder if we might do better aligning the linear reference into an assembly graph. This could be considerably more efficient to generate.

CC @richarddurbin, who may be interested in this result as it came up in discussion at @wheaton5 's group meeting on Friday.

richarddurbin commented 6 years ago

Thanks Erik for the update.

It is very good that you have fixed the problems, but a concern that they had accumulated.

There isn’t just one assembly graph here. These unitigs come from fermikit assemblies of multiple samples.

Richard

On 20 Feb 2018, at 18:07, Erik Garrison notifications@github.com wrote:

Thanks to everyone for digging into this, and apologies for missing the thread until now. I need to sort out some kind of notification filtering that my email provider has implemented. That said I've only been able to put time into resolving the problems here in the last few days.

The long read alignment code had gone a bit stale, and needed to be re-evaluated and updated. Unfortunately none of the problems found here were caught otherwise by the tests. The general issue in this thread is that banding (or more correctly, chunked alignment as @jeizenga https://github.com/jeizenga suggests calling it) was introducing various artifacts into the reads. I've resolved most of these on the "mappering" branch, in this PR #1456 https://github.com/vgteam/vg/pull/1456, but I have mixed together some testing related to Jordan's MEM finding code from mpmap so it's not ready to merge as I still need to debug that.

The simple solution to the problems @wheaton5 https://github.com/wheaton5 ran into wasn't immediately obvious. It is enough to increase the "band" width in vg map above the length of the unitigs and we won't see these problems anymore, so vg map -w 10000 would probably have fixed most of them.

There were also problems with read naming causing failures with vg msga. This would have been an issue with vg mod -i augmentation, but I can't really see how that would be the case here because the naming problem only arose in #1448 https://github.com/vgteam/vg/issues/1448.

The main issue was a bug in the merge code that was run after running alignment bands. This would result in the sequence reported in the GAM being too short (the first band's sequence was dropped).

Having resolved this I found that I could re-enable the alignment patching process, in which the unaligned portions of the read were force-aligned into the graph between aligned segments. This improved the overall alignment quality substantially.

Finally, I explored increasing the bandwidth, to something like 2kb. I find this works well on the set that @wheaton5 https://github.com/wheaton5 provided https://drive.google.com/open?id=1DoFu3yRs9Zeuyh1pw3Z1dfLoGPkuW0eV.

I can confirm what @jeizenga https://github.com/jeizenga finds, that the 1M set mostly contains unitigs that don't match the reference.vg graph. I only see 4% aligning with >70% identity, and most look to be spurious matches. However, in the reduced set the mean read identity with the graph is >90%.

Here is a process that can be used to take the restricted unutig set and the graph they map to to build a combined graph. (I'm basing this off of https://drive.google.com/open?id=1DoFu3yRs9Zeuyh1pw3Z1dfLoGPkuW0eV https://drive.google.com/open?id=1DoFu3yRs9Zeuyh1pw3Z1dfLoGPkuW0eV. If you want to follow along, at present you'll need to pull from the "mappering" branch.)

vg index -x reference.xg reference.vg vg mod -pl 16 -e 3 reference.vg | vg kmers -k 16 -gB - >reference.kmers vg index -i reference.kmers -g reference.gcsa -p cat unitigs.fa | sed 's/> />/' >unitigs.fixed.fa # fix the FASTA format naming error vg map -d reference -f unitigs.fixed.fa -w 2048 >unitigs.gam vg mod -i unitigs.gam reference.vg | vg mod -X 32 - | vg mod -cC - >reference+unitigs.vg I then checked to see if I could remap the unitigs into the graph and find them:

vg index -x reference+unitigs.xg reference+unitigs.vg vg mod -pl 16 -e 3 reference+unitigs.vg | vg kmers -k 16 -gB - >reference+unitigs.kmers vg index -i reference+unitigs.kmers -g reference+unitigs.gcsa vg map -g reference+unitigs.gcsa -x reference+unitigs.xg -f unitigs.fixed.fa -j -t 1 -w 2048 | jq .identity | histogram.py When using the same bandwidth, I get perfect mapping of every read, as expected. If I have a different bandwidth (say 256, which is default), I see that a few out of every ~10k bases is not mapping exactly to the graph (mean identity is 0.9998). To me this suggests that the process has worked correctly, but maybe there is another metric we'd like to consider.

It's also possible to look at the unitig threadings through the graph to try to debug what's been generated, but it can't easily be done for the whole graph at once. Instead we can examine the graph with respect to each unitig:

vg paths -X reference+unitigs.xg | parallel 'vg find -p {} -x reference+unitigs.xg -c 5 | vg view -dp - | dot -Tpdf -o plots/{}.pdf; echo {}' Here is a sampling of plots to give a sense of what we can see: w2048.plots.pdf https://github.com/vgteam/vg/files/1741218/w2048.plots.pdf We might want to remove spurious mappings, although there are only a handful of them. We could filter by identity using something like jq -c 'select(.identity > 0.5)' on the JSON-GAM of the unitig alignments.

In general the graph doesn't look too crazy to me. But any collection of true SVs is going to add strange complexity into it. We are basically executing a kind of assembly step. This whole process has made me wonder if we might do better aligning the linear reference into an assembly graph. This could be considerably more efficient to generate.

CC @richarddurbin https://github.com/richarddurbin, who may be interested in this result as it came up in discussion at @wheaton5 https://github.com/wheaton5 's group meeting on Friday.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1417#issuecomment-367066089, or mute the thread https://github.com/notifications/unsubscribe-auth/ADRb5qKU2-mfoeRzMIbN6YmIcK5-Zjtdks5tWwnpgaJpZM4R3MkZ.

-- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.