vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Polaris SVs as a graph #3704

Closed goranrakocevic closed 10 months ago

goranrakocevic commented 2 years ago

Hi All,

First off apologies for posting this here, but the Biostars link appears dead.

Has anyone tried using Illumina’s Polaris SV set as a graph? We are able to build and index the graph, but mapping is very slow: for a smaller subset of the FASTQ (roughly the first 5M reads) it goes fairly fast (~3K reads/s), but at some point it just bogs down, and judging by the size of the output there is very little progress for hours.

Looking at the VCF it doesn’t seem like there are any very dense regions, and the total number of variants is ~40K (after removing anything that isn’t PASS in the FILTER column), but I’m not sure what kinds of variants are difficult for VG to handle, so any pointers would be helpful.

The VCF is not phased, but the variants should mostly be far enough apart that this shouldn’t matter much.
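For context, this is roughly the kind of pipeline in question (a sketch only, assuming bcftools for the PASS filter and the vg autoindex/Giraffe workflow; the filenames are placeholders, not the exact commands used):

# Keep only PASS variants from the Polaris SV VCF (filenames are placeholders)
bcftools view -f PASS polaris_sv.vcf.gz -Oz -o polaris_sv.pass.vcf.gz
tabix -p vcf polaris_sv.pass.vcf.gz

# Build and index the graph for Giraffe
vg autoindex --workflow giraffe -r GRCh38.fa -v polaris_sv.pass.vcf.gz -p polaris

# Map paired-end reads with multiple threads
vg giraffe -Z polaris.giraffe.gbz -m polaris.min -d polaris.dist -f reads_1.fastq.gz -f reads_2.fastq.gz -t 16 -p > mapped.gam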

jeizenga commented 2 years ago

For starters, which of the mapping subcommands are you using? Also, are you mapping with a single thread or with multiple threads?

goranrakocevic commented 2 years ago

Forgot to say:

goranrakocevic commented 2 years ago

A small update, in case it helps: if I reduce --hard-hit-cap to 100 the problem seems to go away (i.e. mapping is very fast). I tried a few other parameters; this is the only one it seems to be sensitive to.
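For reference, that change amounts to something like the following (a sketch with placeholder filenames, just showing where --hard-hit-cap goes on the command line):

vg giraffe -Z polaris.giraffe.gbz -m polaris.min -d polaris.dist -f reads_1.fastq.gz -f reads_2.fastq.gz --hard-hit-cap 100 -t 16 -p > mapped.gam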

jeizenga commented 2 years ago

I suspect the slowdown is caused by a problem on a single read. It's very possible that changing the computational parameters avoids hitting whatever extremely slow or infinite loop is responsible. If I'm right, we'd expect to see thread utilization drop down to just a few busy threads toward the end.
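One rough way to check that while the mapping is running (a sketch using standard Linux tools; the pattern passed to pgrep is an assumption about the command line):

top -H -p "$(pgrep -f 'vg giraffe' | head -n1)"   # per-thread view; only a few busy threads late in the run would support this theory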

If it is a single read, it would definitely help us diagnose the problem if we had access to the input data. Are you able to share it?

goranrakocevic commented 2 years ago

Yes, the fastqs are public (1KG data), this sample: https://www.ebi.ac.uk/ena/browser/view/ERR3239276?show=reads

It doesn't seem to be a single read, because writes to the output GAM slow down (I didn't check the code, but I assume the output thread would still be writing alignments from the other threads).

Another thought: should I be masking out any sections of the reference? From the description of --hard-hit-cap it sounds like it's related to repeats.

jeizenga commented 2 years ago

Okay, that's interesting. I'll shop it around the VG team.

And, no, masking shouldn't be necessary. You are correct that the hard hit cap is intended to handle repeats. Essentially, it functions to avoid expending effort aligning to sequences that are so repetitive that we don't really have any chance of mapping correctly anyway with a short read.

adamnovak commented 2 years ago

It sounds like there are a few reads that are slowing it down. When the file is big enough, each thread eventually encounters one of these nearly-impossible reads and gets stuck.

We could hook up the watchdog system to Giraffe to detect and report when threads seem to be taking an inordinately long time to map particular reads, and let us identify the reads.

Another approach is, if any of these reads do eventually get mapped, to look up the reads that are taking the longest to map in the partial output file. To do this, we would map to GAM, and then run it through something like:

vg view -aj output.gam | jq -r '[.time_used, .name] | @tsv' | sort -n | tail -n50

That would give us the mapping times in seconds and the names of the reads that are taking the longest to map, but do eventually finish. Then we could investigate if anything is unusual about the sequences or eventual mapping locations of those reads.
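From there, one possible follow-up (a sketch; the read name is a placeholder, and seqkit is an assumed extra tool, not part of vg):

# Dump the full GAM record(s) for one suspect read
vg view -aj output.gam | jq -c 'select(.name == "READ_NAME_OF_INTEREST")'

# Or collect the slowest read names and pull their sequences out of the FASTQ
vg view -aj output.gam | jq -r '[.time_used, .name] | @tsv' | sort -n | tail -n50 | cut -f2 > slow_reads.txt
seqkit grep -n -f slow_reads.txt reads_1.fastq.gz > slow_reads_1.fastq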

goranrakocevic commented 2 years ago

They do map, they just take a long time. Is the time saved by default (i.e. can we do this on any GAM)?

adamnovak commented 2 years ago

I looked at the code, and the time_used field is set by Giraffe whether or not you use --track-provenance or --track-correctness. So it should be present in any Giraffe GAM file.
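(A quick way to confirm on any existing Giraffe GAM, reusing the same vg view / jq approach as above:)

vg view -aj output.gam | head -n 5 | jq -r '.time_used'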

goranrakocevic commented 2 years ago

Hi All,

Sorry for the long silence.

So far it doesn't look like there are just a few reads that are very slow to map, but rather a larger number of reads that are somewhat slow to map (that said, this is from a subset taken from the beginning of the FASTQ file, though far enough in that the slowdown shows up in the writes to the output).

While most reads map in ~1e-4 seconds, there is a bunch that take ~0.3 s (example below).

{"annotation": {"fragment_length": 406, "fragment_length_distribution": "-I 422.645263 -D 97.748476", "mapq_applied_cap": 303.67512390409416, "mapq_explored_cap": 69.147046527023292, "mapq_score_group": 2147483647, "mapq_uncapped": 2147483647, "proper_pair": true, "secondary_scores": [319.98951889597208, 294.98951889597208, 294.98951889597208, 284.98951889597208]}, "fragment_next": {"name": "ERR3239306.2213273"}, "identity": 1.0, "mapping_quality": 60, "name": "ERR3239306.2213273", "path": {"mapping": [{"edit": [{"from_length": 13, "to_length": 13}], "position": {"node_id": "525012", "offset": "19"}, "rank": "1"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"node_id": "525013"}, "rank": "2"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"node_id": "525014"}, "rank": "3"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"node_id": "525015"}, "rank": "4"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"node_id": "525016"}, "rank": "5"}, {"edit": [{"from_length": 9, "to_length": 9}], "position": {"node_id": "525017"}, "rank": "6"}]}, "quality": "Hh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eCh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4e", "score": 160, "sequence": "TGTGCCTATAGGTCCTCCCTGTGGCAATGACATCTCTCAGCTCAGTAAGGGCCATTTGCAGTAGGAATATGACCCTAACCAGAAGACTCAGTGGATCCTTATCACCTTCATAGAAAGGTACTCACCATCCATGTCAAGAGCCCAGCCAAC", "time_used": 0.28015876499999998} {"annotation": {"fragment_length": 406, "fragment_length_distribution": "-I 422.645263 -D 97.748476", "mapq_applied_cap": 303.67512390409416, "mapq_explored_cap": 82.69051542502379, "mapq_score_group": 2147483647, "mapq_uncapped": 2147483647, "proper_pair": true, "secondary_scores": [319.98951889597208, 294.98951889597208, 294.98951889597208, 284.98951889597208]}, "fragment_prev": {"name": "ERR3239306.2213273"}, "identity": 1.0, "mapping_quality": 60, "name": "ERR3239306.2213273", "path": {"mapping": [{"edit": [{"from_length": 9, "to_length": 9}], "position": {"is_reverse": true, "node_id": "525025", "offset": "23"}, "rank": "1"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"is_reverse": true, "node_id": "525024"}, "rank": "2"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"is_reverse": true, "node_id": "525023"}, "rank": "3"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"is_reverse": true, "node_id": "525022"}, "rank": "4"}, {"edit": [{"from_length": 32, "to_length": 32}], "position": {"is_reverse": true, "node_id": "525021"}, "rank": "5"}, {"edit": [{"from_length": 13, "to_length": 13}], "position": {"is_reverse": true, "node_id": "525020"}, "rank": "6"}]}, "quality": "Hh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHhQeHh4eHh4eHh4eHh4eHh4eHh4eHh4e", "score": 160, "sequence": "GTCCTTTTTCTTTTCAAACTCTTCCTTATGTTAGCCATGAAATCTAGCTGGGGCTGTGTGGTTTCTGATTCCCCCTGGCTTATTCTTTACTTTTTCCCACTGTTCCAGGCTCAGCAGGGAGCTGCTGGATGAGAAAGGGCCTGAAGTCTT", "time_used": 0.28015862000000002}
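For what it's worth, a rough way to get that kind of breakdown (a sketch that buckets time_used to one significant figure; GNU sort's -g option is assumed for the scientific-notation sort, and this is not necessarily how the numbers above were produced):

vg view -aj output.gam | jq -r '.time_used' | awk '{ printf "%.0e\n", $1 }' | sort | uniq -c | sort -k2,2g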

akorhone commented 1 year ago

There is a new feature (https://github.com/vgteam/vg/pull/3736) pending integration into the main codebase; it should speed up these slow alignments.

I'll update this thread when it has been added to a release.