vgteam / vg

tools for working with genome variation graphs

MSGA produces worse results on repeat-masked data than non-repeat-masked data #632

Open adamnovak opened 7 years ago

adamnovak commented 7 years ago

Ian has some sequences that he's aligning with MSGA. MSGA crashes on them when the MEM threader is on, but it runs through when the MEM threader is off. However, when the sequences have repetitive elements replaced by Ns, the alignment becomes terrible, with many nonsense structural variants.

I think what's happening is that the index doesn't know about N, and so it's returning hits containing Ns as evidence that part of a sequence ought to align to part of another sequence. We should throw out hits involving (too many?) Ns so we don't align runs of Ns in the input to runs of Ns in the genome.
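To make the "throw out hits involving too many Ns" idea concrete, here is a rough Python sketch. It is not vg code: the `(query_offset, matched_sequence)` hit layout, the function names, and the 10% threshold are all assumptions made up for illustration, and the real fix would have to live wherever MSGA collects its MEM hits.

```python
def n_fraction(seq: str) -> float:
    """Return the fraction of bases in seq that are N (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def filter_hits(hits, max_n_fraction=0.1):
    """Keep only hits whose matched sequence has at most max_n_fraction Ns.

    `hits` is a list of (query_offset, matched_sequence) tuples. This layout
    is hypothetical and is not vg's internal MEM representation.
    """
    return [hit for hit in hits if n_fraction(hit[1]) <= max_n_fraction]


# Example hits (made-up data, not taken from the attached sequences).
hits = [
    (0, "ACGTACGTAC"),      # no Ns: kept
    (10, "NNNNNNNNNN"),     # all Ns: dropped
    (20, "ACGTNACGTACGT"),  # 1 N in 13 bases, under the threshold: kept
]
kept = filter_hits(hits)
```

A fractional threshold rather than "any N at all" would let a hit spanning a single ambiguous base survive while still rejecting hits that are mostly masked sequence. Where the cutoff should sit is an open question.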

ekg commented 7 years ago

Could you share these sequences and a minimal example producing the crash?


adamnovak commented 7 years ago

Here's an example of sequences that produce a bad graph with lots of weird structure between N nodes. It looks like MSGA decides to align some Ns but not others, and then normalization can't make much headway.

vg msga -f combined_v2.fa.masked.txt > combined.vg

combined_v2.fa.masked.txt

Look around nodes 332 and 338.

It looks like the MEM threader isn't crashing as much; I'll see if I can dig up something and make another issue.

adamnovak commented 7 years ago

The crash isn't MEM-threader-specific, but I can induce one (via an exception); see #637.