vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

TargetValueSearch returns infinite distance when there is a path #1953

Open jeizenga opened 6 years ago

jeizenga commented 6 years ago

@xchang1

I'm trying to integrate TVS into my clusterer, but it's not working for me. I have a small-ish reproducing example, which you'll need to run from my optimize-tvs-clustering branch. The command is this (files attached):

vg mpmap -x snp1kg-BRCA1.xg -g snp1kg-BRCA1.gcsa -d snp1kg-BRCA1.dist -s snp1kg-BRCA1.snarls -G read.gam -SBv | vg view -aj -

It will eventually try to measure the distance between 2256+13 and 2261+17 with target value 64. The path-based distance function estimates the distance at 63, but the TVS says it can't find a path.

[screenshot attached: screen shot 2018-10-23 at 1 56 17 pm]

reproducing_data.zip
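For concreteness, here's a minimal sketch of the kind of query I'm asking TVS to answer, written against a toy adjacency-list graph with made-up names rather than vg's actual TargetValueSearch interface: is there a path between two positions whose length is within a tolerance of the target value? In the case above there's a length-63 path against a target of 64, so the search should succeed for any tolerance of 1 or more.

```cpp
// Toy target value query: does some path from `node` to `goal` have a total
// length within `tolerance` of `target`? Plain DFS, fine for a small DAG.
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

using Graph = std::map<int, std::vector<int>>;   // node id -> successors

bool tv_path_exists(const Graph& g, const std::map<int, int64_t>& node_len,
                    int node, int goal, int64_t len_so_far,
                    int64_t target, int64_t tolerance) {
    if (node == goal) {
        int64_t diff = len_so_far - target;
        if (diff < 0) diff = -diff;
        return diff <= tolerance;
    }
    auto it = g.find(node);
    if (it == g.end()) return false;
    for (int next : it->second) {
        if (tv_path_exists(g, node_len, next, goal,
                           len_so_far + node_len.at(next), target, tolerance)) {
            return true;
        }
    }
    return false;
}

int main() {
    // Toy bubble: 1 -> {2, 3} -> 4, where the allele through node 3 gives a
    // 63 bp path between the query positions.
    Graph g = {{1, {2, 3}}, {2, {4}}, {3, {4}}};
    std::map<int, int64_t> node_len = {{1, 10}, {2, 20}, {3, 53}, {4, 10}};

    std::cout << std::boolalpha
              << tv_path_exists(g, node_len, 1, 4, 0, 64, 1) << "\n"    // true: |63 - 64| <= 1
              << tv_path_exists(g, node_len, 1, 4, 0, 64, 0) << "\n";   // false: no exact 64
    return 0;
}
```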

xchang1 commented 6 years ago

It looks like the distance index was built with 20 as the cap on the maximum distance, so it thinks the maximum distance is 20. That used to be the default, but I changed it, so if you re-make the distance index it should work.
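To illustrate what that cap does (toy code, not the actual distance index; the cap value and sentinel are just for the example): anything longer than the cap was never recorded at build time, so a true 63 bp distance comes back looking like "no path".

```cpp
// Toy capped lookup (illustration only): distances above the build-time cap
// are not stored, so the query falls back to a "no path" sentinel.
#include <cstdint>
#include <iostream>

const int64_t kCap = 20;      // hypothetical build-time cap
const int64_t kNoPath = -1;   // sentinel for "not recorded"

int64_t capped_lookup(int64_t true_distance) {
    return true_distance <= kCap ? true_distance : kNoPath;
}

int main() {
    std::cout << capped_lookup(13) << "\n";   // 13: within the cap
    std::cout << capped_lookup(63) << "\n";   // -1: looks like "no path"
    return 0;
}
```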


jeizenga commented 6 years ago

Shouldn't the maximum distance cap only come into play if there are cycles in the graph? This is a snp1kg graph, so it should be a DAG.

jeizenga commented 6 years ago

Also, it looks to be hard-coded in the master branch: https://github.com/vgteam/vg/blob/master/src/subcommand/index_main.cpp#L1137 Did you change the interface in an unmerged branch?

jeizenga commented 6 years ago

Also, I manually changed it to 100, but something is still going on. It gets an estimate of 119, but there's a path of exactly length 64, so shouldn't it just get 64? Moreover, there doesn't seem to be any path of length 119, and the tolerance is 37, so it should only report distances up to 111.

jeizenga commented 6 years ago

Update: the incorrect length is my own bug, my bad.

xchang1 commented 6 years ago

Shoot, sorry, you're right. I fixed that particular problem. There's still something wrong with the maximum distance in cyclic graphs, but it should be fine for DAGs.

xchang1 commented 6 years ago

jk, it should work for cyclic graphs too

jeizenga commented 6 years ago

Wait, I got another one. This time it gets an sdsl out-of-bounds error somewhere in the TVS algorithm. Reproducing files are attached; the command is:

vg mpmap -x snp1kg-BRCA1.xg -g snp1kg-BRCA1.gcsa -d snp1kg-BRCA1.dist -s snp1kg-BRCA1.snarls -f read.fq -SBv -t 1 > /dev/null

[Uploading reproducing_example.zip…]()

jeizenga commented 6 years ago

I don't think the upload worked. Attaching it again: reproducing_example.zip

xchang1 commented 6 years ago

I'm having problems with git right now, but it's a one-line change. Line 2817 is currently

if (next_best.first <= tolerance) {

but it should be

if (next_best.first != -1 && next_best.first <= tolerance) {
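To spell out why that check matters (toy snippet, not the real TVS code; the vector and values are made up): next_best.first is -1 when there is no next candidate, and since -1 <= tolerance is true, the unguarded version keeps going with a candidate that doesn't exist, which is the kind of thing that can then trip an sdsl bounds check downstream.

```cpp
// Toy illustration of the one-line fix: guard the -1 sentinel before the
// tolerance comparison, otherwise a non-existent candidate falls through.
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

int main() {
    std::vector<int64_t> candidate_lengths = {13, 7, 21};  // hypothetical table
    int64_t tolerance = 37;

    // first == -1 means "no next-best candidate"; second would be its index.
    std::pair<int64_t, size_t> next_best = {-1, 0};

    // Buggy: (next_best.first <= tolerance) is true for -1, so we'd proceed
    // with a candidate that doesn't exist. Fixed: check the sentinel first.
    if (next_best.first != -1 && next_best.first <= tolerance) {
        std::cout << candidate_lengths.at(next_best.second) << "\n";
    } else {
        std::cout << "no candidate within tolerance\n";
    }
    return 0;
}
```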


jeizenga commented 6 years ago

Okie, I'll throw it into my branch. Thanks!

jeizenga commented 6 years ago

New fun. I have it working now with the edit you sent, but I'm having performance issues using TVS for paired-end reads.

The paired-end resolution code requires much higher values for the tolerance, and it seems like the combinatorial difficulty of the problem is really biting us in the ass with these parameters. I have a few instances of TVS that take on the order of 10 seconds to execute (already too long), and one that has been running for so long that it may have hit an infinite loop (but I expect it's just NP-completeness rearing its head).
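As a rough illustration of the blow-up (a toy model, not the TVS implementation): in a chain of biallelic bubbles with allele lengths 0 and 1, the number of distinct paths whose total length lands within the tolerance window around the target grows very quickly as the tolerance widens, so a search that has to distinguish paths by exact length has a lot more to explore.

```cpp
// Count paths through n biallelic bubbles (allele lengths 0 and 1) whose total
// length lands within tol of target. Simple DP over (bubble, length); the
// resulting counts are just binomial coefficients summed over the window.
#include <cstdint>
#include <iostream>
#include <vector>

uint64_t paths_within(int n, int target, int tol) {
    std::vector<uint64_t> counts(n + 1, 0);
    counts[0] = 1;  // one way to have length 0 before any bubble
    for (int i = 0; i < n; ++i) {
        std::vector<uint64_t> next(n + 1, 0);
        for (int len = 0; len <= i; ++len) {
            next[len]     += counts[len];  // take the length-0 allele
            next[len + 1] += counts[len];  // take the length-1 allele
        }
        counts.swap(next);
    }
    uint64_t total = 0;
    for (int len = 0; len <= n; ++len) {
        if (len >= target - tol && len <= target + tol) total += counts[len];
    }
    return total;
}

int main() {
    // 30 bubbles, target length 15: widening the tolerance rapidly increases
    // how many candidate paths fall inside the window.
    for (int tol : {0, 2, 5, 10}) {
        std::cout << "tol=" << tol << "  paths=" << paths_within(30, 15, tol) << "\n";
    }
    return 0;
}
```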

Anyway, I'm going to re-read the paper and see if I can spot any optimizations. Tomorrow we should try to go over the implementation if you have time.

xchang1 commented 6 years ago

Hooray. I have an optimization implemented that should speed things up a bit. My git repo is all tangled up though; I can untangle it and PR it once I finish my homework.