
Paired end mapping performance on a string graph #2232

Open fbemm opened 5 years ago

fbemm commented 5 years ago

Please describe:

  1. What you were trying to do

I am trying to map paired-end short reads to a pangenome graph consisting of eleven 50Mb genomes. The graph has the following specs after chopping:

nodes   4977340
edges   5561744
cyclic
  2. What you wanted to happen

I wanted to map 10k read pairs to the graph and create a GAM file.

  3. What actually happened

vg map runs forever on all specified threads and starts consuming more than 512GB of memory.

  4. What data and command line to use to make the problem recur, if applicable
vg view -F -g pangenome.gfa > pangenome.vg
vg mod -X 32 pangenome.vg > pangenome.mod.vg
vg index -x pangenome.xg pangenome.mod.vg
vg prune pangenome.mod.vg > pangenome.mod.pruned.vg
vg index -g pangenome.gcsa pangenome.mod.pruned.vg -b tmp/
vg map -x pangenome.xg -g pangenome.gcsa -t 16 -f P1.fastq -f P2.fastq > map.gam

The graph was created from a SibeliaZ GFA without removing blocks that show duplications and without removing blocks that contain the same sequence (which likely causes the graph to be cyclic). What looks a bit odd is the number of components (173439) and the number of subgraphs (9). Most of the components actually have only a single node.

ekg commented 5 years ago

The problem is that paired-end mapping is not supported efficiently in graphs with complex topology. Unfortunately, there is no obvious solution to this. Maybe we can attempt a few different approaches here.

I suggest mapping single-ended. Does this work for you?
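
Something along these lines should work, reusing the indexes from the commands above (the output name is just a placeholder):

vg map -x pangenome.xg -g pangenome.gcsa -t 16 -f P1.fastq > single_end.gam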

This might be resolved with a new kind of path-free positional index.

Another way would be to bail out of clustering if too many paired-end candidates are found.

A final approach would be to map single-ended and try to clean up the result in cases like this.

In any case, given that your graph was a DBG initially, it may not be very informative to map paired-end.

At the very least, I should add a warning when this pattern is occurring.

ekg commented 5 years ago

I should clarify that the problem may also be that the rescue code will pick up huge regions of the graph to try to rescue, and then align the read to these. I have not profiled this application in a while.

If this is happening, you may be able to improve things slightly by using the --xdrop-alignment parameter. I'm curious whether this makes any difference.

FWIW, you can track progress by writing JSON or using the --debug flag on vg map. This should clarify what is happening.
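
For example, something like the following; the flag spellings here are taken from this thread and should be double-checked against vg map --help for your build (--debug writes its per-read progress to stderr):

vg map -x pangenome.xg -g pangenome.gcsa -t 16 -f P1.fastq -f P2.fastq --xdrop-alignment --debug > map.gam 2> map.log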

If you can share this test case, I would be happy to take a quick look.

fbemm commented 5 years ago

In any case, given that your graph was a DBG initially, it may not be very informative to map paired end.

By DBG, do you mean a de Bruijn graph? I am using a SibeliaZ version that no longer has overlaps, so it should be a string graph.

Single end mapping worked.

real    1m24.019s
user    7m3.273s
sys     0m46.564s

Would it help to supply a fragment model, or to limit mate rescue by lowering -u?

--xdrop-alignment causes vg to use more than 850GB of memory.

ekg commented 5 years ago

You can try. I would test different options and look at the debug log to see when it stops getting stuck.

My guess is that every rescue attempt in a repeat is getting confused in some way.

How long are the input reads to the string graph? They are whole genomes, right? Are you able to visualize the graph with Bandage? That can give some perspective on what is happening.

fbemm commented 5 years ago

Input to the string graph are fully assembled genomes with N50s between 100kb and 2Mb. Visualization works locally, but not on the entire graph, since there are quite a few SVs and SNPs; at least my 16GB of memory cannot load everything. I can try to run this on a bigger node, though. Running the mapping in debug mode returns mappings, slower as expected, at roughly 100 reads per minute. I am attaching the first part. I'm going to run it for a while and check whether specific reads are causing a bigger issue.

debug.zip

fbemm commented 5 years ago

"1 1 MEMs covering 90 151 @ cluster1: |1622302+:15,1,AAGACCTGTAAAAAGTTGCTAGACGTGCTTATTGTTAATTACACGTATAACACGAATAAATACAACTATCCCTTACTTAATTTAGTTAGT cluster2: |1516235-:24,2,CGAGACGCCAAACATGTGATAGCAAACGGACGGAGAGAAAGGAAAAGAAGACTAAGGAATACGCACGTGATATTGGACGAGGTGAAGGAGGGGAGTATTGTAAAGGGCTACGCCAGAGAGGCATGACCTACAGGTC ACATGGCAGTGAATC"

Looks like there are some troubling regions very early.

fbemm commented 5 years ago

Another solution would be to deduplicate the MAF file that is later converted to GFA and then loaded into vg.

fbemm commented 5 years ago

Intermediate update:

1) I tried different fragment models, primarily 750:350:150:0:1 (with and without online updating); see the sketch below this list for how such a model is passed to vg map.

That did not really impact the mapping speed.

2) I tweaked the mapping parameters towards bwa mem-like behaviour (vg map -D -j -k 19 -w 100 -M 50 -c 500 -t 16)

Mapping is faster now, but 10k PE reads still need several hours on 16 threads.
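
For reference, a fragment model like the one above can be supplied roughly as follows; the -I/--fragment spec (max:mean:stdev:orientation:direction) and -U/--fixed-frag-model spellings are from memory of vg map's help and should be confirmed with vg map --help for your version (drop -U to let the model keep updating online):

vg map -x pangenome.xg -g pangenome.gcsa -t 16 -I 750:350:150:0:1 -U -f P1.fastq -f P2.fastq > map.gam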

ekg commented 5 years ago

@fbemm, if you unpack the GAM into JSON with vg view -a, you can see the amount of time consumed per read. If you plot this, I hope you'll find a group of outlier reads that are taking most of the time. Maybe we can figure out how to break the pair-related process that makes them take so long.
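
A quick way to pull that out, assuming the per-read timing shows up as a time_used field in the JSON (worth checking one record first), is something like:

vg view -a map.gam | jq -r '[.name, .time_used] | @tsv' | sort -k2,2gr | head

That lists the slowest reads first, so any outlier group should be obvious even without plotting.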

If it's happening for everything, then it might have something to do with the clustering model.

fbemm commented 5 years ago

@ekg I went back to my graph construction and created an acyclic version, then added the unaligned sequences via construct and ids. PE mapping works like a charm now. Augment and call are running now. It seems the troublemaker was the cyclic graph that I fed into vg, not the actual sequence content, since that is still the same.
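
Only as a sketch with placeholder file names (not the exact commands used here), the construct/ids step can look like:

vg construct -r unaligned.fa > unaligned.vg
vg ids -j pangenome.acyclic.vg unaligned.vg
vg index -x pangenome.acyclic.xg pangenome.acyclic.vg unaligned.vg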

ekg commented 5 years ago

Out of curiosity, how did you build the acyclic graph?

ekg commented 5 years ago

And this makes it seem like the distance estimation, or the graph unrolling done for paired-end mapping, is the culprit.

fbemm commented 5 years ago

The main framework is/was reveal (https://github.com/jasperlinthorst/reveal) with some adaptations. Basically: first projection of the assemblies onto a common reference, then chromosome-wise graph generation, refinement, and then vg. Augment works for a single sample. Are you doing the augmentation sample by sample (taking sample relations into account), or can it be done in one step if one does not want to include the paths from the alignments?

ekg commented 5 years ago

Right, reveal will produce a DAG. I'm not quite sure I follow what you mean about augmentation. What is the workflow with respect to augmentation?

fbemm commented 5 years ago

I completely overlooked that call works only on a per-sample basis. Is that correct?

ekg commented 5 years ago

That's right. It models a single diploid sample.

fbemm commented 5 years ago

@ekg I just realized that the following command runs but does not actually do what it is meant to:

vg map -j -p -x pg.xg -g pg.gcsa -t 32 -f R1.fastq R2.fastq > pg.json

Shouldn't there be some sort of warning to avoid this?

Instead it should be:

vg map -j -p -x pg.xg -g pg.gcsa -t 32 -f R1.fastq -f R2.fastq > pg.json

Right? Sadly, with the latter, vg map again does not report any PE mappings.

ekg commented 5 years ago

You don't get any paired end mappings to the graph?

ruolin commented 3 years ago

What I found is that if I directly output BAM from vg map, the paired-end reads are "paired" correctly in the BAM:

vg map -d indexes/NA12878.all -f 11.fastq -f 22.fastq --surject-to sam

However, if I output a GAM and then surject it to BAM, the reads are not handled correctly; instead they end up "single" in the BAM. For example:

vg map -d indexes/NA12878.all -f 11.fastq -f 22.fastq  > out.gam
vg surject out.gam -x indexes/NA12878.all.xg -b 

I am using vg: variation graph tool, version v1.30.0-4-gfddf5a0 "Carentino"

jeizenga commented 3 years ago

I think the problem is that vg surject needs the -i flag to produce paired BAMs.
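
For example, with the files from above (-i tells surject that the GAM contains interleaved pairs):

vg surject -i -x indexes/NA12878.all.xg -b out.gam > out.bam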

ruolin commented 3 years ago

Adding -i did not work for me.

jeizenga commented 3 years ago

How are you determining that the reads aren't paired?

ruolin commented 3 years ago

I figured out what had happened. I had piped the GAM directly to vg gamsort and then surjected the sorted GAM, so -i complained about adjacent reads having different IDs. Thanks for the solution, @jeizenga.
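
For anyone landing here later, the order that avoids the problem is presumably to surject the unsorted, interleaved GAM first and only sort afterwards, e.g. (samtools shown just as one way to sort the resulting BAM):

vg map -d indexes/NA12878.all -f 11.fastq -f 22.fastq > out.gam
vg surject -i -x indexes/NA12878.all.xg -b out.gam > out.bam
samtools sort -o out.sorted.bam out.bam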