Open fbemm opened 5 years ago
The problem is that paired-end mapping is not supported efficiently in graphs with complex topology. Unfortunately, there is no obvious solution to this. Maybe we can attempt a few different approaches here.
I suggest mapping single-ended. Does this work for you?
This could perhaps be resolved with a new kind of path-free positional index.
Another way would be to bail out in clustering if too many paired-end candidates are found.
A final approach would be to map single-ended and try to clean up the result in cases like this.
In any case, given that your graph was initially a DBG, mapping paired-end may not be very informative.
At the very least, I should add a warning that this pattern is occurring.
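The clustering bail-out idea could be sketched as a simple guard. This is a toy Python illustration with hypothetical names, not vg's actual clustering code:

```python
# Toy sketch of "bail out in clustering if too many paired-end candidates
# are found". All names here are hypothetical, not vg's real internals.
MAX_PAIR_CANDIDATES = 1000  # arbitrary cap, for illustration only

def pair_clusters(clusters1, clusters2, max_candidates=MAX_PAIR_CANDIDATES):
    """Enumerate candidate cluster pairings for the two read ends,
    giving up (returning None) when the quadratic candidate space is
    too large, so the caller can fall back to single-ended mapping."""
    if len(clusters1) * len(clusters2) > max_candidates:
        return None
    return [(c1, c2) for c1 in clusters1 for c2 in clusters2]
```

In a repeat-dense graph each end can seed thousands of clusters, so capping the cross product bounds the work at the cost of occasionally dropping pair information.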
On Thu, Apr 25, 2019, 09:34 Felix Bemm notifications@github.com wrote:
Please describe:
- What you were trying to do
I am trying to map paired end short reads to a pangenome graph consisting of 11 50Mb genomes. The graph has the following specs after chopping:
nodes 4977340
edges 5561744
cyclic
- What you wanted to happen
I wanted to map 10k read pairs to the graph and create a GAM file.
- What actually happened
vg map runs forever on all threads specified and starts consuming more than 512Gb.
- What data and command line to use to make the problem recur, if applicable
vg view -F -g pangenome.gfa > pangenome.vg
vg mod -X 32 pangenome.vg > pangenome.mod.vg
vg index -x pangenome.xg pangenome.mod.vg
vg prune pangenome.mod.vg > pangenome.mod.pruned.vg
vg index -g pangenome.gcsa pangenome.mod.pruned.vg -b tmp/
vg map -x pangenome.xg -g pangenome.gcsa -t 16 -f P1.fastq -f P2.fastq > map.gam
The graph is created from a SibeliaZ GFA without removing blocks that show duplications and without removing blocks that contain the same sequence (which likely causes the graph to be cyclic). What looks a bit odd is the number of components (173439) and the number of subgraphs (9). Most of the components actually have only a single node.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2232, or mute the thread https://github.com/notifications/unsubscribe-auth/AABDQEK7LPTS2LLVBRTSR5TPSFNG5ANCNFSM4HIKO2OA .
I should clarify that the problem may also be that the rescue code will pick up huge regions of the graph to try to rescue, and then align the read to these. I have not profiled this application in a while.
If this is happening, you may be able to improve things slightly by using the --xdrop-align parameter. I'm curious if this makes any difference.
FWIW, you can track progress by writing JSON or using the --debug flag on VG map. This should clarify what is happening.
If you can share this test case, I would be happy to take a quick look.
> In any case, given that your graph was a DBG initially, it may not be very informative to map paired end.
With DBG, do you mean a de Bruijn graph? I am using a SibeliaZ version that does not have overlaps anymore, so it should be a string graph.
Single end mapping worked.
real 1m24.019s
user 7m3.273s
sys 0m46.564s
Would it help to supply a fragment model, or to limit mate rescue with a lower -u?
--xdrop-alignment is causing vg to use more than 850Gb of memory.
You can try. I would test different options and look at the debug log to see where it gets stuck.
My guess is that every rescue attempt in a repeat is getting confused in some way.
How long are the input reads to the string graph? They are whole genomes right? Are you able to visualize the graph with bandage? That can give some perspective on what is happening.
Input to the string graph are fully assembled genomes with N50s between 100kb and 2Mb. Visualization works locally but not on the entire graph, since there are quite a few SVs and SNPs; at least my 16Gb of memory is not able to load everything. I can try to run this on a bigger node, though. Running the mapping in debug mode returns mappings, slower as expected, at roughly 100 reads per minute. I am attaching the first part. I am going to run it for a while and check whether there are specific reads causing a greater issue.
"1 1 MEMs covering 90 151 @ cluster1: |1622302+:15,1,AAGACCTGTAAAAAGTTGCTAGACGTGCTTATTGTTAATTACACGTATAACACGAATAAATACAACTATCCCTTACTTAATTTAGTTAGT cluster2: |1516235-:24,2,CGAGACGCCAAACATGTGATAGCAAACGGACGGAGAGAAAGGAAAAGAAGACTAAGGAATACGCACGTGATATTGGACGAGGTGAAGGAGGGGAGTATTGTAAAGGGCTACGCCAGAGAGGCATGACCTACAGGTC ACATGGCAGTGAATC"
Looks like there are some troubling regions very early.
One other solution would be the deduplication of the MAF file that is converted to GFA and vg later.
Intermediate update:
1) I tried different fragment models, primarily 750:350:150:0:1 (with and without updating).
That did not really impact the mapping speed.
2) I tweaked mapping parameters towards bwa mem-like behaviour (vg map -D -j -k 19 -w 100 -M 50 -c 500 -t 16).
Mapping is faster now, but 10k PE reads still need several hours @ 16 threads.
@fbemm, if you unpack the GAM into JSON with vg view -a, you can see the amount of time consumed per read. If you plot this, I hope you'll find a group of outlier reads that are taking most of the time. Maybe we can figure out how to break the pair-related process that makes them take so long.
If it's happening for everything, then it might have something to do with the clustering model.
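As a sketch of that analysis: vg view -a emits one JSON alignment object per line. The per-record timing field name used below, time_used, is an assumption about the annotation, so check a sample record first:

```python
import json

def slowest_reads(json_lines, top_n=10):
    """Given GAM-as-JSON lines (one alignment object per line, as
    produced by `vg view -a`), return the top_n (time, name) pairs,
    slowest first. Assumes a numeric 'time_used' field per record."""
    times = []
    for line in json_lines:
        aln = json.loads(line)
        times.append((float(aln.get("time_used", 0.0)), aln.get("name", "?")))
    times.sort(reverse=True)
    return times[:top_n]
```

Plotting the full list of times (e.g. as a histogram) should show whether a small group of outlier reads dominates the runtime, as suggested above.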
@ekg I went back to my graph construction and created an acyclic version, added unaligned sequences via construct and ids. PE mapping works like a charm now. Augment and call are running now. It seems the troublemaker was the cyclic graph that I fed into vg, not the actual sequence content, since that is still the same now.
Out of curiosity, how did you build the acyclic graph?
On Fri, May 10, 2019 at 10:37 AM Felix Bemm notifications@github.com wrote:
Closed #2232 https://github.com/vgteam/vg/issues/2232.
And this makes it seem like the distance estimation, or the graph unrolling in the case of paired-end mapping, is the culprit.
The main framework is/was reveal (https://github.com/jasperlinthorst/reveal) with some adaptations. Basically: first, projection of the assemblies onto a common reference, then chromosome-wise graph generation, refinement, and then vg. Augment works for a single sample. Are you guys doing the augmentation sample by sample (taking sample relations into account), or can that be done all in one step if one does not want to include the paths from the alignments?
Right, reveal will produce a DAG. I'm not quite sure I follow what you mean about augmentation. What is the workflow wrt. augmentation?
Completely overlooked that call works only on a per sample basis. Is that correct?
That's right. It models a single diploid sample.
@ekg I just realized that the following command runs but does not actually do what it aims for:
vg map -j -p -x pg.xg -g pg.gcsa -t 32 -f R1.fastq R2.fastq > pg.json
Shouldn't there be some sort of warning to avoid this? Instead it should be:
vg map -j -p -x pg.xg -g pg.gcsa -t 32 -f R1.fastq -f R2.fastq > pg.json
Right? Sadly, with the latter vg map does not report any PE mappings again.
You don't get any paired end mappings to the graph?
On Tue, Jun 25, 2019, 04:14 Felix Bemm notifications@github.com wrote:
Reopened #2232 https://github.com/vgteam/vg/issues/2232.
What I found is that if I directly output BAM from vg map, the paired-end reads are "paired" correctly in the BAM:
vg map -d indexes/NA12878.all -f 11.fastq -f 22.fastq --surject-to sam
However, if I output GAM and then surject to BAM, they are not handled correctly; instead, they are "singled" in the BAM. For example:
vg map -d indexes/NA12878.all -f 11.fastq -f 22.fastq > out.gam
vg surject out.gam -x indexes/NA12878.all.xg -b
I am using vg: variation graph tool, version v1.30.0-4-gfddf5a0 "Carentino"
I think the problem is that vg surject needs the -i flag to produce paired BAMs.
Adding -i did not work for me.
How are you determining that the reads aren't paired?
I figured out what had happened to me. I had directly piped the GAM to vg gamsort and surjected the sorted GAM. The -i flag complained about the adjacent reads having different IDs. Thanks for the solution @jeizenga.
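For reference, the layout that vg surject -i expects can be checked with a small script over the GAM as JSON. The convention assumed here, that mates share a read name up to an optional /1 or /2 suffix, is mine and should be verified against your data:

```python
import json
import re

def base_name(name):
    # Strip an optional /1 or /2 mate suffix (naming convention assumed).
    return re.sub(r"/[12]$", "", name)

def is_interleaved(json_lines):
    """True if consecutive GAM-as-JSON records form mate pairs, judged
    by read name. Sorting the GAM (e.g. with vg gamsort) destroys this
    layout, which is what the -i complaint above was pointing at."""
    names = [base_name(json.loads(line).get("name", "")) for line in json_lines]
    return len(names) % 2 == 0 and all(
        names[i] == names[i + 1] for i in range(0, len(names), 2))
```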