Open shilpagarg opened 7 years ago
To understand what is happening, we need to generate SK1+Y12.bwa_X.chrI.xg
from the source .vg file.
Could you verify that the index you simulated from is the same one you are mapping against?
Would you please link the .vg file here?
Is this how you generated the indexes?
vg index -x sk1-y12.bwa-x.chri.xg sk1-y12.bwa-x.chri.vg
vg index -g sk1-y12.bwa-x.chri.gcsa -k 8 -X 2 -p sk1-y12.bwa-x.chri.vg
vg index -x SK1+Y12.bwa_X.xg -g SK1+Y12.bwa_X.gcsa -k 8 -X 2
sk1-y12.bwa-x.chri.vg
I don't see the problem:
-> % vg map -G sim-pacbio-reads.gam -x sk1-y12.bwa-x.chri.xg -g sk1-y12.bwa-x.chri.gcsa --compare -u 1 -B 256
bbfd2fb212f3f68a 0.453383 0.4247 0 0
1c437cf4f2f5bb13 0.5032 0.4538 0 0
c8ebf88a03ae5111 0.479875 0.4423 0 0
b289d7c6e0094c63 0.521197 0.4564 0 0
bb5d129b8d632de9 0.522319 0.4658 0 0
d0afcdedddc33eeb 0.403837 0.3699 0 0
571daff0643bf224 0.38494 0.3563 0 0
a143f5b45b64d612 0.397287 0.3732 0 0
ea796ed9d234181c 0.396946 0.3779 0 0
....
These alignments overlap with the ground truth by around 40%.
Here, if the second column is 0
or close to it, we interpret the alignment as failed.
Are you sure, this number is pretty good? I think 'is_reverse' thing has to be fixed. In DAG, we simulate in one orientation or I am missing something?
No, vg sim simulates from both strands, even in a DAG. It picks an orientation independently for each read.
I think it's normal because the alignments have 30% error. If they have 0% error then we expect something that is nearly 1. For example:
-> % vg sim -s 1337 -n 10 -l 10000 -x sk1-y12.bwa-x.chri.xg -a >perfect.gam
erik@tappy [06:43:21 PM] [~/vg.8/test/pacbio-alignment-issue] [master *]
-> % vg map -G perfect.gam -x sk1-y12.bwa-x.chri.xg -g sk1-y12.bwa-x.chri.gcsa --compare -u 1 -B 256
bbfd2fb212f3f68a 1 1 10000 0
2029084b84661bb7 0.92862 0.9967 0 0
339508f97860372d 1 1 10000 0
9181350a7fcbed5c 0.92604 1 0 0
9a74dc88cf9a4aff 1 1 10000 0
cae0d9e8aa3bc0d9 1 1 10000 0
daeed2917cf47a1f 1 1 10000 0
9f196f9fc4e21698 1 1 10000 0
1dae75115614f201 1 1 10000 0
d914e14036d42f09 1 1 10000 0
@adamnovak is correct, vg sim
is sampling both strands. There is no option to simulate only from the forward strand. You can just filter the reversing reads out later if you'd like.
@shilpagarg please look at a single simulated alignment and its realignment. You can use vg view -dA
to do this, or you can step through it by pretty-printing with jq and looking at each mapping in turn. You may find something that is problematic and only occurs in the case of high error rates. If you do we can look at that.
When we have >30% error it doesn't mean that we expect 70% alignment correspondence. Added errors will cause alignment failures (so bands become unaligned but anchored by their neighbors) or different alignment solutions than the "truth".
As far as I understand, this is all as we'd expect. If you can show a case where the alignment is radically wrong (but still has 50% exact overlap with the truth, as this sim shows) then we can reopen this issue.
Here's something interesting. It looks like we're failing a band with the shorter bandwidth:
2029084b84661bb7 0.92862 0.9967 0 0
But if I remap with different parameters, we get this alignment almost perfectly correct:
-> % vg map -G perfect.gam -x sk1-y12.bwa-x.chri.xg -g sk1-y12.bwa-x.chri.gcsa --compare -u 10 -B 1024
bbfd2fb212f3f68a 1 1 10000 0
2029084b84661bb7 0.9996 1 10000 0
339508f97860372d 1 1 10000 0
9181350a7fcbed5c 1 1 10000 0
9a74dc88cf9a4aff 1 1 10000 0
cae0d9e8aa3bc0d9 1 1 10000 0
daeed2917cf47a1f 1 1 10000 0
9f196f9fc4e21698 1 1 10000 0
1dae75115614f201 1 1 10000 0
d914e14036d42f09 1 1 10000 0
It looks like there is definitely a time/accuracy tradeoff that's occurring even with a change in bandwidth. Bigger bands will do better, as will more attempted multimaps.
Let's use this thread to discuss different parameters to vg map.
So in this example (sim-pacbio-reads.1.gam) vg map -G sim-pacbio-reads.1.gam -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa --compare -u 1 -B 256
There is one read 'b4c6e9ffc8c19908' which is of bit less quality. Do you also think so?
b4c6e9ffc8c19908 0.1649 0.388 0 0
vg map -G sim-pacbio-reads.3.gam -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa --compare -u 1 -B 256
9e5ba732e6cc75a0 0.171882 0.4092 0 0
What about getting both alignments and the graph they touch using vg find
, editing the graph, and viewing the result with the tube map viewer?
@ekg For the first example, I generated json for true and predicted alignment as attached. true_pred_ex.zip
How can I convert json to GAM? I am using
vg view -J pred1.json -G > pred1.gam
I get
terminate called after throwing an instance of 'j2pb_error' what(): Unknown field: sequence Aborted
Moreover, I don't see how can I give GAM or json alignments as input here http://www.wolfib.com/sequenceTubeMap/
I pasted Json in custom data, nothing actually happens.
Does it work for you?
@adamnovak I tried this:
vg find -G sim-pacbio-reads.1.gam -x SK1+Y12.bwa_X.chrI.xg > sim-pacbio-reads.1.subgraph.vg
vg view -d sim-pacbio-reads.1.subgraph.vg | dot -Tsvg - -o sim-pacbio-reads.1.subgraph.svg
vg view -a sim-pacbio-reads.1.gam > sim-pacbio-reads.1.gam.json
In this subgraph, I could see one node (101824, 42033, 41861) in particular at the start and ends, which are not present in the json alignment, but present in the subgraph. Or I am missing something?
PS: The data files are in this thread: pacbio-alignment-issue.zip
@ekg I have BWA PacBio alignments to linear genome and GAM PacBio alignments to variation graph from VCF file. Is there any automated way in vg to compare the GAM alignments compared to BAMs, maybe by looking into the sequences?
@jeizenga I am working with vg snarls and interested to get snarl traversals using
vg snarls -l -t -r SK1+Y12.bwa_X.chrI.trans SK1+Y12.bwa_X.chrI.vg > SK1+Y12.bwa_X.chrI.snarls
https://transfer.sh/nA8LE/sk1-y12.bwa-x.chri.vg
I found some traversals which looks like this:
{"snarl": {"start": {"node_id": 9092}, "end": {"node_id": 9094}}}
{"snarl": {"start": {"node_id": 9068}, "end": {"node_id": 9070}}}
I used -l and -t, so there should be atleast one visit right?
Here is the traversal output: https://transfer.sh/ZgRbI/sk1-y12.bwa-x.chri.trans
@glennhickey how can I sort traversals according to DAG vg graph? NodeIDs can be random, may not be in some defined order. Is this feature already implemented?
You can compare the alignments using scripts in the scripts directory. See map-sim for an outline. It will need to be changed to work for pacbio.
I don't think it is necessary to do this though because it appears that the alignment is working.
By the way you convert from JSON to GAM using vg view -JaG -
. We should make a less confusing parameter for that.
I should have said: you can compare the bwa and vg alignments using that script.
@ekg @adamnovak Here the reads that I am interested to align to vg graph, all these reads are unique: https://transfer.sh/wGAe6/yeast-pacbio.chri.x.fastq
Then I align reads to vg using
vg map -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa -f yeast_pacbio.chrI.x.fastq > yeast_pacbio.chrI.gam
Then sort them.
vg view -a yeast_pacbio.chrI.sorted.gam | jq -c '.name' | grep "m150910_220412_00127_c100822732550000001823176011031537_s1_p0/96667/0_11798"
Why there are multiple alignments corresponding to single read,although I don't use flag -K?
How many reads go in and how many come out?
On Wed, Mar 1, 2017, 10:47 AM shilpagarg notifications@github.com wrote:
@ekg https://github.com/ekg @adamnovak https://github.com/adamnovak Here the reads that I am interested to align to vg graph, all these reads are unique: https://transfer.sh/wGAe6/yeast-pacbio.chri.x.fastq
Then I align reads to vg using
vg map -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa -f yeast_pacbio.chrI.x.fastq > yeast_pacbio.chrI.gam Then sort them. vg view -a yeast_pacbio.chrI.sorted.gam | jq -c '.name' | grep "m150910_220412_00127_c100822732550000001823176011031537_s1_p0/96667/0_11798"
Why there are multiple alignments corresponding to single read,although I don't use flag -K?
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/679#issuecomment-283293592, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EfTswD1tGAa85prpLT_3Llj-R7zDks5rhT6ZgaJpZM4MI85z .
I think the reason is the clean up. Please make sure in the output that it does not get concatenated with the previous results.
As you suggested, I tried:
# consists of poor quality alignment from sim-pacbio-reads.1.gam
cat pred1.gam test1.gam > both.gam
vg find -G both.gam -x SK1+Y12.bwa_X.chrI.xg > sub_test.vg
vg mod -i both.gam sub_test.vg > mod.vg
I get this error:
vg: src/vg.cpp:5062: void vg::VG::add_nodes_and_edges(const vg::Path&, const std::map<std::tuple<long int, bool, long unsigned int>, vg::Node*>&, std::map<std::pair<std::tuple<long int, bool, long unsigned int>, std::basic_string<char> >, std::vector<vg::Node*> >&, std::map<vg::Node*, vg::Path>&, const std::map<long int, long unsigned int>&, std::set<vg::NodeSide>&, size_t): Assertion `!paths.has_path(path.name())' failed.
Aborted
Therefore, I tried:
vg view -d -A both.gam SK1+Y12.bwa_X.chrI.xg > both.dot
And here is attached the SVG file for visualization for true and vg alignment for read "b4c6e9ffc8c19908 both.svg.zip "
Next, I aligned the reads https://transfer.sh/wGAe6/yeast-pacbio.chri.x.fastq
to DAG (from freebayes vcf)
using
vg map -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa -f yeast_pacbio.chrI.x.fastq > yeast_pacbio.chrI.gam
Below are some alignments in which the alignment traverses the same snarl again. Or it is expected? pacbio_dag_issue.zip
and their corresponding BWA alignments: bwa-alns.zip
@ekg Please find attached some test cases for pacbio alignments where overlap (second column) is atleast 40% in case you are interested to add in vg.
profile looks as: vg sim -a -s ${x} -e .2 -n 1 -x SK1+Y12.bwa_X.chrI.xg -l 10000 > sim-pacbio-reads.${x}.gam vg map -G sim-pacbio-reads.${x}.gam -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa --compare
Following reads have a bit of poor quality:
sim-pacbio-reads.45.gam
f9ca55c5b8ac8260 0.229091 0.3544 0 0
sim-pacbio-reads.77.gam
b4c6e9ffc8c19908 0.145549 0.3501 0 0
test-cases.zip
The mod failure is because the alignments have the same name. Vg mod won't include two paths of the same name.
Using sed and jq you can modify one so it has a different name.
Maybe we can handle this differently.
On Thu, Mar 2, 2017, 8:10 AM shilpagarg notifications@github.com wrote:
@ekg https://github.com/ekg Please find attached some test cases for pacbio alignments where overlap (second column) is atleast 40% in case you are interested to add in vg.
profile looks as: vg sim -a -s ${x} -e .2 -n 1 -x SK1+Y12.bwa_X.chrI.xg -l 10000 > sim-pacbio-reads.${x}.gam vg map -G sim-pacbio-reads.${x}.gam -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa --compare
Following reads have a bit of poor quality:
sim-pacbio-reads.45.gam f9ca55c5b8ac8260 0.229091 0.3544 0 0
sim-pacbio-reads.77.gam b4c6e9ffc8c19908 0.145549 0.3501 0 0 test-cases.zip https://github.com/vgteam/vg/files/812979/test-cases.zip
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/679#issuecomment-283574816, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4Ec-3jHrSWZE_VyQEfma7FlArIlmAks5rhmtjgaJpZM4MI85z .
Also you can try one alignment at a time. This may also be easier to visualize.
On Thu, Mar 2, 2017, 9:03 AM Erik Garrison erik.garrison@gmail.com wrote:
The mod failure is because the alignments have the same name. Vg mod won't include two paths of the same name.
Using sed and jq you can modify one so it has a different name.
Maybe we can handle this differently.
On Thu, Mar 2, 2017, 8:10 AM shilpagarg notifications@github.com wrote:
@ekg https://github.com/ekg Please find attached some test cases for pacbio alignments where overlap (second column) is atleast 40% in case you are interested to add in vg.
profile looks as: vg sim -a -s ${x} -e .2 -n 1 -x SK1+Y12.bwa_X.chrI.xg -l 10000 > sim-pacbio-reads.${x}.gam vg map -G sim-pacbio-reads.${x}.gam -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa --compare
Following reads have a bit of poor quality:
sim-pacbio-reads.45.gam f9ca55c5b8ac8260 0.229091 0.3544 0 0
sim-pacbio-reads.77.gam b4c6e9ffc8c19908 0.145549 0.3501 0 0 test-cases.zip https://github.com/vgteam/vg/files/812979/test-cases.zip
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/679#issuecomment-283574816, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4Ec-3jHrSWZE_VyQEfma7FlArIlmAks5rhmtjgaJpZM4MI85z .
@ekg I am facing several issues to visualize two alignments: true and predicted on graph touched by alignments only.
vg find -G test_sim-pacbio-reads.45.gam -x SK1+Y12.bwa_X.chrI.xg > test_sim-pacbio-reads.45.vg
vg mod -i test_sim-pacbio-reads.45.gam test_sim-pacbio-reads.45.vg > test_sim-pacbio-reads.45.mod.vg
vg view -d -A test_sim-pacbio-reads.45.gam test_sim-pacbio-reads.45.mod.vg -S > test_sim-pacbio-reads.45.dot
neato -Tsvg test_sim-pacbio-reads.45.dot -o test_sim-pacbio-reads.45.svg
It hangs up at the last command, but does not give any out of memory issue.
@ekg Here is the visualization on base graph without augmenting PacBio alignments for example: sim-pacbio-reads.45.gam test_sim-pacbio-reads.45.svg.zip
This is a screenshot of what you sent me. This is the correct way to do this. You don't want to augment the graph to display the alignments. Augmenting the graph with the alignments turns them into paths in the graph and destructively modifies the graph so that the coordinate space no longer matches that of the alignments.
The red nodes indicate parts of the alignment that have very little match with the underlying graph. These could comprise most of the sequence in the alignment. It's not possible to tell what is going on with this visualization. You will at very least need to add in the true alignment. That should make it very clear what is going on.
Have you looked at the JSON of the alignment to see what is going on in these places?
FYI: the true alignment is f9ca55c5b8ac8260`` and I took the example I last pointed out.
pacbio-alignment-issue.zip Here are some examples with true and predicted PacBio alignments.
For true alignments, I used
with different seeds.
and for vg alignments, I used
looks this option does not work at all......... you can see the attached files...
Therefore, I tried the following and outputs are attached:
In them, they have the issue of reverse complement, although we simulated them from DAG.
and here is the vg graph (DAG) from freebayes VCF file and its indexing.