vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 194 forks source link

pacbio alignment quality issues #679

Open shilpagarg opened 7 years ago

shilpagarg commented 7 years ago

pacbio-alignment-issue.zip Here are some examples with true and predicted PacBio alignments.

For true alignments, I used

vg sim -a -s 1337 -e .2 -n 15 -x SK1+Y12.bwa_X.chrI.xg -l 10000 > sim-pacbio-reads.gam

with different seeds.

and for vg alignments, I used

vg map -G sim-pacbio-reads.gam -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa -k 22 > pred-sim-pacbio-reads.gam

looks this option does not work at all......... you can see the attached files...

Therefore, I tried the following and outputs are attached:

vg map -r <( vg sim -s 1337 -e .2 -n 15 -x SK1+Y12.bwa_X.chrI.xg -l 10000) -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa -k 22 > pred-sim-pacbio-reads.x.gam

In them, they have the issue of reverse complement, although we simulated them from DAG.

and here is the vg graph (DAG) from freebayes VCF file and its indexing.

https://transfer.sh/nA8LE/sk1-y12.bwa-x.chri.vg
vg index -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa -k 8 -X 2  SK1+Y12.bwa_X.chrI.vg
ekg commented 7 years ago

To understand what is happening, we need to generate SK1+Y12.bwa_X.chrI.xg from the source .vg file.

Could you verify that the index you simulated from is the same one you are mapping against?

Would you please link the .vg file here?

shilpagarg commented 7 years ago

https://transfer.sh/nA8LE/sk1-y12.bwa-x.chri.vg

ekg commented 7 years ago

Is this how you generated the indexes?

vg index -x sk1-y12.bwa-x.chri.xg sk1-y12.bwa-x.chri.vg 
vg index -g sk1-y12.bwa-x.chri.gcsa -k 8 -X 2 -p sk1-y12.bwa-x.chri.vg
shilpagarg commented 7 years ago
vg index -x SK1+Y12.bwa_X.xg -g SK1+Y12.bwa_X.gcsa -k 8 -X 2
sk1-y12.bwa-x.chri.vg
ekg commented 7 years ago

I don't see the problem:

-> % vg map -G sim-pacbio-reads.gam -x sk1-y12.bwa-x.chri.xg -g sk1-y12.bwa-x.chri.gcsa --compare -u 1 -B 256
bbfd2fb212f3f68a        0.453383        0.4247  0       0
1c437cf4f2f5bb13        0.5032  0.4538  0       0
c8ebf88a03ae5111        0.479875        0.4423  0       0
b289d7c6e0094c63        0.521197        0.4564  0       0
bb5d129b8d632de9        0.522319        0.4658  0       0
d0afcdedddc33eeb        0.403837        0.3699  0       0
571daff0643bf224        0.38494 0.3563  0       0
a143f5b45b64d612        0.397287        0.3732  0       0
ea796ed9d234181c        0.396946        0.3779  0       0
....

These alignments overlap with the ground truth by around 40%.

ekg commented 7 years ago

Here, if the second column is 0 or close to it, we interpret the alignment as failed.

shilpagarg commented 7 years ago

Are you sure, this number is pretty good? I think 'is_reverse' thing has to be fixed. In DAG, we simulate in one orientation or I am missing something?

adamnovak commented 7 years ago

No, vg sim simulates from both strands, even in a DAG. It picks an orientation independently for each read.

ekg commented 7 years ago

I think it's normal because the alignments have 30% error. If they have 0% error then we expect something that is nearly 1. For example:

-> % vg sim -s 1337 -n 10 -l 10000 -x sk1-y12.bwa-x.chri.xg -a >perfect.gam                
erik@tappy [06:43:21 PM] [~/vg.8/test/pacbio-alignment-issue] [master *]
-> % vg map -G perfect.gam -x sk1-y12.bwa-x.chri.xg -g sk1-y12.bwa-x.chri.gcsa --compare -u 1 -B 256 
bbfd2fb212f3f68a        1       1       10000   0
2029084b84661bb7        0.92862 0.9967  0       0
339508f97860372d        1       1       10000   0
9181350a7fcbed5c        0.92604 1       0       0
9a74dc88cf9a4aff        1       1       10000   0
cae0d9e8aa3bc0d9        1       1       10000   0
daeed2917cf47a1f        1       1       10000   0
9f196f9fc4e21698        1       1       10000   0
1dae75115614f201        1       1       10000   0
d914e14036d42f09        1       1       10000   0
ekg commented 7 years ago

@adamnovak is correct, vg sim is sampling both strands. There is no option to simulate only from the forward strand. You can just filter the reversing reads out later if you'd like.

@shilpagarg please look at a single simulated alignment and its realignment. You can use vg view -dA to do this, or you can step through it by pretty-printing with jq and looking at each mapping in turn. You may find something that is problematic and only occurs in the case of high error rates. If you do we can look at that.

ekg commented 7 years ago

When we have >30% error it doesn't mean that we expect 70% alignment correspondence. Added errors will cause alignment failures (so bands become unaligned but anchored by their neighbors) or different alignment solutions than the "truth".

As far as I understand, this is all as we'd expect. If you can show a case where the alignment is radically wrong (but still has 50% exact overlap with the truth, as this sim shows) then we can reopen this issue.

ekg commented 7 years ago

Here's something interesting. It looks like we're failing a band with the shorter bandwidth:

2029084b84661bb7        0.92862 0.9967  0       0

But if I remap with different parameters, we get this alignment almost perfectly correct:

-> % vg map -G perfect.gam -x sk1-y12.bwa-x.chri.xg -g sk1-y12.bwa-x.chri.gcsa --compare -u 10 -B 1024
bbfd2fb212f3f68a        1       1       10000   0
2029084b84661bb7        0.9996  1       10000   0
339508f97860372d        1       1       10000   0
9181350a7fcbed5c        1       1       10000   0
9a74dc88cf9a4aff        1       1       10000   0
cae0d9e8aa3bc0d9        1       1       10000   0
daeed2917cf47a1f        1       1       10000   0
9f196f9fc4e21698        1       1       10000   0
1dae75115614f201        1       1       10000   0
d914e14036d42f09        1       1       10000   0
ekg commented 7 years ago

It looks like there is definitely a time/accuracy tradeoff that's occurring even with a change in bandwidth. Bigger bands will do better, as will more attempted multimaps.

ekg commented 7 years ago

Let's use this thread to discuss different parameters to vg map.

shilpagarg commented 7 years ago

So in this example (sim-pacbio-reads.1.gam) vg map -G sim-pacbio-reads.1.gam -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa --compare -u 1 -B 256

There is one read 'b4c6e9ffc8c19908' which is of bit less quality. Do you also think so?

b4c6e9ffc8c19908        0.1649  0.388   0       0

vg map -G sim-pacbio-reads.3.gam -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa --compare -u 1 -B 256

9e5ba732e6cc75a0        0.171882        0.4092  0       0
ekg commented 7 years ago

What about getting both alignments and the graph they touch using vg find, editing the graph, and viewing the result with the tube map viewer?

shilpagarg commented 7 years ago

@ekg For the first example, I generated json for true and predicted alignment as attached. true_pred_ex.zip

How can I convert json to GAM? I am using

vg view -J pred1.json -G > pred1.gam

I get

terminate called after throwing an instance of 'j2pb_error' what(): Unknown field: sequence Aborted

Moreover, I don't see how can I give GAM or json alignments as input here http://www.wolfib.com/sequenceTubeMap/

I pasted Json in custom data, nothing actually happens.

Does it work for you?

shilpagarg commented 7 years ago

@adamnovak I tried this:

vg find -G sim-pacbio-reads.1.gam -x SK1+Y12.bwa_X.chrI.xg > sim-pacbio-reads.1.subgraph.vg
vg view -d sim-pacbio-reads.1.subgraph.vg | dot -Tsvg - -o sim-pacbio-reads.1.subgraph.svg
vg view -a sim-pacbio-reads.1.gam > sim-pacbio-reads.1.gam.json

In this subgraph, I could see one node (101824, 42033, 41861) in particular at the start and ends, which are not present in the json alignment, but present in the subgraph. Or I am missing something?

PS: The data files are in this thread: pacbio-alignment-issue.zip

shilpagarg commented 7 years ago

@ekg I have BWA PacBio alignments to linear genome and GAM PacBio alignments to variation graph from VCF file. Is there any automated way in vg to compare the GAM alignments compared to BAMs, maybe by looking into the sequences?

shilpagarg commented 7 years ago

@jeizenga I am working with vg snarls and interested to get snarl traversals using

vg snarls -l -t -r SK1+Y12.bwa_X.chrI.trans SK1+Y12.bwa_X.chrI.vg > SK1+Y12.bwa_X.chrI.snarls

https://transfer.sh/nA8LE/sk1-y12.bwa-x.chri.vg

I found some traversals which looks like this:

{"snarl": {"start": {"node_id": 9092}, "end": {"node_id": 9094}}}
{"snarl": {"start": {"node_id": 9068}, "end": {"node_id": 9070}}}

I used -l and -t, so there should be atleast one visit right?

Here is the traversal output: https://transfer.sh/ZgRbI/sk1-y12.bwa-x.chri.trans

@glennhickey how can I sort traversals according to DAG vg graph? NodeIDs can be random, may not be in some defined order. Is this feature already implemented?

ekg commented 7 years ago

You can compare the alignments using scripts in the scripts directory. See map-sim for an outline. It will need to be changed to work for pacbio.

I don't think it is necessary to do this though because it appears that the alignment is working.

By the way you convert from JSON to GAM using vg view -JaG -. We should make a less confusing parameter for that.

ekg commented 7 years ago

I should have said: you can compare the bwa and vg alignments using that script.

shilpagarg commented 7 years ago

@ekg @adamnovak Here the reads that I am interested to align to vg graph, all these reads are unique: https://transfer.sh/wGAe6/yeast-pacbio.chri.x.fastq

Then I align reads to vg using

vg map -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa -f yeast_pacbio.chrI.x.fastq > yeast_pacbio.chrI.gam
Then sort them.
vg view -a yeast_pacbio.chrI.sorted.gam | jq -c '.name' | grep "m150910_220412_00127_c100822732550000001823176011031537_s1_p0/96667/0_11798"

Why there are multiple alignments corresponding to single read,although I don't use flag -K?

ekg commented 7 years ago

How many reads go in and how many come out?

On Wed, Mar 1, 2017, 10:47 AM shilpagarg notifications@github.com wrote:

@ekg https://github.com/ekg @adamnovak https://github.com/adamnovak Here the reads that I am interested to align to vg graph, all these reads are unique: https://transfer.sh/wGAe6/yeast-pacbio.chri.x.fastq

Then I align reads to vg using

vg map -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa -f yeast_pacbio.chrI.x.fastq > yeast_pacbio.chrI.gam Then sort them. vg view -a yeast_pacbio.chrI.sorted.gam | jq -c '.name' | grep "m150910_220412_00127_c100822732550000001823176011031537_s1_p0/96667/0_11798"

Why there are multiple alignments corresponding to single read,although I don't use flag -K?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/679#issuecomment-283293592, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EfTswD1tGAa85prpLT_3Llj-R7zDks5rhT6ZgaJpZM4MI85z .

shilpagarg commented 7 years ago

I think the reason is the clean up. Please make sure in the output that it does not get concatenated with the previous results.

shilpagarg commented 7 years ago

As you suggested, I tried:

# consists of poor quality alignment from sim-pacbio-reads.1.gam
cat pred1.gam test1.gam > both.gam 
vg find -G both.gam -x SK1+Y12.bwa_X.chrI.xg > sub_test.vg
vg mod -i both.gam sub_test.vg > mod.vg

I get this error:

vg: src/vg.cpp:5062: void vg::VG::add_nodes_and_edges(const vg::Path&, const std::map<std::tuple<long int, bool, long unsigned int>, vg::Node*>&, std::map<std::pair<std::tuple<long int, bool, long unsigned int>, std::basic_string<char> >, std::vector<vg::Node*> >&, std::map<vg::Node*, vg::Path>&, const std::map<long int, long unsigned int>&, std::set<vg::NodeSide>&, size_t): Assertion `!paths.has_path(path.name())' failed.
Aborted

Therefore, I tried:

vg view -d -A both.gam SK1+Y12.bwa_X.chrI.xg > both.dot

And here is attached the SVG file for visualization for true and vg alignment for read "b4c6e9ffc8c19908 both.svg.zip "

shilpagarg commented 7 years ago

Next, I aligned the reads https://transfer.sh/wGAe6/yeast-pacbio.chri.x.fastq to DAG (from freebayes vcf) using

vg map -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa -f yeast_pacbio.chrI.x.fastq > yeast_pacbio.chrI.gam

Below are some alignments in which the alignment traverses the same snarl again. Or it is expected? pacbio_dag_issue.zip

and their corresponding BWA alignments: bwa-alns.zip

shilpagarg commented 7 years ago

@ekg Please find attached some test cases for pacbio alignments where overlap (second column) is atleast 40% in case you are interested to add in vg.

profile looks as: vg sim -a -s ${x} -e .2 -n 1 -x SK1+Y12.bwa_X.chrI.xg -l 10000 > sim-pacbio-reads.${x}.gam vg map -G sim-pacbio-reads.${x}.gam -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa --compare

Following reads have a bit of poor quality:

sim-pacbio-reads.45.gam f9ca55c5b8ac8260 0.229091 0.3544 0 0

sim-pacbio-reads.77.gam b4c6e9ffc8c19908 0.145549 0.3501 0 0 test-cases.zip

ekg commented 7 years ago

The mod failure is because the alignments have the same name. Vg mod won't include two paths of the same name.

Using sed and jq you can modify one so it has a different name.

Maybe we can handle this differently.

On Thu, Mar 2, 2017, 8:10 AM shilpagarg notifications@github.com wrote:

@ekg https://github.com/ekg Please find attached some test cases for pacbio alignments where overlap (second column) is atleast 40% in case you are interested to add in vg.

profile looks as: vg sim -a -s ${x} -e .2 -n 1 -x SK1+Y12.bwa_X.chrI.xg -l 10000 > sim-pacbio-reads.${x}.gam vg map -G sim-pacbio-reads.${x}.gam -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa --compare

Following reads have a bit of poor quality:

sim-pacbio-reads.45.gam f9ca55c5b8ac8260 0.229091 0.3544 0 0

sim-pacbio-reads.77.gam b4c6e9ffc8c19908 0.145549 0.3501 0 0 test-cases.zip https://github.com/vgteam/vg/files/812979/test-cases.zip

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/679#issuecomment-283574816, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4Ec-3jHrSWZE_VyQEfma7FlArIlmAks5rhmtjgaJpZM4MI85z .

ekg commented 7 years ago

Also you can try one alignment at a time. This may also be easier to visualize.

On Thu, Mar 2, 2017, 9:03 AM Erik Garrison erik.garrison@gmail.com wrote:

The mod failure is because the alignments have the same name. Vg mod won't include two paths of the same name.

Using sed and jq you can modify one so it has a different name.

Maybe we can handle this differently.

On Thu, Mar 2, 2017, 8:10 AM shilpagarg notifications@github.com wrote:

@ekg https://github.com/ekg Please find attached some test cases for pacbio alignments where overlap (second column) is atleast 40% in case you are interested to add in vg.

profile looks as: vg sim -a -s ${x} -e .2 -n 1 -x SK1+Y12.bwa_X.chrI.xg -l 10000 > sim-pacbio-reads.${x}.gam vg map -G sim-pacbio-reads.${x}.gam -x SK1+Y12.bwa_X.chrI.xg -g SK1+Y12.bwa_X.chrI.gcsa --compare

Following reads have a bit of poor quality:

sim-pacbio-reads.45.gam f9ca55c5b8ac8260 0.229091 0.3544 0 0

sim-pacbio-reads.77.gam b4c6e9ffc8c19908 0.145549 0.3501 0 0 test-cases.zip https://github.com/vgteam/vg/files/812979/test-cases.zip

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/679#issuecomment-283574816, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4Ec-3jHrSWZE_VyQEfma7FlArIlmAks5rhmtjgaJpZM4MI85z .

shilpagarg commented 7 years ago

@ekg I am facing several issues to visualize two alignments: true and predicted on graph touched by alignments only.

vg find -G test_sim-pacbio-reads.45.gam -x SK1+Y12.bwa_X.chrI.xg > test_sim-pacbio-reads.45.vg
vg mod -i test_sim-pacbio-reads.45.gam test_sim-pacbio-reads.45.vg > test_sim-pacbio-reads.45.mod.vg
vg view -d -A test_sim-pacbio-reads.45.gam test_sim-pacbio-reads.45.mod.vg -S > test_sim-pacbio-reads.45.dot
neato -Tsvg test_sim-pacbio-reads.45.dot -o test_sim-pacbio-reads.45.svg

It hangs up at the last command, but does not give any out of memory issue.

test_sim-pacbio-reads.45.gam.zip

shilpagarg commented 7 years ago

@ekg Here is the visualization on base graph without augmenting PacBio alignments for example: sim-pacbio-reads.45.gam test_sim-pacbio-reads.45.svg.zip

ekg commented 7 years ago

image

This is a screenshot of what you sent me. This is the correct way to do this. You don't want to augment the graph to display the alignments. Augmenting the graph with the alignments turns them into paths in the graph and destructively modifies the graph so that the coordinate space no longer matches that of the alignments.

The red nodes indicate parts of the alignment that have very little match with the underlying graph. These could comprise most of the sequence in the alignment. It's not possible to tell what is going on with this visualization. You will at very least need to add in the true alignment. That should make it very clear what is going on.

Have you looked at the JSON of the alignment to see what is going on in these places?

shilpagarg commented 7 years ago

FYI: the true alignment is f9ca55c5b8ac8260`` and I took the example I last pointed out.