maickrau / GraphAligner

MIT License

Simulated pacbio reads vg graph alignment problem #13

Open liaoherui opened 4 years ago

liaoherui commented 4 years ago

First of all, thanks for your wonderful tool! It's really helpful for my research!

I built a vg graph from a collection of virus strain genomes (~10,000 bp per genome, some of them very similar) and simulated error-free PacBio reads (1500 bp per read) from one of the genomes. I aligned these reads to the graph with GraphAligner (1.0.9) and found that the best alignment of some reads may be wrong, because these alignments do not contain the nodes of the path that corresponds to the reference genome the reads were simulated from.

In other words, the default best alignment of these reads does not cover the full 1500 bp region of the simulated reference (one path in the graph). So I picked one read and output all of its alignments to see whether any of them covers the whole 1500 bp region of the reference. However, among all the alignments, the longest one that aligns to the reference region is only 1368 bp.
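The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the alignments are available as GAF records (tab-separated, with the read length, read start, read end, and an oriented node path such as `>1<2` in columns 2, 3, 4, and 6) and that the reference path is given as a GFA `P`-line step list such as `1+,2-,3+`. The function names and the example record are hypothetical and are not part of GraphAligner or vg.

```python
import re

def gfa_path_steps(segment_names):
    """Convert a GFA P-line step list such as "1+,2-,3+" into (orientation, node) pairs."""
    steps = []
    for step in segment_names.split(","):
        node, sign = step[:-1], step[-1]
        steps.append((">" if sign == "+" else "<", node))
    return steps

def gaf_path_steps(path):
    """Convert a GAF path such as ">1<2>3" into (orientation, node) pairs."""
    return re.findall(r"([<>])([^<>]+)", path)

def reverse_walk(steps):
    """Return the same walk traversed on the opposite strand."""
    flip = {">": "<", "<": ">"}
    return [(flip[orient], node) for orient, node in reversed(steps)]

def is_contiguous_sublist(small, big):
    """True if `small` occurs as a consecutive run inside `big`."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))

def check_alignment(gaf_line, ref_steps):
    """Report how much of the read is aligned and whether the walk stays on the reference path."""
    fields = gaf_line.rstrip("\n").split("\t")
    read_len, read_start, read_end = int(fields[1]), int(fields[2]), int(fields[3])
    steps = gaf_path_steps(fields[5])
    on_ref = (is_contiguous_sublist(steps, ref_steps)
              or is_contiguous_sublist(steps, reverse_walk(ref_steps)))
    aligned = read_end - read_start
    return {
        "read": fields[0],
        "aligned_bp": aligned,
        "full_length": aligned == read_len,
        "on_reference_path": on_ref,
    }

# Hypothetical reference path and GAF record (made-up values for illustration).
ref = gfa_path_steps("1+,2-,3+")
record = "read1\t1500\t0\t1368\t+\t>1<2\t2000\t10\t1378\t1368\t1368\t60"
print(check_alignment(record, ref))
```

The reverse walk is checked as well because a read simulated from the opposite strand would traverse the same nodes in reverse order with flipped orientations, which matters when the reference path itself contains `-` steps.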

In theory, the read should align to the reference path in the graph over the whole 1500 bp, but I cannot get that result. I also tried the different modes 'Mum' and 'Mem'. Neither of them outputs the 1500 bp alignment.

Is this a limitation of GraphAligner, or even a bug? I would be really grateful if you could suggest some possible reasons. (By the way, the path containing the simulated reference genome has '-' nodes. I wonder whether the problem is related to this.)

liaoherui commented 4 years ago

I have tried GraphAligner (1.0.10) and the problem still happens. It is somewhat similar to issue4, but it is not caused by an indel: the read is actually a substring of the graph, and GraphAligner just does not give the perfect 1500 bp alignment to the subpath. To simplify the question, I drew a picture that describes it in more detail (image: the_problem).

maickrau commented 4 years ago

Hi, could you please upload the graph and the read?

liaoherui commented 4 years ago

Thanks for your reply! I have a small, simple example with only 2 paths in the variation graph and 65 reads simulated from one of the paths (Test_5797). The graph and simulated reads:

Graph_and_Reads.zip

In my case, 2 of the reads do not get a perfect 1500 bp alignment.

maickrau commented 4 years ago

Thanks, fixed in 6bd39cb. There was an issue with minimizers sometimes reporting slightly wrong positions in the sequence, which caused the alignments to have extra indels. Let me know if it happens with other datasets.
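To illustrate why a slightly wrong minimizer position can lead to extra indels, here is a minimal sketch. It is not GraphAligner's implementation: it uses a linear reference instead of a graph, picks the lexicographically smallest k-mer per window (real tools typically order k-mers by a hash), and the names `minimizers` and `seed_diagonals` are hypothetical.

```python
def minimizers(seq, k, w):
    """Return sorted (position, kmer) pairs: the smallest k-mer in each window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # Ties are broken by the leftmost position so the choice is deterministic.
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)

def seed_diagonals(read, ref, k=4, w=3):
    """For every minimizer shared by read and reference, return ref_position - read_position."""
    ref_positions = {}
    for pos, kmer in minimizers(ref, k, w):
        ref_positions.setdefault(kmer, []).append(pos)
    diagonals = []
    for pos, kmer in minimizers(read, k, w):
        for ref_pos in ref_positions.get(kmer, []):
            diagonals.append(ref_pos - pos)
    return diagonals

# Toy sequences chosen for illustration only.
ref = "ACGTTGCATGACCTGAAGTC"
read = ref[5:17]  # an error-free read taken from offset 5
print(set(seed_diagonals(read, ref)))
```

For an error-free read, every seed agrees on a single offset between read and reference. If one seed reported its position one base off, its offset would differ by one from the others, and an aligner extending from that seed could only reconcile the difference by opening a gap, even though the read is an exact substring.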

liaoherui commented 4 years ago

Hi maickrau, thanks a lot for your help!

I found a problem similar to the one before. I simulated 607 reads from one path of the graph called "HCV_3118"; however, only 59 of the 607 reads can be aligned back to the graph with GraphAligner (1.0.10). I am not sure what is going on. By the way, I built the graph with SibeliaZ.

The vg graph and simulated reads are uploaded to Google Drive, and you can download them for testing.

Graph_and_Reads

If there are any problems with the data, just let me know :)