vgteam / xg

xg0: a simpler xg index

Unexpected behavior on converting between gfa and xg #4

Open 6br opened 5 years ago

6br commented 5 years ago

I tried bin/xg built at commit hash 6871a1e011954483e01ace8a517a78ba1a57b7d9.

The input file test.gfa is as follows.

H       VN:Z:1.0
S       1       CAAATAAG
S       2       A
S       3       G
S       4       T
S       5       C
S       6       TTG
S       7       A
S       8       G
S       9       AAATTTTCTGGAGTTCTAT
S       10      A
S       11      T
S       12      ATAT
S       13      A
S       14      T
S       15      CCAACTCTCTG
L       1       +       2       +       0M
L       1       +       3       +       0M
L       2       +       4       +       0M
L       2       +       5       +       0M
L       3       +       4       +       0M
L       3       +       5       +       0M
L       4       +       6       +       0M
L       5       +       6       +       0M
L       6       +       7       +       0M
L       6       +       8       +       0M
L       7       +       9       +       0M
L       8       +       9       +       0M
L       9       +       10      +       0M
L       9       +       11      +       0M
L       10      +       12      +       0M
L       11      +       12      +       0M
L       12      +       13      +       0M
L       12      +       14      +       0M
L       13      +       15      +       0M
L       14      +       15      +       0M
P       x       1+,3+,5+,6+,8+,9+,11+,12+,14+,15+       *,*,*,*,*,*,*,*,*
P       y       1+,2+,5+,6+,8+,9+,11+,12+,14+,15+       *,*,*,*,*,*,*,*,*
P       z       1+,2+,5+,6+,7+,9+,11+,12+,14+,15+       *,*,*,*,*,*,*,*,*

I ran the following commands in a shell.

$ bin/xg -o test.xg -g test.gfa
$ bin/xg -i test.xg --gfa-out

After that, I found that path z had been truncated: its last node, 15+, is missing from the output.

P       x       1+,3+,5+,6+,8+,9+,11+,12+,14+,15+       8M,1M,1M,3M,1M,19M,1M,4M,1M,11M
P       y       1+,2+,5+,6+,8+,9+,11+,12+,14+,15+       8M,1M,1M,3M,1M,19M,1M,4M,1M,11M
P       z       1+,2+,5+,6+,7+,9+,11+,12+,14+   8M,1M,1M,3M,1M,19M,1M,4M,1M
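
A truncation like this can be caught mechanically by comparing the step lists of the P lines before and after the round trip. The following is only a minimal sketch: `read_paths` and `diff_paths` are hypothetical helper names, not part of xg, and the parser splits on any whitespace so that it handles both real tab-separated GFA and the space-aligned listing pasted above.

```python
def read_paths(gfa_text):
    """Return {path_name: [step, ...]} for every P line in GFA1 text."""
    paths = {}
    for line in gfa_text.splitlines():
        # Assumption: fields contain no embedded spaces, so whitespace
        # splitting is equivalent to tab splitting for P lines.
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "P":
            paths[fields[1]] = fields[2].split(",")
    return paths


def diff_paths(before, after):
    """Return {path_name: (steps_before, steps_after)} for changed paths."""
    a, b = read_paths(before), read_paths(after)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    before = "P\tz\t1+,2+,15+\t*,*\n"
    after = "P\tz\t1+,2+\t8M,1M\n"
    for name, (old, new) in diff_paths(before, after).items():
        print(name, old, "->", new)
```

Only the segment column is compared, so a difference in the overlaps column alone (such as `*` versus `8M`) is not reported.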

ekg commented 5 years ago

Your input is incorrect. There are only 9 * elements, but 10 path elements.

It's annoying that we have to keep these two lists in sync. Maybe we can fix that in rGFA.

6br commented 5 years ago

I don't think it is incorrect, because it follows the GFA1 spec. According to the spec, the 4th column holds the overlaps between consecutive nodes on a path. As long as the path is linear, the number of overlaps is len(nodes) - 1, so it is natural that there are 9 elements. The example at the end of https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md is similar. I hope this kind of ambiguity can be resolved in rGFA.
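
The two readings discussed in this thread differ only in how many overlap entries a P line carries, so a small check can tell them apart. This is a rough sketch under stated assumptions: `overlap_convention` is a hypothetical helper, the labels "between-steps" and "per-step" are names invented here for the two interpretations, and a lone `*` is treated as an unspecified overlaps field.

```python
def overlap_convention(p_line):
    """Classify how the overlaps column of a GFA1 P line was written."""
    fields = p_line.split()
    if len(fields) < 4 or fields[0] != "P":
        raise ValueError("not a GFA1 P line")
    steps = fields[2].split(",")
    if fields[3] == "*":
        return "unspecified"    # a single '*' carries no count at all
    overlaps = fields[3].split(",")
    if len(overlaps) == len(steps) - 1:
        return "between-steps"  # one entry per junction of a linear path
    if len(overlaps) == len(steps):
        return "per-step"       # one entry per step, as in xg's output above
    return "mismatch"


if __name__ == "__main__":
    x_in = "P x 1+,3+,5+,6+,8+,9+,11+,12+,14+,15+ *,*,*,*,*,*,*,*,*"
    x_out = ("P x 1+,3+,5+,6+,8+,9+,11+,12+,14+,15+ "
             "8M,1M,1M,3M,1M,19M,1M,4M,1M,11M")
    print(overlap_convention(x_in))   # between-steps
    print(overlap_convention(x_out))  # per-step
```

A count equal to the number of steps is ambiguous on its own, since it would also fit a circular path under the between-steps reading; the sketch does not try to distinguish those cases.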

ekg commented 5 years ago

I thought this was the cigar between the path step and the node. If so, this means none of the GFA we have been making is correct.

ekg commented 5 years ago

Thanks for pointing this out. I guess the overlaps are being stored in the path because they aren't determined based on the graph topology of an assembly graph.

This does mean that all our GFA P lines are broken. But the fact that we weren't using these fields for any purpose also indicates how useless they were for our applications. In graphs with paths, these overlap/cigar descriptions are hugely expensive. I would love to get rid of them or make them optional. Perhaps *,*,*... is the best we can do. It's a required field. But, what tools actually use it? As far as I know, only variation graph tools care about the paths.

ekg commented 5 years ago

That said, the current setup of the gfakluge parser used by xg should work for the correct format and correctly parses your example.

6br commented 5 years ago

Thank you for considering my comment. We ran into this problem because https://github.com/graph-genome/vgbrowser currently uses pygfa to exchange data between the graph genome browser and xg, with GFA files as the intermediate format. Since pygfa raises errors on such discrepancies in GFA records, I feel that pygfa's validation is a little too strict for practical use cases. I would therefore appreciate it if we could replace our current implementation with direct communication with a lightweight xg server. If that goes well, we would no longer be affected by differences between GFA parsers.