medvedevgroup / TwoPaCo

A fast constructor of the compressed de Bruijn graph from many genomes

Question about GFA format #10

Open rob-p opened 7 years ago

rob-p commented 7 years ago

Hi @ilyaminkin,

It's me again :). TwoPaCo has been working great, but I've run into a small issue regarding the GFA file, and I was wondering if you could clear up my confusion. I built a cdBG using TwoPaCo with k=31. Since the documentation states that k is the node size, I expected the cdBG to contain a list of segments (i.e., contigs) that overlap by k-1. However, in the resulting GFA file, all of the contigs instead seem to overlap by k (i.e., they show a 31M overlap). This causes some issues downstream, as we rely on the invariant that a k-mer (or its reverse complement) appears at most once in the cdBG. When the overlap is of size k, a given k-mer may appear once for every segment that shares that overlap.
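As a rough illustration of the invariant described above, the sketch below counts k-mers across two toy segment sets. The segments, the small `k = 3`, and the helper names `kmers` and `kmer_counts` are made up for this example and are not TwoPaCo output; reverse complements are ignored to keep it short.

```python
from collections import Counter

def kmers(seq, k):
    """Return every k-mer of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_counts(segments, k):
    """Count each k-mer across all segments (forward strand only)."""
    counts = Counter()
    for seg in segments:
        counts.update(kmers(seg, k))
    return counts

k = 3
# Toy segments (hypothetical); both sets spell the sequence ACGGTCAAT.
overlap_k = ["ACGGTC", "GTCAAT"]         # segments share "GTC" (k bases)
overlap_k_minus_1 = ["ACGGTC", "TCAAT"]  # segments share "TC" (k-1 bases)

# Keep only the k-mers that occur more than once.
dup_k = {km: c for km, c in kmer_counts(overlap_k, k).items() if c > 1}
dup_k_minus_1 = {km: c for km, c in kmer_counts(overlap_k_minus_1, k).items() if c > 1}

print(dup_k)           # {'GTC': 2}
print(dup_k_minus_1)   # {}
```

With an overlap of k, the shared k-mer is counted once per segment; with an overlap of k-1, no k-mer fits inside the shared region, so each k-mer is counted once.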

Have I misunderstood something about the expected format of this graph? Is there an easy way to obtain the cdBG GFA file such that the overlaps are retained as k-1 bases instead of k?

Thanks! Rob

iminkin commented 7 years ago

Hi @rob-p ,

I understand your confusion. The issue is that we initially adopted the edge-centric definition of the graph, i.e. sequences are spelled by edges, with nodes of size $k$ and edges of size $k + 1$. This is due to historical reasons and to having a specific application in mind. In GFA, however, sequences are spelled by nodes, and edges merely indicate overlaps. To output GFA, TwoPaCo turns the compacted edges of the graph into nodes (segments in GFA terminology), hence they have size at least $k + 1$ and overlap by $k$. So if you want a node-centric graph with nodes of length $k$, run TwoPaCo with $k - 1$, if that is possible.
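A minimal sketch of why segments come out overlapping by the TwoPaCo parameter, assuming a single linear sequence and hand-picked junction positions. The function names `split_at_junctions` and `overlap`, the toy sequence, and the junction lists are hypothetical and are not taken from TwoPaCo's code. Each compacted edge runs from one junction node to the next and includes both, so neighbouring segments share a whole junction node.

```python
def split_at_junctions(seq, junction_starts, k):
    """Spell one segment per pair of consecutive junctions.

    Each segment starts at one junction k-mer and ends with the next one,
    so its length is at least k + 1.
    """
    return [seq[a:b + k] for a, b in zip(junction_starts, junction_starts[1:])]

def overlap(left, right):
    """Length of the longest suffix of left that is a prefix of right."""
    for n in range(min(len(left), len(right)), 0, -1):
        if left[-n:] == right[:n]:
            return n
    return 0

seq = "ACGGTCAAT"  # toy sequence, not real data

# Parameter 3: junction nodes assumed at positions 0, 3 and 6.
segs_k3 = split_at_junctions(seq, [0, 3, 6], 3)
# Parameter 2 (that is, k - 1): junction nodes assumed at positions 0, 4 and 7.
segs_k2 = split_at_junctions(seq, [0, 4, 7], 2)

print(segs_k3, overlap(*segs_k3))  # ['ACGGTC', 'GTCAAT'] 3
print(segs_k2, overlap(*segs_k2))  # ['ACGGTC', 'TCAAT'] 2
```

In this toy setting, running with the parameter set to $k - 1$ yields segments that overlap by $k - 1$, which is the node-centric layout asked for in the question.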

Again, sorry for the confusion. I am aware that it comes up all the time (https://www.biostars.org/p/175058/). I plan to improve the documentation to clear things up (I even put it in for 0.9.3: https://github.com/medvedevgroup/TwoPaCo/blob/master/NEWS.md). I just didn't expect people to start using TwoPaCo right away :)

rob-p commented 7 years ago

Hi @ilyaminkin,

Yup, I understand the confusion here as well. We have often gone back and forth between preferring the node-centric and edge-centric views of the dBG.

I guess my concern with the proposed temporary solution (running with $k-1$) is that we want nodes to have an odd size, which means $k-1$ will always be even. For example, we want nodes of size $k=31$, so I'd have to run TwoPaCo with $k=30$. According to the documentation, $k$ must be odd. Is this, in fact, the case?
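One common motivation for odd node sizes (an assumption here, since the thread does not state the reason) is that a k-mer of odd length can never equal its own reverse complement, while even lengths admit such palindromes. The sketch below checks this by brute force for small k; the helper names are made up for the example.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]

def count_self_rc(k):
    """Count k-mers over ACGT that equal their own reverse complement."""
    total = 0
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            total += 1
    return total

for k in (2, 3, 4, 5):
    print(k, count_self_rc(k))
# 2 4
# 3 0
# 4 16
# 5 0
```

For odd k the middle base would have to be its own complement, which no base is, so the count is zero; for even k the first half of the k-mer determines the second, which gives $4^{k/2}$ palindromes.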

Thanks for the quick responses! Rob

iminkin commented 7 years ago

@rob-p I was afraid the odd/even issue was going to pop up. I will think about it and try to make a fix soon.