marschall-lab / panacus

Panacus is a tool for computing statistics for GFA-formatted pangenome graphs
MIT License

De Bruijn graph flag #49

Open heringerp opened 1 month ago

heringerp commented 1 month ago

It might make sense to introduce a parameter that tells panacus whether a graph is a variation graph or a de Bruijn graph. With this, we could make sure all commands work for both types of graphs, or at least tell the user if something won't work. We could also make the debug/warning statements we discussed on 2024-10-15 conditional on the graph type.

Do you agree with this @lucaparmigiani? Is there anything else to think about?

lucaparmigiani commented 1 month ago

Yes, I think this is a very nice idea.

Running panacus on a compacted de Bruijn graph requires the parameter k. Of course, we could retrieve it from the overlaps of the links in the GFA, but it might be better just to pass it as a parameter. Apart from this, I don't think there is any difference for the user.

danydoerr commented 1 month ago

If the user informs panacus that the input graph is a c(c)DBG, then for what it's worth, we can assume that k is equal to the length of the shortest node in the graph. That can be identified very fast, so there is no need for the user to specify it explicitly.
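For illustration, the shortest-node idea could be sketched as below. This is a minimal sketch, not panacus code: the function name `shortest_segment_length` is hypothetical, and it assumes tab-separated GFA1 `S` lines with the sequence in the third column. Since every node of a compacted graph spells at least one k-mer, the value it returns is only an upper bound on k.

```python
def shortest_segment_length(gfa_lines):
    """Return the length of the shortest segment sequence, or None if there is none.

    Hypothetical helper: assumes tab-separated GFA1 lines and skips
    segments whose sequence is given as '*'.
    """
    shortest = None
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # Only segment lines with an explicit sequence are considered.
        if fields[0] != "S" or len(fields) < 3 or fields[2] == "*":
            continue
        length = len(fields[2])
        if shortest is None or length < shortest:
            shortest = length
    return shortest


example = [
    "H\tVN:Z:1.0",
    "S\t1\tACGTACGTACGTA",
    "S\t2\tACGTACGTACGTACG",
    "L\t1\t+\t2\t+\t10M",
]
upper_bound_on_k = shortest_segment_length(example)
```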

lucaparmigiani commented 1 month ago

This is not necessarily the case. Since it is a compacted graph, all nodes may be longer than k. The most reliable way is to use the links:

L 3073 - 758274 + 10M
L 3073 - 962680 + 10M
...

All links will have the same overlap (in this case, 10M means that k is 11, for example).
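A minimal sketch of deriving k from the link overlaps, assuming tab-separated GFA1 `L` lines whose sixth column is a pure match CIGAR such as `10M`. The function name `k_from_links` is hypothetical and not part of panacus; the sketch also checks that all links agree, and it raises an error on a missing overlap (`*`) or on any other CIGAR form.

```python
import re

# Matches overlaps that consist of a single match operation, e.g. "10M".
_MATCH_ONLY = re.compile(r"^(\d+)M$")


def k_from_links(gfa_lines):
    """Infer k as (overlap length + 1) from the L lines; None if there are no links."""
    overlaps = set()
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "L" or len(fields) < 6:
            continue
        match = _MATCH_ONLY.match(fields[5])
        if match is None:
            raise ValueError(f"cannot infer k from overlap {fields[5]!r}")
        overlaps.add(int(match.group(1)))
    if not overlaps:
        return None
    if len(overlaps) > 1:
        raise ValueError(f"links disagree on overlap length: {sorted(overlaps)}")
    # Adjacent k-mers in a de Bruijn graph overlap by k - 1 characters.
    return overlaps.pop() + 1


links = [
    "L\t3073\t-\t758274\t+\t10M",
    "L\t3073\t-\t962680\t+\t10M",
]
k = k_from_links(links)
```

With the two links quoted above, the overlap is 10 and the inferred k is 11.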

heringerp commented 1 month ago

But this can still be checked very fast, right? So there would still be no need to let the user specify k?

lucaparmigiani commented 1 month ago

Yes, but the problem is that the matching cigar is optional:

http://gfa-spec.github.io/GFA-spec/GFA1.html

danydoerr commented 1 month ago

Thanks for pointing this out, @lucaparmigiani! Does Bifrost output the CIGAR string?

lucaparmigiani commented 1 month ago

Yes. Bifrost outputs the cigar. I am not so sure about other tools.

danydoerr commented 1 month ago

If no cigar string is given, the default assumption in GFA format is that the edge is blunt. I feel like we should assume that if c(c)DBGs are given, they must provide the cigar.

But yeah, an option that specifies k in absence of the cigar won't hurt either, would it?

lucaparmigiani commented 1 month ago

I agree. So if the parameter is passed, we just assume that we want to parse the graph as a cDBG; otherwise we rely on the CIGAR of the first link.
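The agreed precedence could look roughly like this. It is only a sketch under assumptions: `resolve_k` and its `user_k` argument are hypothetical names, not existing panacus options, and returning `None` for a missing, zero-length, or non-match overlap (meaning "do not treat the graph as a cDBG") is one possible choice rather than settled behaviour.

```python
import re

# Matches overlaps that consist of a single match operation, e.g. "10M".
_MATCH_ONLY = re.compile(r"^(\d+)M$")


def resolve_k(user_k, gfa_lines):
    """Return k for a cDBG, or None if the graph should not be treated as one.

    An explicitly passed k wins; otherwise only the first L line is inspected.
    """
    if user_k is not None:
        return user_k
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "L" or len(fields) < 6:
            continue
        match = _MATCH_ONLY.match(fields[5])
        # Assumption: '*', a zero-length overlap, or any other CIGAR form
        # means that no k can be derived from this link.
        if match is None or int(match.group(1)) == 0:
            return None
        return int(match.group(1)) + 1
    return None


gfa = [
    "S\t3073\tACGTACGTACGTA",
    "L\t3073\t-\t758274\t+\t10M",
]
k_explicit = resolve_k(31, gfa)
k_inferred = resolve_k(None, gfa)
```

Looking only at the first link keeps the check cheap, at the cost of not noticing links that disagree with it.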