iqbal-lab-org / make_prg

Code to create a PRG from a Multiple Sequence Alignment file
Other
21 stars 7 forks source link

Gramtools compatibility + clearer CLI and output files #22

Closed bricoletc closed 3 years ago

bricoletc commented 3 years ago

In this PR I propose two things:

I'm keen to have discussion going on second point. For eg would it break existing workflows, and do we accept that price for the better usability i find is added here.

mbhall88 commented 3 years ago

Regarding your first point, does this mean the .prg and .bin files are not the same?

Regarding the second point, I'm fine with the proposed parameter changes but would say the default for -o should be the current directory.

I'm keen to have discussion going on second point. For eg would it break existing workflows, and do we accept that price for the better usability i find is added here.

Yes, it would probably break existing workflows, but that just mean's it is a minor version bump - i.e. notifying users there is an API change. Also, just make sure we add all of these changes in the changelog

bricoletc commented 3 years ago

Yes for first point, the .bin and .prg do represent the PRG using a different end-of-site marker. I don't know how much pandora actually uses the integer markers; do they just serve to construct a graph representation? In gramtools, they are also used for backward searching with the BWT.

I agree with your proposed change, added it in a commit and also added to the changelog

mbhall88 commented 3 years ago

Yes for first point, the .bin and .prg do represent the PRG using a different end-of-site marker. I don't know how much pandora actually uses the integer markers; do they just serve to construct a graph representation? In gramtools, they are also used for backward searching with the BWT.

Ok. Well I guess as long as we make sure it is very clear to users how the two files differ (unless @leoisl @iqbal-lab or @rmcolq see any issues). And maybe add a script into the repo that can convert between the two if that isn't too much trouble?
Ultimately I think we should move both gramtools and pandora to the same input - I think we spoke about rGFA right?

bricoletc commented 3 years ago

I'll happily make a script, or produce a second file, that is the same as the .prg, if it's requested, but as the .bin is only used by gramtools currently, i'd rather wait until it's needed.

Yes we did talk about using rGFA, but I guess would need work then for our tools to parse. I'll create an issue though so we remember this

leoisl commented 3 years ago

Very sorry for the huge delay on answering this (just saw the mail as Brice merged these changes), last couple of weeks were complicated. All is fine with the PR by my POV, I just found a typo (th instead of the) in a comment, so nothing even deserving to open an issue to solve...

Answering some of the comments above, pandora relies a lot on the textual representation of the PRG, and is susceptible even to the spaces between site markers and sequences. This means that any change to this textual representation will incur changes in pandora. Changing from textual to binary representation will incur I guess lots of changes, but I do agree that the binary or rGFA representations are better than the textual one. Is it fine to keep producing this textual representation? Although there are better ways to represent the PRG, as already said, pandora just works with the textual representation and we have lots of higher priority issues to solve right now. Unifiying both pandora and gramtools to parse the same representation (binary or rGFA) is indeed the proper solution, but this would incur lots of changes in pandora for no performance or precision gain, so I'd say this is very low priority in pandora development. Any changes (CLI, or whatever) is fine from pandora perspective, as long as the textual representation keeps being produced.

bricoletc commented 3 years ago

That's what I though- no problem keeping the textual representation

As a PS, I pushed a bugfix to master as I let a bug slip in the binary file production (therefore only affected gramtools)