Gramtools compatibility + clearer CLI and output files

bricoletc commented 3 years ago

In this PR I propose two things:

(Should be invisible for non-gramtools people): Change slightly how the binary PRG string is represented; gramtools represents an A/T snp as 5A6T6 (notice the last char is a 6, not a 5). It was a pain to represent as 5A6T5, I had to run scripts to modify. We might want to have the same representation across pandora and gramtools, but right now the binary string is only used by gramtools anyway i think.
I propose removing the -p/--prefix argument to from_msa subcommand, which enabled changing the 'sample name' of the output files, and adding in -o/--output_dir and -n/--prg_name arguments. I also propose no longer writing out max_nesting and min_match_length as prefixes to the prg output files. I find this much more intuitive to use.

Example: make_prg from_msa test.msa -o my_dir -n my_prg makes files my_prg.prg, my_prg.gfa and my_prg.bin inside a directory called my_dir. The parameters used are available in my_prg.log and in the fasta header of my_prg.prg.

Sensible defaults to -o and -n are provided: -o defaults to the directory where the MSA file is, and -n to the stem of the MSA filename (here, test)

I'm keen to have discussion going on second point. For eg would it break existing workflows, and do we accept that price for the better usability i find is added here.

mbhall88 commented 3 years ago

Regarding your first point, does this mean the .prg and .bin files are not the same?

Regarding the second point, I'm fine with the proposed parameter changes but would say the default for -o should be the current directory.

I'm keen to have discussion going on second point. For eg would it break existing workflows, and do we accept that price for the better usability i find is added here.

Yes, it would probably break existing workflows, but that just mean's it is a minor version bump - i.e. notifying users there is an API change. Also, just make sure we add all of these changes in the changelog

bricoletc commented 3 years ago

Yes for first point, the .bin and .prg do represent the PRG using a different end-of-site marker. I don't know how much pandora actually uses the integer markers; do they just serve to construct a graph representation? In gramtools, they are also used for backward searching with the BWT.

I agree with your proposed change, added it in a commit and also added to the changelog

mbhall88 commented 3 years ago

Yes for first point, the .bin and .prg do represent the PRG using a different end-of-site marker. I don't know how much pandora actually uses the integer markers; do they just serve to construct a graph representation? In gramtools, they are also used for backward searching with the BWT.

Ok. Well I guess as long as we make sure it is very clear to users how the two files differ (unless @leoisl @iqbal-lab or @rmcolq see any issues). And maybe add a script into the repo that can convert between the two if that isn't too much trouble?
Ultimately I think we should move both gramtools and pandora to the same input - I think we spoke about rGFA right?

bricoletc commented 3 years ago

I'll happily make a script, or produce a second file, that is the same as the .prg, if it's requested, but as the .bin is only used by gramtools currently, i'd rather wait until it's needed.

Yes we did talk about using rGFA, but I guess would need work then for our tools to parse. I'll create an issue though so we remember this

leoisl commented 3 years ago

Very sorry for the huge delay on answering this (just saw the mail as Brice merged these changes), last couple of weeks were complicated. All is fine with the PR by my POV, I just found a typo (th instead of the) in a comment, so nothing even deserving to open an issue to solve...

Answering some of the comments above, pandora relies a lot on the textual representation of the PRG, and is susceptible even to the spaces between site markers and sequences. This means that any change to this textual representation will incur changes in pandora. Changing from textual to binary representation will incur I guess lots of changes, but I do agree that the binary or rGFA representations are better than the textual one. Is it fine to keep producing this textual representation? Although there are better ways to represent the PRG, as already said, pandora just works with the textual representation and we have lots of higher priority issues to solve right now. Unifiying both pandora and gramtools to parse the same representation (binary or rGFA) is indeed the proper solution, but this would incur lots of changes in pandora for no performance or precision gain, so I'd say this is very low priority in pandora development. Any changes (CLI, or whatever) is fine from pandora perspective, as long as the textual representation keeps being produced.

bricoletc commented 3 years ago

That's what I though- no problem keeping the textual representation

As a PS, I pushed a bugfix to master as I let a bug slip in the binary file production (therefore only affected gramtools)

iqbal-lab-org / make_prg

Gramtools compatibility + clearer CLI and output files #22