Closed bricoletc closed 3 years ago
Regarding your first point, does this mean the .prg and .bin files are not the same?
Regarding the second point, I'm fine with the proposed parameter changes but would say the default for -o should be the current directory.
I'm keen to have discussion going on second point. For eg would it break existing workflows, and do we accept that price for the better usability i find is added here.
Yes, it would probably break existing workflows, but that just means it is a minor version bump, i.e. notifying users there is an API change. Also, just make sure we add all of these changes to the changelog.
Yes for first point, the .bin and .prg do represent the PRG using a different end-of-site marker. I don't know how much pandora actually uses the integer markers; do they just serve to construct a graph representation? In gramtools, they are also used for backward searching with the BWT.
I agree with your proposed change, added it in a commit and also added to the changelog
Yes for first point, the .bin and .prg do represent the PRG using a different end-of-site marker. I don't know how much pandora actually uses the integer markers; do they just serve to construct a graph representation? In gramtools, they are also used for backward searching with the BWT.
Ok. Well I guess as long as we make sure it is very clear to users how the two files differ (unless @leoisl @iqbal-lab or @rmcolq see any issues). And maybe add a script into the repo that can convert between the two if that isn't too much trouble?
Ultimately I think we should move both gramtools and pandora to the same input - I think we spoke about rGFA right?
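For reference, a conversion between the two end-of-site conventions could look roughly like the sketch below. It is not code from make_prg or gramtools: the function name and the token-list input are hypothetical, and it assumes odd numbers are site markers, that each site's allele separator is its site marker plus one, and that marker numbers are unique per site (as in the 5A6T5 versus 5A6T6 example from the PR description).

```python
def close_sites_with_even_marker(tokens):
    """Rewrite a PRG token list so each site ends with its even marker.

    Assumption: an odd marker opens a site on first sight and closes it on
    second sight; the closing occurrence is replaced by marker + 1.
    Example: ['5', 'A', '6', 'T', '5'] becomes ['5', 'A', '6', 'T', '6'].
    """
    open_sites = set()  # odd markers seen once, i.e. sites still open
    converted = []
    for token in tokens:
        if token.isdigit() and int(token) % 2 == 1:
            marker = int(token)
            if marker in open_sites:
                # Second occurrence closes the site: emit the even marker.
                open_sites.remove(marker)
                converted.append(str(marker + 1))
                continue
            open_sites.add(marker)
        converted.append(token)
    return converted
```

Tracking open sites in a set lets the same pass handle nested sites, since an inner site opens and closes while the outer odd marker is still recorded as open.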
I'll happily make a script, or produce a second file, that is the same as the .prg, if it's requested, but as the .bin is only used by gramtools currently, i'd rather wait until it's needed.
Yes we did talk about using rGFA, but I guess would need work then for our tools to parse. I'll create an issue though so we remember this
Very sorry for the huge delay in answering this (just saw the mail as Brice merged these changes), the last couple of weeks were complicated. All is fine with the PR from my POV, I just found a typo (th instead of the) in a comment, so nothing even deserving of opening an issue to solve...
Answering some of the comments above, pandora relies a lot on the textual representation of the PRG, and is sensitive even to the spaces between site markers and sequences. This means that any change to this textual representation will require changes in pandora. Changing from textual to binary representation would, I guess, require lots of changes, but I do agree that the binary or rGFA representations are better than the textual one. Is it fine to keep producing this textual representation? Although there are better ways to represent the PRG, as already said, pandora just works with the textual representation and we have lots of higher priority issues to solve right now. Unifying both pandora and gramtools to parse the same representation (binary or rGFA) is indeed the proper solution, but this would require lots of changes in pandora for no performance or precision gain, so I'd say this is very low priority in pandora development. Any changes (CLI, or whatever) are fine from pandora's perspective, as long as the textual representation keeps being produced.
That's what I thought, no problem keeping the textual representation
As a PS, I pushed a bugfix to master as I let a bug slip in the binary file production (therefore only affected gramtools)
In this PR I propose two things:
1. (Should be invisible for non-gramtools people): Change slightly how the binary PRG string is represented; gramtools represents an A/T snp as 5A6T6 (notice the last char is a 6, not a 5). It was a pain to represent as 5A6T5, I had to run scripts to modify. We might want to have the same representation across pandora and gramtools, but right now the binary string is only used by gramtools anyway i think.
2. I propose removing the -p/--prefix argument to from_msa subcommand, which enabled changing the 'sample name' of the output files, and adding in -o/--output_dir and -n/--prg_name arguments. I also propose no longer writing out max_nesting and min_match_length as prefixes to the prg output files. I find this much more intuitive to use.
Example:
make_prg from_msa test.msa -o my_dir -n my_prg
makes files my_prg.prg, my_prg.gfa and my_prg.bin inside a directory called my_dir. The parameters used are available in my_prg.log and in the fasta header of my_prg.prg.
Sensible defaults to -o and -n are provided: -o defaults to the directory where the MSA file is, and -n to the stem of the MSA filename (here, test).
I'm keen to have discussion going on second point. For eg would it break existing workflows, and do we accept that price for the better usability i find is added here.