Closed leoisl closed 1 year ago
Thanks for the comments, both were addressed! However, I am hesitant to pre-release this, as I've not tested on real data, and we're delaying writing unit tests for the new code for later. The main contribution of this PR is to provide the infrastructure for the lazy loading of PRGs feature, which is essential to roundhound. This feature should be finished sometime next week; I will then test it with the roundhound dataset and be more confident about pre-releasing it.
> However, I am hesitant to pre-release this, as I've not tested on real data, and we're delaying writing unit tests for the new code for later.
Sure. Do a pre-release whenever you think it makes sense (i.e. when there is a complete set/functionality for the new features).
This PR changes the pandora index from a set of files in a directory structure to a single, compressible and indexable `zip` file (`pandora` indexes now have the suffix `.panidx.zip`). This is now the single file that is produced by the `pandora index` command and is required as an argument to all the other `pandora` commands. The index is self-contained in the sense that it encodes all the information and metadata about itself (e.g. which PRGs were used to create it, window and kmer size, etc.). This new index provides the infrastructure for the next features and simplifies working with large reference pangenome collections with a few million PRGs. These changes will be released as `pandora v0.11.0`.

Closes https://github.com/rmcolq/pandora/issues/308 https://github.com/rmcolq/pandora/issues/307 https://github.com/rmcolq/pandora/issues/306
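To illustrate why an indexable zip archive suits lazy loading, here is a minimal Python sketch (not pandora's actual C++ implementation). It builds a toy in-memory archive and reads a single member by name without unpacking the rest. The helper functions and the `prg_<i>.gfa` entry names are hypothetical, chosen only for this example.

```python
import io
import zipfile


def build_toy_archive(entries):
    """Write a name -> text mapping into an in-memory zip and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, text in entries.items():
            archive.writestr(name, text)
    return buffer.getvalue()


def read_member(archive_bytes, name):
    """Read one member by name; only that member is decompressed."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        # The zip central directory maps each name to its offset in the file,
        # so the other members are never touched.
        return archive.read(name).decode("utf-8")


# Hypothetical per-PRG entries; names and contents are made up for this sketch.
toy_entries = {f"prg_{i}.gfa": f"H\tVN:Z:1.0\tprg={i}\n" for i in range(1000)}
toy_archive = build_toy_archive(toy_entries)
one_graph = read_member(toy_archive, "prg_42.gfa")
```

Because each member is located through the archive's central directory, the cost of fetching one entry does not depend on decompressing the others, which is the property a lazy loader would rely on.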
Breakdown of main changes
Sorry that this is another big PR, but half of the changes can be ignored as they are just updating the example data. Here is a breakdown of the main changes:
- `example/` dir can be ignored, I just updated the example files to the latest version;
- `CMakeLists.txt`;
- `*_main(.h/.cpp)` files were changed to receive a pandora index (`*.panidx.zip`) instead of a fasta file containing PRGs;
- `compare_main(.h/.cpp)`, `map_main(.h/.cpp)`, `discover_main(.h/.cpp)`, `seq2path_main(.h/.cpp)` all changed in a similar way to remove CLI parameters `-w` and `-k` (we now get this metadata from the index);
- `pandora` index implementation (`index(.h/.cpp)`);
- `localPRG_reader(.h/.cpp)`;
- `merge_index` subcommand (`merge_index_main(.h/.cpp)`);
- `zip_file(.h/.cpp)`;

Changelog of next release
[0.11.0-alpha.0]

Changed
- `pandora` index changed from a set of files in a directory structure to a single, compressible and indexable `zip` file (`pandora` indexes now have the suffix `.panidx.zip`). This is now the single file that is produced by the `pandora index` command and is required as an argument to all the other `pandora` commands. The index is self-contained in the sense that it encodes all the information and metadata about itself (e.g. which PRGs were used to create it, window and kmer size, etc.). This new index provides the infrastructure for the next features and simplifies working with large reference pangenome collections with a few million PRGs. This new index breaks backwards compatibility with previous `pandora` versions. The structure of this zip archive is as follows:
  - `_prgs`: the PRGs themselves used as input to create this index;
  - `_prg_names`: the names of the PRGs;
  - `_prg_min_path_lengths`: the length of the shortest path through each PRG;
  - `_minhash`: the minimizer hash data structure;
  - `_metadata`: metadata about the index (first line is window size, second is kmer size);
  - `*.gfa`: the several GFA files describing the minimizing kmer graph for each PRG;
- `C++11` to `C++14`;
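The archive layout listed under Changed can be checked with any zip library. The following Python sketch assumes the listed entries sit at the archive root under exactly those names and that `_metadata` is plain text; the real on-disk encoding may differ. `summarise_index` and `make_toy_index` are hypothetical helpers, and the toy archive holds placeholder contents rather than real pandora data.

```python
import io
import zipfile

# Entry names taken from the changelog; how they are stored is an assumption.
REQUIRED_ENTRIES = ("_prgs", "_prg_names", "_prg_min_path_lengths", "_minhash", "_metadata")


def read_window_and_kmer_size(archive):
    """Return (w, k) from `_metadata`: line 1 is window size, line 2 is kmer size."""
    lines = archive.read("_metadata").decode("utf-8").splitlines()
    return int(lines[0]), int(lines[1])


def summarise_index(archive_bytes):
    """Check the expected entries exist and collect w, k and the GFA file names."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        names = archive.namelist()
        missing = [name for name in REQUIRED_ENTRIES if name not in names]
        if missing:
            raise ValueError(f"not a complete index, missing: {missing}")
        w, k = read_window_and_kmer_size(archive)
        gfa_files = sorted(name for name in names if name.endswith(".gfa"))
    return {"w": w, "k": k, "gfa_files": gfa_files}


def make_toy_index(w, k):
    """Build a tiny stand-in archive with placeholder contents (not real pandora data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name in REQUIRED_ENTRIES:
            content = f"{w}\n{k}\n" if name == "_metadata" else "placeholder\n"
            archive.writestr(name, content)
        archive.writestr("toy_gene.gfa", "H\tVN:Z:1.0\n")
    return buffer.getvalue()


# Toy values for w and k, used only to exercise the sketch.
summary = summarise_index(make_toy_index(w=14, k=15))
```

Reading the window and kmer size from the archive in this way mirrors why the `-w` and `-k` parameters are no longer needed on the subcommands that consume an index.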
Removed
- `-w` and `-k` from the following `pandora` subcommands: `compare`, `discover`, `map`, `seq2path`;
- `merge_index` subcommand;

Fixed
- `pandora` index implementation;

Added