iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License

Changing pandora index format #318

Closed (leoisl closed this 1 year ago)

leoisl commented 1 year ago

This PR changes the pandora index from a set of files in a directory structure to a single, compressible and indexable zip file (pandora indexes now have the suffix .panidx.zip). This is now the single file produced by the pandora index command, and it is required as an argument to all the other pandora commands. The index is self-contained in the sense that it encodes all the information and metadata about itself (e.g. which PRGs were used to create it, window and kmer size, etc). This new index provides the infrastructure for the next features and simplifies working with large reference pangenome collections containing a few million PRGs. These changes will be released as pandora v0.11.0.
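
To illustrate why a single zip archive is both self-contained and indexable, here is a minimal Python sketch. It is not pandora's actual code or on-disk layout: the member names (`metadata.json`, `prgs/<name>.txt`), the JSON metadata format, and the helper functions are all hypothetical, chosen only to show that metadata and one individual PRG can be read without unpacking the rest of the archive.

```python
import io
import json
import zipfile


def build_toy_index(prgs, window_size, kmer_size):
    """Pack PRGs plus metadata into one in-memory zip (toy layout, NOT pandora's)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Hypothetical metadata member: parameters and PRG names travel with the index.
        metadata = {
            "window_size": window_size,
            "kmer_size": kmer_size,
            "prg_names": sorted(prgs),
        }
        archive.writestr("metadata.json", json.dumps(metadata))
        # One member per PRG, so each can be located independently later.
        for name, sequence in prgs.items():
            archive.writestr(f"prgs/{name}.txt", sequence)
    buffer.seek(0)
    return buffer


def load_metadata(index_file):
    """Read only the metadata member from the archive."""
    with zipfile.ZipFile(index_file) as archive:
        return json.loads(archive.read("metadata.json"))


def load_prg(index_file, name):
    """Read a single PRG; the zip central directory locates it without a full unpack."""
    with zipfile.ZipFile(index_file) as archive:
        return archive.read(f"prgs/{name}.txt").decode()
```

For example, after `index = build_toy_index({"geneA": "ACGT", "geneB": "TTGA"}, window_size=14, kmer_size=15)`, calling `load_prg(index, "geneB")` decompresses only that one member, which is the property that makes loading PRGs on demand practical for very large collections.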

Closes https://github.com/rmcolq/pandora/issues/308, https://github.com/rmcolq/pandora/issues/307 and https://github.com/rmcolq/pandora/issues/306

Breakdown of main changes

Sorry that this is another big PR, but half of the changes can be ignored as they are just updating the example data. Here is a breakdown of the main changes:

Changelog of next release

[0.11.0-alpha.0]

Changed

Removed

Fixed

Added

leoisl commented 1 year ago

Thanks for the comments, both were addressed! However, I am hesitant to pre-release this, as I've not tested on real data, and we're delaying writing unit tests for the new code for later. The main contribution of this PR is to provide the infrastructure for the lazy loading of PRGs feature, which is essential to roundhound. That feature should be finished sometime next week; I will then test it with the roundhound dataset and will be more confident about pre-releasing it.

mbhall88 commented 1 year ago

However, I am hesitant to pre-release this, as I've not tested on real data, and we're delaying writing unit tests for the new code for later.

Sure. Do a pre-release whenever you think it makes sense (i.e. when there is a complete set of functionality for the new features).