Closed leoisl closed 1 year ago
I am all for the index being merged into a single file. This is also somewhat related to #276
Totally related to #276, which I completely forgot about. We had some proposals there: single binary file, tar file, hdf5, etc. I think hdf5 would be the best implementation, but it is too much work and a heavy dependency for something simple. It also makes it a bit more complicated for us to debug things or explore the index ourselves. All we want is to represent a set of files as a single self-contained file. I like the tar file approach, but a tar file has no index, so we can't randomly access a single file inside it; we can only do sequential access, which is a deal-breaker for our application. So I am leaning towards a zip file, which is like a tar file but with an index, allowing random access to the files inside the zip, and which can also provide compression if we want. It is also a very standard file format.
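To illustrate the random-access point, here is a minimal Python sketch using the standard `zipfile` module. It is not pandora code (pandora is C++), and the member names and contents are made up for illustration. It only shows that one member can be read by name through the zip's central directory, without walking the archive from the start.

```python
import io
import zipfile

# Hypothetical member names and contents, for illustration only.
members = {
    "index/a.txt": b"first file",
    "index/b.txt": b"second file",
    "index/c.txt": b"third file",
}

# Write all members into a single in-memory zip archive, with compression.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name, payload in members.items():
        archive.writestr(name, payload)

# Reopen the archive and fetch one member directly by name.
# The lookup goes through the central directory stored at the end of the zip.
with zipfile.ZipFile(buffer, "r") as archive:
    listed = archive.namelist()
    second = archive.read("index/b.txt")

print(listed)
print(second)
```

A tar archive has no such directory, so locating one member generally means scanning headers from the beginning of the file.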
After searching through the various C/C++ libraries that can work with zip files, I narrowed it down to 3:
1. [...] C++ [...] input stream, and simple ways to write data. It will be very simple to read and write zip archives with this. My only hesitation is that it is a new repo and not widely used;
2. [...] C library, so the interface is very C-ish. It mostly involves using low-level functions to read and write data, so it will be slower to implement what we need with this library. But I think it is a nice backup plan if option 1 does not work;
3. [...] C++ interface and is also heavily used and tested. The only issue is that it is a huge library that does a thousand other things (think of it as something like boost).
I think options 1 and 2 seem the best. Both seem to have good documentation too. Happy for you to make an executive decision based on which you'd rather work with...
Agreed. Go ahead and make your choice @leoisl
Ok, I am going for option 1 then as it is the quickest way to get this done and I can move to the most important issue https://github.com/rmcolq/pandora/issues/305
After working this afternoon on option 1 (1.5h just to include it and sort out CMake issues!) I am not sure it is an option anymore. After reading a bit of the code and some issues, I found that 64-bit zip files are not supported, which means our zip files can't be larger than 4 GB. The 188k index, when zipped, takes 659 MB (7 GB unzipped). Linearly extrapolating, the 1M index would take ~3.5 GB, very close to the limit. I think it is not worth risking it, so I am switching to a library that supports 64-bit zip files.
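The extrapolation above can be checked with a few lines of Python. The 659 MB figure and the 188k and 1M index sizes come from the comment. Treating MB as decimal megabytes, and assuming the zipped size scales linearly with the number of entries, are assumptions of this sketch. The 32-bit limit is the largest value that fits in the 4-byte size fields of a non-zip64 archive.

```python
# Largest value a 32-bit size field can hold in a non-zip64 archive.
ZIP32_LIMIT_BYTES = 2**32 - 1

zipped_mb_188k = 659        # zipped size of the 188k index (from the comment)
entries_small = 188_000     # "188k" index
entries_large = 1_000_000   # "1M" index

# Linear extrapolation: assumes zipped size grows proportionally with entries.
estimated_mb_1m = zipped_mb_188k * entries_large / entries_small

# Assumption: MB means 10**6 bytes.
limit_mb = ZIP32_LIMIT_BYTES / 1e6
headroom = 1 - estimated_mb_1m / limit_mb

print(f"estimated zipped size: {estimated_mb_1m:.0f} MB")
print(f"32-bit zip limit:      {limit_mb:.0f} MB")
print(f"headroom:              {headroom:.1%}")
```

Under these assumptions the estimate leaves under 20% of headroom, so any deviation from linear growth could push the archive past what a 32-bit zip can represent.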
Does libarchive support the 64-bit zip files?
I think I finally managed to get efficient zip writing/reading in pandora. This is not ideal at all, and will probably need refactoring in the future, but it is what we have right now. I definitely underestimated our requirements for efficiently saving and loading the index as a zip file, or the C++ libraries I've looked at just weren't there yet. So, our main requirements are:
[...] zip files. Otherwise we could simply write everything to disk, and then call a zip command line tool from pandora to zip everything. But I think this would defeat the purpose of avoiding the creation of several files, and would greatly increase disk requirements;
So in the end I chose to use two zip libraries to support our use case: libarchive to write zip files (supports 1, 2 and 4) and libzip to read zip files (supports 2 and 3). This is quite annoying, and I am sure somewhere there is a library that supports everything... or we should have actually switched to hdf5...
Omg how annoying
Man that is super frustrating. Great work though.
Closed via https://github.com/rmcolq/pandora/pull/318
The pandora index is a set of files that need to live together: we need the .prg.fa file (the PRGs themselves), the .prg.fa.kxx.wyy.idx (the index itself), and the kmer_prgs/**/*.kxx.wyy.gfa files (the kmer graphs, one file per PRG). We will probably add another file with the PRG.SP_length for each PRG (see https://github.com/rmcolq/pandora/issues/305). This structure is not ideal for very large PRGs, e.g. for our main use case in https://github.com/rmcolq/pandora/issues/304/, indexing will create ~1M kmer_prgs/**/*.kxx.wyy.gfa files, which makes the index hard to distribute and move. Also, when implementing https://github.com/rmcolq/pandora/issues/305, we will need to subsample the .prg.fa file to only the PRGs that remain in the index, as well as restructure the kmer_prgs/**/*.kxx.wyy.gfa files. This can quickly get messy.

The kmer_prgs/**/*.kxx.wyy.gfa files could well be transformed into a single multi-GFA file, as we don't randomly access these GFAs, we only read them sequentially. Then we end up with a handful of files, which we can combine into a single zip file, so that we can access each file quickly and also get compression.

pandora index shall create these .zip files and these will be consumed by the other pandora commands. pandora subindex will read one pandora index (.zip file) and will create a subsampled pandora index (another .zip file). These files are self-contained, so pandora subindex will just need to modify internal files inside this zip, and all the data the other pandora commands need to map reads to the index will also be in this zip file.
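
The index/subindex flow described above could look roughly like the following Python sketch. It is only a toy illustration using the standard `zipfile` module, not pandora's actual C++ implementation or on-disk layout. The member names (`toy.prg.fa`, `kmer_prgs/<name>.gfa`), the helper functions `write_index` and `subindex`, and the placeholder GFA contents are all hypothetical.

```python
import io
import zipfile

FASTA_MEMBER = "toy.prg.fa"  # hypothetical member name


def gfa_member(name):
    """Hypothetical path of the per-PRG graph file inside the archive."""
    return f"kmer_prgs/{name}.gfa"


def write_index(target, prgs):
    """Write a toy index zip: one FASTA-like member plus one GFA member per PRG."""
    # allowZip64=True (the default) lets the archive grow past the 32-bit limits.
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED, allowZip64=True) as archive:
        fasta = "".join(f">{name}\n{seq}\n" for name, seq in prgs.items())
        archive.writestr(FASTA_MEMBER, fasta)
        for name in prgs:
            # Placeholder content: just a GFA header line.
            archive.writestr(gfa_member(name), "H\tVN:Z:1.0\n")


def subindex(source, target, keep):
    """Create a new zip holding only the PRGs named in `keep`."""
    with zipfile.ZipFile(source, "r") as src, \
            zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as dst:
        lines = src.read(FASTA_MEMBER).decode().splitlines()
        # The toy FASTA alternates header and sequence lines.
        kept = [(h, s) for h, s in zip(lines[0::2], lines[1::2]) if h[1:] in keep]
        dst.writestr(FASTA_MEMBER, "".join(f"{h}\n{s}\n" for h, s in kept))
        for header, _ in kept:
            member = gfa_member(header[1:])
            dst.writestr(member, src.read(member))  # copy the graph file as-is


full = io.BytesIO()
write_index(full, {"geneA": "ACGT", "geneB": "GGCC", "geneC": "TTAA"})

sub = io.BytesIO()
subindex(full, sub, keep={"geneA", "geneC"})

with zipfile.ZipFile(sub) as archive:
    sub_members = archive.namelist()
    sub_fasta = archive.read(FASTA_MEMBER).decode()

print(sub_members)
print(sub_fasta)
```

The sketch writes the subsampled index as a new archive rather than editing the original in place, which matches the description of subindex producing another .zip file.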