iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License

Refactor pandora index into a single, self-contained file #306

Closed · leoisl closed this 1 year ago

leoisl commented 1 year ago

The pandora index is a set of files that need to live together: the .prg.fa file (the PRGs themselves), the .prg.fa.kxx.wyy.idx file (the index itself), and the kmer_prgs/**/*.kxx.wyy.gfa files (the kmer graphs, one file per PRG). We will probably add another file with the PRG.SP_length for each PRG (see https://github.com/rmcolq/pandora/issues/305). This structure is not ideal for very large PRG sets: e.g. for our main use case in https://github.com/rmcolq/pandora/issues/304/, indexing will create ~1M kmer_prgs/**/*.kxx.wyy.gfa files, which makes the index hard to distribute and move. Also, when implementing https://github.com/rmcolq/pandora/issues/305, we will need to subsample the .prg.fa file to only the PRGs that remain in the index, as well as restructure the kmer_prgs/**/*.kxx.wyy.gfa files. This can quickly get messy.

The kmer_prgs/**/*.kxx.wyy.gfa files could well be merged into a single multi-GFA file, as we never access these GFAs randomly, only sequentially. That leaves us with just a handful of files, which we can then combine into a single zip file, so that we can access each file quickly and also get compression.

pandora index shall create these .zip files, and they will be consumed by the other pandora commands. pandora subindex will read one pandora index (a .zip file) and create a subsampled pandora index (another .zip file). These files are self-contained, so pandora subindex will just need to modify the internal files inside the zip, and all the data the other pandora commands need to map reads to the index will also be in this single zip file.

mbhall88 commented 1 year ago

I am all for the index being merged into a single file. This is also somewhat related to #276

leoisl commented 1 year ago

Totally related to #276, which I had completely forgotten about. We had some proposals there: single binary file, tar file, HDF5, etc. I think HDF5 would be the best implementation, but it is too much work and a heavy dependency for something simple, and it also makes it a bit more complicated for us to debug things and explore the index ourselves. All we want is to represent a set of files as a single self-contained file. I like the tar approach, but we can't randomly access a single file inside a tar archive, only read it sequentially, which is a deal-breaker for our application. So I am leaning towards a zip file, which is like a tar file but with an index, allowing random access to the files inside the archive, and it can also provide compression if we want. It is also a very standard file format.

leoisl commented 1 year ago

After searching through the various C/C++ libraries that can work with zip files, I have narrowed it down to three:

  1. Zipios: by far the easiest one to use. It abstracts a lot of the low-level details of handling archives, gives us simple ways to read data from zip archives as a C++ input stream, and simple ways to write data, so reading and writing zip archives with it will be very easy (see the sketch after this list). My only hesitation is that it is a new repo and not widely used;
  2. libarchive: heavily used and tested, very stable. However, it is a C library, so the interface is very C-ish: it mostly involves low-level functions to read and write data, so implementing what we need will take longer with this library. But I think it is a nice backup plan if option 1 does not work;
  3. POCO: I think it is the best option: it provides a C++ interface and is also heavily used and tested. The only issue is that it is a huge library that does a thousand other things (think of it as something like Boost).
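To give an idea, here is a rough sketch (untested, and the archive/entry names are hypothetical) of what reading an index entry with Zipios could look like, based on its documented stream interface:

```cpp
// Sketch only: reading one entry of a zip archive as a C++ input stream with
// Zipios. The archive and entry names below are hypothetical.
#include <zipios/zipfile.hpp>
#include <iostream>
#include <string>

int main() {
    zipios::ZipFile archive("pandora_index.zip");

    // getInputStream() hands back a std::istream for the named entry, so any
    // existing std::istream-based loading code could be reused unchanged.
    auto in = archive.getInputStream("index.idx");

    std::string line;
    while (std::getline(*in, line)) {
        std::cout << line << '\n';
    }
    return 0;
}
```

The appeal is that the rest of the codebase only ever sees a std::istream, regardless of whether the data comes from a plain file or a zip entry.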
mbhall88 commented 1 year ago

Options 1 and 2 seem the best to me. Both seem to have good documentation too. Happy for you to make an executive decision based on which you'd rather work with...

iqbal-lab commented 1 year ago

Agreed. Go ahead and make your choice @leoisl

leoisl commented 1 year ago

OK, I am going for option 1 then, as it is the quickest way to get this done, and I can then move on to the most important issue, https://github.com/rmcolq/pandora/issues/305

leoisl commented 1 year ago

After working on option 1 this afternoon (1.5 h just to include it and sort out CMake issues!), I am not sure it is an option anymore. After reading a bit of the code and some issues, it turns out 64-bit zip files are not supported, which means our zip files can't be larger than 4 GB. The 188k index, when zipped, is 659 MB (7 GB unzipped). Extrapolating linearly, the 1M index would be about 659 MB × (1,000,000 / 188,000) ≈ 3.5 GB, very close to the limit. I don't think it is worth risking it, so I am switching to a library that supports 64-bit zip files.

mbhall88 commented 1 year ago

Does libarchive support 64-bit zip files?

leoisl commented 1 year ago

> Does libarchive support 64-bit zip files?

It seems so: https://github.com/libarchive/libarchive/blob/fa4b613f2e2510bd036f2eeed2fece97cd18b079/libarchive/archive_write_set_format_zip.c#L722-L738
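For reference, a minimal sketch (untested; archive and entry names hypothetical) of writing a zip with libarchive and explicitly turning on its Zip64 extensions via the documented "zip64" format option:

```cpp
// Sketch only: writing one entry into a zip with Zip64 enabled via libarchive.
// The archive and entry names below are hypothetical.
#include <archive.h>
#include <archive_entry.h>
#include <cstring>

int main() {
    struct archive *a = archive_write_new();
    archive_write_set_format_zip(a);
    // Force Zip64 extensions so the archive and its entries may exceed 4 GB.
    archive_write_set_options(a, "zip:zip64");
    archive_write_open_filename(a, "pandora_index.zip");

    const char *data = "some index payload\n";
    struct archive_entry *e = archive_entry_new();
    archive_entry_set_pathname(e, "index.idx");
    archive_entry_set_size(e, std::strlen(data));
    archive_entry_set_filetype(e, AE_IFREG);
    archive_entry_set_perm(e, 0644);
    archive_write_header(a, e);

    // Large payloads can be streamed in chunks via repeated
    // archive_write_data() calls.
    archive_write_data(a, data, std::strlen(data));

    archive_entry_free(e);
    archive_write_close(a);
    archive_write_free(a);
    return 0;
}
```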

leoisl commented 1 year ago

I think I finally managed to get efficient zip writing/reading working in pandora. This is not ideal at all and will probably need refactoring in the future, but it is what we have right now. I definitely underestimated our requirements for efficiently saving and loading the index as a zip file, or the C++ libraries I looked at just weren't there yet. Our main requirements are:

  1. Write directly to zip files. Otherwise we could simply write everything to disk and then call a zip command-line tool from pandora to zip everything, but I think that would defeat the purpose of avoiding the creation of many files, and it would considerably increase disk requirements;
  2. Support large files. For this, the library needs to support 64-bit zip files. This took Zipios out of the options;
  3. Support random access. Our pandora index zip file will contain several files: for now, mainly the file with the minimizer hashes and one GFA file per PRG, representing its kmer PRG, and we need to be able to access these files randomly. Unfortunately, libarchive does not support that, and I didn't realise this at the start. There is a PR from 2017 that tries to add the support (https://github.com/libarchive/libarchive/pull/963), but I don't really know when we will get it. However, another library, https://github.com/nih-at/libzip, supports both random access and large files;
  4. Support stream writing. Some of the files we need to write are quite large, e.g. the minimizer index itself and all the kmer PRGs, so we need a library that allows us to write just a small chunk of these files at a time, until we have written them all. libzip unfortunately does not support stream writing: you basically open a zip file and have to hold all the data you want to write to it in RAM. This is definitely not an option for us. However, libarchive supports this.

So in the end I chose to use two zip libraries to cover our use case: libarchive to write zip files (covers requirements 1, 2 and 4) and libzip to read zip files (covers requirements 2 and 3). This is quite annoying, and I am sure there is a library out there that supports everything... or maybe we should have switched to HDF5 after all...
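To show the read side, a minimal sketch (untested; archive and entry names hypothetical) of the random access libzip gives us, jumping straight to one entry without scanning the rest of the archive:

```cpp
// Sketch only: random access to a single entry inside a zip with libzip.
// The archive and entry names below are hypothetical.
#include <zip.h>
#include <vector>

int main() {
    int err = 0;
    zip_t *archive = zip_open("pandora_index.zip", ZIP_RDONLY, &err);
    if (archive == nullptr) return 1;

    // The zip central directory lets us stat and open any entry directly,
    // without reading the entries before it (impossible with tar).
    zip_stat_t st;
    if (zip_stat(archive, "kmer_prgs.gfa", 0, &st) != 0) return 1;

    zip_file_t *f = zip_fopen(archive, "kmer_prgs.gfa", 0);
    if (f == nullptr) return 1;

    std::vector<char> buf(st.size);
    zip_fread(f, buf.data(), buf.size());

    zip_fclose(f);
    zip_close(archive);
    return 0;
}
```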

iqbal-lab commented 1 year ago

Omg how annoying

mbhall88 commented 1 year ago

Man that is super frustrating. Great work though.

leoisl commented 1 year ago

Closed via https://github.com/rmcolq/pandora/pull/318