iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License

Make index a single, binary file #276

Closed mbhall88 closed 1 year ago

mbhall88 commented 3 years ago

...I was discussing with Zam that the index is really a binary file, not meant for users to view. However, the save and load functions in pandora store it in a text format. We also have one kmer graph per PRG, so we also generate thousands of kmer graph files, split into folders so that each folder holds at most 4k of them... and there might be more information about the index we would like to store (like w and k). Since it is effectively a binary file, I was thinking that we could serialise the whole index object (and its kmer graphs) to disk using, for example, https://www.boost.org/doc/libs/1_74_0/libs/serialization/doc/index.html . I think it will be faster to read/write a single large file to disk than thousands of small files; we don't need to worry about handling a huge number of files; whatever we add to the Index class will be automatically serialised and deserialised (no need to change anything); we don't need custom load/save functions (less code is better); etc. This still falls in the code maintenance category, although it might improve performance a little bit... What are your opinions on this?

Originally posted by @leoisl in https://github.com/rmcolq/pandora/issues/225#issuecomment-686330911
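For reference, a minimal sketch of the Boost serialization idea (the Index and KmerGraph classes and their members below are hypothetical placeholders, not pandora's actual types):

```cpp
#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/string.hpp>
#include <boost/serialization/vector.hpp>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct KmerGraph {
    std::string name;
    std::vector<uint64_t> minimizers;   // placeholder payload

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & name;
        ar & minimizers;
    }
};

struct Index {
    uint32_t w = 14;                    // window size
    uint32_t k = 15;                    // k-mer size
    std::vector<KmerGraph> kmer_graphs; // one per PRG

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & w;
        ar & k;
        ar & kmer_graphs;               // any member added here is handled automatically
    }
};

void save_index(const Index& idx, const std::string& path) {
    std::ofstream ofs(path, std::ios::binary);
    boost::archive::binary_oarchive oa(ofs);
    oa << idx;                          // the whole index becomes one binary file
}

Index load_index(const std::string& path) {
    std::ifstream ifs(path, std::ios::binary);
    boost::archive::binary_iarchive ia(ifs);
    Index idx;
    ia >> idx;
    return idx;
}
```

The point of the sketch is that whatever is added to the serialize() functions is written and read back automatically, and the whole index round-trips through a single binary file.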

mbhall88 commented 3 years ago

As we start to use Pandora in "application land", reducing the amount of filesystem bloat it produces is crucial. At the moment I am building a tool that uses Pandora, and managing all of the index kmer graphs can be a pain. The most annoying thing, however, is the rigidity of the index filename, which has to have the k-mer and window size encoded in it, e.g. myprg.k15.w14.idx.

mbhall88 commented 3 years ago

This looks like a good resource: https://isocpp.org/wiki/faq/serialization

leoisl commented 3 years ago

Yeah, we definitely should not store metadata in the filename. I've slightly changed my opinion, though. A single binary file representing a set of binary files is fine, but that is not what the index is. The index itself is a binary file, but the satellite data (i.e. the kmer PRG GFAs, a metadata file we want to add, and any other files) are text data that we can open and read, load into Bandage, etc. If we save it all as a single binary file, it becomes a bit inflexible: if we want to extract one kmer PRG GFA to check something, we will have to provide a CLI that takes this binary file as input and outputs the GFA, and we will have to do the same for every file we might want to extract from the single binary index file. I am now favouring something like HDF5, where we could represent the exact directory structure we have today, but in a single file. We would get the advantage of a single file instead of thousands of files, while still being able to open the HDF5 file and navigate the directory structure we have today. Do you have any opinions on this?
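For illustration, a sketch of how the current directory layout could map onto HDF5 groups using the HDF5 C++ API (the file, group, dataset, and attribute names here are invented, not pandora's actual layout):

```cpp
#include <H5Cpp.h>
#include <cstdint>
#include <string>

int main() {
    // One file instead of thousands; metadata as attributes instead of the filename.
    H5::H5File file("pandora_index.h5", H5F_ACC_TRUNC);
    H5::DataSpace scalar(H5S_SCALAR);

    const uint32_t w = 14, k = 15;
    file.createAttribute("w", H5::PredType::NATIVE_UINT32, scalar)
        .write(H5::PredType::NATIVE_UINT32, &w);
    file.createAttribute("k", H5::PredType::NATIVE_UINT32, scalar)
        .write(H5::PredType::NATIVE_UINT32, &k);

    // A group mirroring today's kmer_prgs/ directory, one dataset per GFA.
    H5::Group kmer_prgs = file.createGroup("/kmer_prgs");
    const std::string gfa = "H\tVN:Z:1.0\n";              // stand-in for a kmer PRG GFA
    H5::StrType str_type(H5::PredType::C_S1, H5T_VARIABLE);
    kmer_prgs.createDataSet("gene1.gfa", str_type, scalar).write(gfa, str_type);
    return 0;
}
```

With a layout like this, a single kmer PRG GFA can still be pulled out with standard tools such as h5dump, without a pandora-specific CLI.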

mbhall88 commented 3 years ago

I probably still favour a single binary file. As you say, we would need to add a CLI tool to extract/query things from the index, but that shouldn't be too difficult as it would just involve reusing the code to load the index anyway.
HDF5 is a pretty heavy dependency. Also, given GATB uses its own HDF5 I suspect there will be weird compiler/linker problems if we want to use our own HDF5 too (a la Boost).
But, you are the more experienced C++ dev so I am happy to concede to you if you feel there is a stronger case for HDF5.

leoisl commented 3 years ago

I am not a much more experienced C++ dev than you, and experience does not count for much here, I think. Arguments matter a lot more, so don't concede if you disagree; counter-arguing will be more productive.

Yeah, it is a heavy dependency, but hopefully Hunter will manage it all for us with some 2 or 3 lines in CMakeLists.txt. You are right about the GATB conflict. We link GATB statically into our binary, which I think means its HDF5 libraries are also linked statically... so if we want to use HDF5 in pandora, it seems we would be forced to use GATB's HDF5. I am not sure; we would have to test.

In summary, I think that using HDF5 is a better approach to organising a hierarchy of files in a single file than just dumping everything into a single binary file, but at the same time the tooling can get complicated (or not). If, for some reason, the tooling is easy (i.e. Hunter takes care of everything and there are no conflicts with GATB), then I vote to use HDF5. Otherwise, I guess we avoid it...

rant time: I would be so happy with C++ if it just had Rust's or Python's tooling; I would not even bother with the other advantages of Rust. Time invested in tooling now feels like a bit of a waste to me: it is a problem that already has very good solutions in this field...

mbhall88 commented 3 years ago

Even if Hunter can handle it in a couple of lines, won't it increase compile time quite a bit? That's more what I was referring to.

iqbal-lab commented 3 years ago

Is the life cycle to read once, process in RAM, make new local PRG "files" modifying the main file, then read them again? With many processes simultaneously reading many sub-parts of the file? It's worth thinking through the use patterns.

This blog is interesting also

https://cyrille.rossant.net/should-you-use-hdf5/

leoisl commented 3 years ago

Even if Hunter can handle it in a couple of lines, won't it increase compile time quite a bit? That's more what I was referring to.

In my opinion (which could be wrong), I don't think we should worry much about compile time, as few users will compile from source. We provide precompiled binaries, conda (very soon), and containers, and with these 3 options there is no need for the user to compile. And if there is, they only need to do it once. For developers, we download/compile the library once and then just link against it. IIRC, HDF5 does not have a high compile time either, although I am no longer sure we should go for it...

Is the life cycle to read once, process in RAM, make new local PRG "files" modifying the main file, then read them again? With many processes simultaneously reading many sub-parts of the file? It's worth thinking through the use patterns.

I am not sure we should think about a specific life cycle. We want to use a single file instead of thousands of files for the index. We also have the same issue of thousands of files for map, discover, compare, etc., which we will want to improve. I think we should treat this as a general way of representing a tree structure of files in a single file, with threads being able to read and write (sub-parts of) that file.

This blog is interesting also

https://cyrille.rossant.net/should-you-use-hdf5/

Yeah, I did not know of these shortcomings of HDF5...

Options

Ok, I think it is productive to rewind and restate our options, and try to list the good and bad points of each. I will add a new approach, which is a middle ground between a single binary file and HDF5. And we should remember that we will apply this approach not only to index, but also to compare, map, and discover. We have a spectrum from "no dependency, but having to implement lots of stuff" to "heavy dependency, but almost everything is done":

1. Single binary file (no dependency, but having to implement lots of stuff)

2. Tar file (middle ground)

3. HDF5 (heavy dependency, but not having to implement lots of stuff)

My opinion

I think I have missed some good and bad points for the 3 approaches, so please add any you think of. The more approaches and arguments we have, the better our choice will be.

I am leaning towards the tar approach. It is a very standard file format, it does exactly what we want (represent a tree structure of files in a single file), and there are lightweight libraries that already do much of the work, so we don't need to implement many things ourselves. There are several APIs for working with tar, and much of what we need is probably already implemented in one of them.
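As a rough idea of how little code the tar route might need, here is a sketch using libarchive (one candidate library; the archive name, paths, and file contents are placeholders):

```cpp
#include <archive.h>
#include <archive_entry.h>
#include <string>

// Write one in-memory "file" into a single tar archive; real code would loop
// over all kmer PRG GFAs and any metadata files.
int main() {
    struct archive* a = archive_write_new();
    archive_write_set_format_pax_restricted(a);        // a portable tar variant
    archive_write_open_filename(a, "pandora_index.tar");

    const std::string path = "kmer_prgs/01/gene1.gfa";  // hypothetical layout
    const std::string contents = "H\tVN:Z:1.0\n";       // stand-in for a GFA

    struct archive_entry* entry = archive_entry_new();
    archive_entry_set_pathname(entry, path.c_str());
    archive_entry_set_size(entry, contents.size());
    archive_entry_set_filetype(entry, AE_IFREG);
    archive_entry_set_perm(entry, 0644);
    archive_write_header(a, entry);
    archive_write_data(a, contents.data(), contents.size());
    archive_entry_free(entry);

    archive_write_close(a);
    archive_write_free(a);
    return 0;
}
```

A nice side effect is that the result stays inspectable with plain tar -tf / tar -xf, so no pandora-specific extraction CLI would be needed.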

The single binary approach gives us the most control, as we would implement and test every feature we need ourselves. But I think we have more important features to add to pandora, so I prefer not to implement things we could get from libraries that already have most of what we need. IMO this would be the best approach if we had lots of developers and time, but that is not the case...

The HDF5 approach is the cleanest, I think, from an engineering perspective... but our lack of experience in it is what now makes me hesitate to go towards it.

My vote is towards the tar approach. What are yours?

iqbal-lab commented 3 years ago

that's a long post, i'll split my replies as i process it. But this:

"I am not sure we should think about a specific life cycle. We want to use a single file, instead of thousands of files for index."

i disagree with. If this is a storage format that we write to in one blob and read from in one blob, that affects what we need of a library. If we want rapid repeated random access to bits inside it, then that also tells us something. If we want to repeatedly modify those files, that also is something. How well hdf5 or other things support these options and support multithreading will affect how well they do what we need.

For comparison, look at the problems we had with multithreading gatb

leoisl commented 3 years ago

i disagree with. If this is a storage format that we write to in one blob and read from in one blob, that affects what we need of a library. If we want rapid repeated random access to bits inside it, then that also tells us something. If we want to repeatedly modify those files, that also is something. How well hdf5 or other things support these options and support multithreading will affect how well they do what we need.

yep, makes sense. I now also disagree with myself

mbhall88 commented 3 years ago

The single binary file option doesn't have to be implemented entirely by hand. We can just use one of the many serialization/deserialization (serde) libraries. Yes, we would need to tell it how to serialise our types, but there probably isn't that much work in doing so, as most libraries already handle the common data structures.

Some popular libraries:

iqbal-lab commented 3 years ago

Right now, after chatting with @mbhall88 this morning, i think i agree with Michael that a single immutable index file containing the actual PRG index makes a lot of sense, with a header encoding key information (e.g. w, k). Command-line args never need to mention w and k after they are initially specified to build the index.

The directory structure containing MSAs, and "whatever stuff make-prg uses to allow rapid update", does need to exist and will need to be repeatedly accessed and written to, but this is not the index (and i think not what Michael was talking about). So there is a secondary question: should we do something to wrap/compress this so we don't have so much file I/O? I think that might be separate from the original issue here, unless someone feels/argues it is important these are all wrapped into one file. hdf5 makes me nervous, and it's not obvious to me that the index and the MSAs etc need to be in one object.
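To make the "header encoding key information" idea concrete, here is one possible sketch (the field names, magic string, and layout are illustrative only; a real implementation would serialise fields explicitly rather than dumping the raw struct, to stay portable across compilers and endianness):

```cpp
#include <cstdint>
#include <cstring>
#include <fstream>

// Fixed-size header at the start of a single, immutable index file, so w and k
// never need to live in the filename or on the command line again.
struct IndexHeader {
    char magic[8];     // e.g. "PANIDX\0\0", identifies the file type
    uint32_t version;  // format version, for forwards/backwards compatibility
    uint32_t w;        // window size the index was built with
    uint32_t k;        // k-mer size the index was built with
    uint64_t num_prgs; // number of kmer PRGs that follow in the body
};

void write_header(std::ofstream& out, uint32_t w, uint32_t k, uint64_t num_prgs) {
    IndexHeader h{};
    std::memcpy(h.magic, "PANIDX\0", 8);
    h.version = 1;
    h.w = w;
    h.k = k;
    h.num_prgs = num_prgs;
    // NOTE: writing the raw struct is only a sketch; a portable format would
    // write each field explicitly with a fixed endianness.
    out.write(reinterpret_cast<const char*>(&h), sizeof(h));
}
```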

leoisl commented 3 years ago

Yeah, IMO you are both right, and I was missing some obvious engineering points:

  1. Using immutable binary files simplifies the implementation a LOT and will surely give better performance (i.e. if we need to update the index, we don't update it in place - we create a new, updated index file). This also makes sense semantically: when we run commands and give them inputs, we want new outputs and our inputs left intact, which is better than a command changing its input. So there is no need for our binary files to be mutable - let's keep them immutable, as you said. I was trying to think of good ways to have mutable binary input files, and there is no need for that;

  2. It is critical for scalability that we must be able to load just part of the binary file into RAM, not the whole binary file. I was looking for libraries that would allow this and forgot the essential point: many of them are compatible with mmap. If we can mmap a binary file, the OS itself takes care of this for us, i.e. it loads only the parts of the file we access. With mmap we can even mutate the binary file and it should work fine. The issue is that mmap might behave differently on OSX... so we may need conditional compilation for Linux and OSX. It is going to be a real pain to always have to support OSX... (a minimal mmap sketch follows below).
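A minimal POSIX mmap sketch of the on-demand loading described in point 2 (the index filename is a placeholder, and a real reader would interpret the mapped bytes rather than just peek at one):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char* path = "pandora.idx";   // hypothetical single-file index
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    // The OS pages in only the parts of the file that are actually touched.
    void* data = mmap(nullptr, static_cast<size_t>(st.st_size), PROT_READ,
                      MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // Touching one byte faults in a single page, not the whole index.
    unsigned char first = static_cast<const unsigned char*>(data)[0];
    std::printf("first byte: 0x%02x of %lld mapped bytes\n", first,
                static_cast<long long>(st.st_size));

    munmap(data, static_cast<size_t>(st.st_size));
    close(fd);
    return 0;
}
```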

Given these two things, and your other arguments, I am now leaning towards the single binary file as the best option. What really got my attention is FlatBuffers, posted by @mbhall88, especially this:

Access to serialized data without parsing/unpacking - What sets FlatBuffers apart is that it represents hierarchical data in a flat binary buffer in such a way that it can still be accessed directly without parsing/unpacking, while also still supporting data structure evolution (forwards/backwards compatibility).

So it is basically a simplified HDF5 in a single binary file, and it was developed for game development and other performance-critical applications. The benchmarks are great, and I think this really satisfies all our needs. It is also compatible with mmap:

Memory efficiency and speed - The only memory needed to access your data is that of the buffer. It requires 0 additional allocations (in C++, other languages may vary). FlatBuffers is also very suitable for use with mmap (or streaming), requiring only part of the buffer to be in memory.

So my vote is now on single binary files, using FlatBuffers.
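For concreteness, a sketch of what the FlatBuffers route might look like. The schema below is invented, and the Create*/Get* functions only exist after running flatc --cpp on that schema to produce index_generated.h:

```cpp
// index.fbs (hypothetical schema, compiled with: flatc --cpp index.fbs):
//   table KmerGraph { name: string; gfa: string; }
//   table Index { w: uint; k: uint; kmer_graphs: [KmerGraph]; }
//   root_type Index;
#include "index_generated.h"   // generated by flatc from the schema above
#include <flatbuffers/flatbuffers.h>
#include <fstream>
#include <vector>

int main() {
    flatbuffers::FlatBufferBuilder builder;

    // Build one (toy) kmer graph entry and the index that owns it.
    auto graph = CreateKmerGraphDirect(builder, "gene1", "H\tVN:Z:1.0\n");
    std::vector<flatbuffers::Offset<KmerGraph>> graphs{graph};
    auto index = CreateIndexDirect(builder, /*w=*/14, /*k=*/15, &graphs);
    builder.Finish(index);

    // One write, one file: the buffer on disk is the in-memory representation.
    std::ofstream out("pandora.idx", std::ios::binary);
    out.write(reinterpret_cast<const char*>(builder.GetBufferPointer()),
              builder.GetSize());
    return 0;
}
```

Reading back is then just GetIndex() over the raw (possibly mmap-ed) buffer, e.g. GetIndex(buf)->kmer_graphs()->Get(0)->gfa()->str(), with no parsing step.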

mbhall88 commented 3 years ago

It is critical for scalability that we must be able to load just part of the binary file into RAM

Don't we have to load the whole index into memory regardless? i.e. in order to be able to quasi-map reads to it.

So my vote is now on single binary files, using FlatBuffers.

I haven't looked extensively at the libraries I posted, but I guess anything maintained by Google is probably going to be performant and well documented. Happy to use this too if you feel it is the most suitable.

iqbal-lab commented 3 years ago

Michael is strictly talking about the index, and i think Leandro is talking about other bits. Future scaling of compare may involve a subgraph (e.g. one gene) in RAM but analysed across thousands of samples. So, a multicoloured gene on a cluster node. Scaling to many samples across all genes would use too much RAM. But this is about the kmer graph, not the index. Or Leandro might be talking about doing discover one gene at a time?

leoisl commented 3 years ago

No, I am just wrong; what Michael said is correct. Right now we need to load the whole index, along with all kmer PRGs, into RAM in order to quasi-map reads, which is what map, discover and compare do. We could improve RAM usage by mmap-ing the binary file and loading its pages on demand, but that seems an unnecessary over-optimisation. Another improvement is that compare writes hundreds or thousands of files when genotyping, but we don't even need a library to fix that. So, yeah, we don't need any fancy library: just put all the index info in a single binary file and always load it entirely into RAM.

mbhall88 commented 3 years ago

Michael is strictly talking about the index, and i think Leandro is talking about other bits. Future scaling of compare may involve a subgraph (e.g. one gene) in RAM but analysed across thousands of samples. So, a multicoloured gene on a cluster node. Scaling to many samples across all genes would use too much RAM. But this is about the kmer graph, not the index. Or Leandro might be talking about doing discover one gene at a time?

To be explicit, when I say index, I mean the index file plus the kmer graphs (GFAs)

leoisl commented 2 years ago

This is needed for scalability, added label

leoisl commented 1 year ago

Closed via https://github.com/rmcolq/pandora/pull/318