jltsiren / simple-sds

Simple succinct data structures (in Rust)
MIT License
47 stars 8 forks source link

Best way to use memory mapping with a SparseVector #8

Open GSGerritsen opened 3 years ago

GSGerritsen commented 3 years ago

Hello,

I'm hoping to do something like this:

struct Example {
    ... misc fields,
   path: PathBuf
}

Where path points to the location of a serialized SparseVector. Is it possible to use MemoryMap in some way with SparseVector such that I can do something like: let sparse_vector = SparseVector::load(memory_mapped_sparse_vector)?

I'm hoping to be able to do things like sparse_vector.rank(10) etc without having to load in a potentially huge SparseVector in its entirety from disk.

In addition, I'd also like to be able to append to the SparseVector - is this also possible to do with memory mapping where I can avoid loading the whole thing, writing to it, then serializing it back?

Thanks in advance for any guidance!

jltsiren commented 3 years ago

Implementing a memory-mapped SparseVector would be straightforward but tedious. You can duplicate code, much in the same way as RawVectorMapper and IntVectorMapper duplicate the in-memory RawVector and IntVector. Alternatively, you can use generics in BitVector and SparseVector to have them use either in-memory or memory-mapped backing structures, but the code will probably get quite ugly.

Appending to a SparseVector is possible – SparseBuilder already does that for in-memory structures. First you choose divisor d, which is hardcoded to be a power of 2. The optimal choice depends on knowing the density of values in advance. Then you append the values in increasing order:

However, due to the way the structures have been implemented, you would have to use two separate files, with a RawVectorWriter for high and IntVectorWriter for low, and concatenate the files afterwards. Also, queries require select support structures for high, and those structures are immutable. Mutable alternatives do exist, but their time/space trade-offs are worse. Finally, the current memory mapping implementation assumes read-only files, and even my plans for mutable files assume that the size of the file does not change.

The overall design of simple-sds reflects how I have been using SDSL data structures in bioinformatics applications. In-memory data structures are assumed to be gigabytes or tens of gigabytes in size and often shared by tens of threads. The writer structures are intended for writing temporary files, while memory-mapped structures are used for reading them.