jltsiren / simple-sds

Simple succinct data structures (in Rust)
MIT License
45 stars 7 forks source link

`fetch()`-backed access from `wasm32-unknown-unknown` #15

Closed adamnovak closed 4 months ago

adamnovak commented 1 year ago

It would be nice to make https://github.com/vgteam/sequenceTubeMap able to dispense with its server, and just make byte-range queries against a standard Web server hosting a directory of files.

That would probably mean learning to read the GBWT lazily in the browser. So we would want to be able to build simple-sds for the wasm32-unknown-unknown target (since memory64 is still behind flags in major browsers at the moment).

The README says that usize needs to be 64-bit; is there a reason we can't just s/usize/u64/g the codebase (or at least most of it) and then build for a 32-bit usize platform?

The other part of this would be implementing the lazy access. Memory-mapped access already is implemented as a RawVectorMapper and IntVectorMapper, but the higher level data structures like BitVector, RankSupport and so on don't have their own -Mapper versions, and there's no way to swap out RawVector and Vec at the bottom of the stack for the memory-mapped versions.

There's also thus no way to swap out RawVector and Vec at the bottom of the stack for something that could stream the backing file over the network, even though I think we never actually need to deal in pointers into data. If the whole stack was made generic over a backend, like the libbdsg::yomo stuff in libbdsg is, it might be possible to swap out storage of data in normal memory for mediated access to data, and then let Rust code take care of paging in and out what is needed.

Then the question would be if the object hierarchy can all be materialized as needed: If we want to load a GBWT or whatever, would we need to make a few BitVectors or would we need thousands or hundreds of thousands? I'm assuming the latter, so we might need a way to have a virtual collection type, backed by a Simple-SDS vector, that could be used instead of live, fully-instantiated Vec collections in the data structures.

jltsiren commented 1 year ago

Rust standard library uses usize for sizes, offsets, and similar things, so it's not possible to avoid using it entirely. If the conversion from u64 to usize is lossy, there are many subtle things that may go wrong. I had to deal with similar issues in the late 2000s, and it's not something I want to do again. Especially because the whole situation with WebAssembly should fix itself in a year or two.

The "simple" part in Simple-SDS is largely about avoiding user-facing generic types. They were convenient in SDSL if you wanted to benchmark different parameterizations of the same data structure, but otherwise they were one of my least favorite features of the library. As Simple-SDS is basically "the parts of SDSL I use, in the way I prefer using them", that's not something I want to change.

There are no memory-mapped versions of bitvectors, mostly because the rank/select support structures for BitVector are optional. If they are not present in the file but the user needs them, they must be built by reading the entire bitvector. And, what may become relevant with GBWT/GBZ files, the support structures that are present could be from another library and hence incompatible with Simple-SDS. (I haven't implemented the mechanism for dealing with that yet.)

In the short term, using https://github.com/mlin/gfabase as the backend for Sequence Tube Maps should be the easiest option. If we need GBWT-specific functionality or if the database grows too large with future graphs, replacing it with a "gbzbase" should be easy enough.

I'm not sure about the total number of structures in a GBWT/GBZ file, but it's not large. The smallest GBZ files with all features enabled are less than 2 kilobytes, and the number of structures does not grow with the size of the data.

jltsiren commented 4 months ago

PR #18 kind of resolved the issue for this specific use case.