jltsiren / simple-sds

Simple succinct data structures (in Rust)
MIT License

Enable build on `wasm32-wasi` #17

Closed by adamnovak 4 months ago

adamnovak commented 7 months ago

For https://github.com/vgteam/sequenceTubeMap/issues/379 I'm trying to get gbz-base to build for WebAssembly. It doesn't build at the moment, because simple-sds doesn't. Here are the first 8 errors it reports:

error[E0433]: failed to resolve: could not find `unix` in `os`
  --> /Users/anovak/.cargo/git/checkouts/simple-sds-95484d45b95fb50d/c2f8637/src/serialize.rs:59:14
   |
59 | use std::os::unix::io::AsRawFd;
   |              ^^^^ could not find `unix` in `os`

error[E0425]: cannot find value `PROT_READ` in crate `libc`
   --> /Users/anovak/.cargo/git/checkouts/simple-sds-95484d45b95fb50d/c2f8637/src/serialize.rs:430:44
    |
430 |             MappingMode::ReadOnly => libc::PROT_READ,
    |                                            ^^^^^^^^^ not found in `libc`

error[E0425]: cannot find value `PROT_READ` in crate `libc`
   --> /Users/anovak/.cargo/git/checkouts/simple-sds-95484d45b95fb50d/c2f8637/src/serialize.rs:431:43
    |
431 |             MappingMode::Mutable => libc::PROT_READ | libc::PROT_WRITE,
    |                                           ^^^^^^^^^ not found in `libc`

error[E0425]: cannot find value `PROT_WRITE` in crate `libc`
   --> /Users/anovak/.cargo/git/checkouts/simple-sds-95484d45b95fb50d/c2f8637/src/serialize.rs:431:61
    |
431 |             MappingMode::Mutable => libc::PROT_READ | libc::PROT_WRITE,
    |                                                             ^^^^^^^^^^ not found in `libc`

error[E0425]: cannot find function `mmap` in crate `libc`
   --> /Users/anovak/.cargo/git/checkouts/simple-sds-95484d45b95fb50d/c2f8637/src/serialize.rs:433:34
    |
433 |         let ptr = unsafe { libc::mmap(ptr::null_mut(), len, prot, libc::MAP_SHARED, file.as_raw_fd(), 0) };
    |                                  ^^^^ not found in `libc`

error[E0425]: cannot find value `MAP_SHARED` in crate `libc`
   --> /Users/anovak/.cargo/git/checkouts/simple-sds-95484d45b95fb50d/c2f8637/src/serialize.rs:433:73
    |
433 |         let ptr = unsafe { libc::mmap(ptr::null_mut(), len, prot, libc::MAP_SHARED, file.as_raw_fd(), 0) };
    |                                                                         ^^^^^^^^^^ not found in `libc`

error[E0425]: cannot find function `munmap` in crate `libc`
   --> /Users/anovak/.cargo/git/checkouts/simple-sds-95484d45b95fb50d/c2f8637/src/serialize.rs:491:27
    |
491 |             let _ = libc::munmap(self.ptr.cast::<libc::c_void>(), self.len);
    |                           ^^^^^^ not found in `libc`

   Compiling rusqlite v0.29.0
error[E0599]: no method named `as_raw_fd` found for struct `File` in the current scope
   --> /Users/anovak/.cargo/git/checkouts/simple-sds-95484d45b95fb50d/c2f8637/src/serialize.rs:433:90
    |
433 |         let ptr = unsafe { libc::mmap(ptr::null_mut(), len, prot, libc::MAP_SHARED, file.as_raw_fd(), 0) };
    |                                                                                          ^^^^^^^^^ method not found in `File`
   --> /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/os/fd/raw.rs:65:8
    |
    = note: the method is available for `File` here
    |
    = help: items from traits can only be used if the trait is in scope
help: the following trait is implemented but not in scope; perhaps add a `use` for it:
    |
53  + use std::os::fd::AsRawFd;
    |

I think I need to:

jltsiren commented 7 months ago

I could make memory mapping a feature that is enabled by default but can be disabled. However, the bigger issue is that simple-sds leans heavily on the assumption that usize and u64 are the same. Many low-level things will probably break with 32-bit integers.
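To illustrate the usize/u64 concern, here is a minimal sketch of a checked conversion. The helper names `checked_len` and `fits_in_u32` are hypothetical and are not part of simple-sds. On a 64-bit target the conversion always succeeds; on a 32-bit target such as wasm32 it fails for any value of 2^32 or more, which is where code that casts with `as usize` would silently truncate instead.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not from simple-sds): converts a 64-bit length read
/// from a serialized structure into `usize`, reporting an error instead of
/// truncating when the target's `usize` is too narrow.
fn checked_len(serialized: u64) -> Result<usize, String> {
    usize::try_from(serialized).map_err(|_| {
        format!(
            "length {} does not fit in a {}-bit usize",
            serialized,
            usize::BITS
        )
    })
}

/// Hypothetical helper: tells whether a value would survive on a target
/// where `usize` is 32 bits wide, so the check can be tried on a 64-bit host.
fn fits_in_u32(value: u64) -> bool {
    u32::try_from(value).is_ok()
}
```

For example, `fits_in_u32(1u64 << 32)` is `false`, so a structure with that many elements could not be indexed on wasm32.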

Additionally, Rust uses usize for array indexing, which means that it's difficult to use arrays with more than 2^32 elements in a 32-bit environment. We can't import GBZs with more than ~4.29 Gbp of sequence, such as human graphs built with PGGB or full Minigraph–Cactus graphs. We also can't import GBZs where the run-length encoded BWT is larger than 4 GB, such as 1000GP graphs (and possibly final HPRC graphs with 700+ haplotypes).

adamnovak commented 7 months ago

We might be able to get away with the max size limitations. I was thinking we'd convert from GBZ to database outside the browser, so all we really need is to be able to properly decode the blobs in the database files.

And if we want to use the databases in the browser, and if we need simple-sds to decode the blobs in the databases, then I don't know if there's an alternative to painstakingly unwinding the assumption that usize is u64 in the code that actually implements the data structures.

I managed to get simple-sds to build for wasm32-wasi with liberal use of #[cfg(not(target_family = "wasm"))]. Hopefully once I can get the full gbz-base binaries to link and load right I can start identifying places where the two builds can't agree on serialized representations.
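As a rough illustration of that kind of gating, the sketch below compiles a different definition of the same function depending on the target family. The function names are made up for this example and do not come from simple-sds; the actual changes may be organized quite differently.

```rust
/// Hypothetical example: on non-wasm targets, report that memory mapping
/// (mmap and the unix-only `AsRawFd`) is available.
#[cfg(not(target_family = "wasm"))]
fn mmap_supported() -> bool {
    true
}

/// On wasm targets, the mmap-based code is compiled out entirely and this
/// fallback definition is used instead.
#[cfg(target_family = "wasm")]
fn mmap_supported() -> bool {
    false
}

/// Hypothetical caller that picks a loading strategy for the current target.
fn load_strategy() -> &'static str {
    if mmap_supported() {
        "memory-mapped"
    } else {
        "buffered read"
    }
}
```

Only one of the two `mmap_supported` definitions exists in any given build, so the unix-only imports can sit behind the same attribute without breaking the wasm build.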

jltsiren commented 7 months ago

I don't think gbz-base will need anything from simple-sds once the database has been built. The sizes and identifiers of individual objects should fit in 32 bits, because we often do that in vg as well. The blobs are encoded either using gbwt::support::ByteCode / gbwt::support::RLE, which don't care about the size of usize as long as the numbers fit in it, or an internal encoding that packs three bases in a byte.
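To show why a byte-oriented integer code is largely independent of the width of usize, here is a generic variable-length encoding sketch with 7 data bits per byte and the high bit as a continuation flag. This is an assumption made for illustration: it is not the gbwt crate's API, and the exact layout of gbwt::support::ByteCode may differ. The functions `encode` and `decode` are hypothetical.

```rust
/// Appends `value` to `out`, 7 bits at a time, least significant bits first.
/// The high bit of each byte is set when more bytes follow.
fn encode(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push((value & 0x7F) as u8 | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Decodes one value starting at `*pos` and advances `*pos` past it.
/// Returns `None` if the input ends early or the value needs more than
/// 10 bytes. Excess high bits in a 10th byte are ignored in this sketch.
fn decode(bytes: &[u8], pos: &mut usize) -> Option<u64> {
    let mut result = 0u64;
    let mut shift = 0u32;
    while let Some(&byte) = bytes.get(*pos) {
        *pos += 1;
        if shift >= 64 {
            return None;
        }
        result |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return Some(result);
        }
        shift += 7;
    }
    None
}
```

The decoded values are plain `u64`; only the byte offset `pos` is a `usize`, so a blob smaller than 4 GB decodes the same way on 32-bit and 64-bit targets as long as each value is used in a type that can hold it.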

jltsiren commented 4 months ago

I think PR #18 also resolved this.