d4tocchini opened this issue 6 years ago
A pure JS implementation is possible, but would require rewriting the EMS C library in JS. A partial list of that functionality includes:
The last item, persistent storage, isn't strictly necessary, but without it the application would need to re-create the in-memory dataset by reading the original data from disk every time the program was run. For datasets that are hundreds of gigabytes in size this is impractical. EMS' implementation directly leverages the OS' physical memory management, so the application can be restarted with no time penalty. Specifically, if the EMS data is already anywhere in memory for any reason (e.g., in an OS buffer cache, or because the application is still running), when the application is executed again the existing copy of the data is accessed without going to storage. In fact, the fastest way to load an entire EMS dataset from persistent storage (e.g., after a reboot) is to copy the EMS file to /dev/null.
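A minimal sketch of what that file-backed persistence looks like from the JS side; the option names here (dimensions, heapSize, useMap, filename, useExisting) follow the EMS README as I recall it and may not match the current API exactly:

```js
// Hedged sketch: a file-backed EMS array that survives process restarts.
const ems = require('ems')(1);            // single EMS process for illustration

const cache = ems.new({
  dimensions: [1000000],                  // one million elements
  heapSize: 512 * 1024 * 1024,            // heap space for string/JSON values
  useMap: true,                           // key-to-index mapping
  filename: '/tmp/dataset.ems',           // backing file on persistent storage
  useExisting: true                       // re-attach if the file already exists
});

cache.write('answer', 42);
console.log(cache.read('answer'));        // 42, even after the process is restarted
```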
Node's TypedArray is meant to provide low-level mechanisms that exist principally to make Emscripten possible. C/C++ operate on virtual linear addresses, which is what TypedArray buffers provide. That is fine for a compiler target, but not idiomatically useful for interoperating with JS variables.
to clarify, I wasn't looking for a pure JS implementation; that sounds like a lot of effort without the payoff. Rather, something like an EMS_TYPE_BUFFER, the most common situation being reading and writing img/media buffers. What would you suggest for this use case: stringify to base64, or not use EMS for media at all?
beyond that, buffer-like types would eliminate JSON parse/stringify serialization, a non-trivial impact, especially when trying to take advantage of the roomier 64-bit address space. I'm assuming buffers would make for more performant data transfer across the native/Node barrier than JSON strings, yes?
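to make the media case concrete, this is the base64 detour a Buffer has to take through a string-valued EMS slot today; `sharedArr` is just an illustrative EMS array handle, and the Buffer/base64 calls are standard Node:

```js
const fs = require('fs');

// sharedArr: an illustrative EMS array created elsewhere with useMap: true.
// Today a binary payload must detour through a base64 string, costing roughly
// 33% size inflation plus an encode/decode pass on every write and read.
const jpeg = fs.readFileSync('photo.jpg');                  // Buffer
sharedArr.write('photo', jpeg.toString('base64'));          // stored as a string
const roundTripped = Buffer.from(sharedArr.read('photo'), 'base64');

// An EMS_TYPE_BUFFER / EMS_TYPE_BINARY could copy the bytes in and out
// directly, skipping the string encoding entirely.
```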
on this note, the WebAssembly.Memory interface provides an auto-growing, paged memory buffer:
const mem = new WebAssembly.Memory({ initial: page_count });  // pages are 64 KiB each
const u8view = new Uint8Array(mem.buffer);
let view = new DataView(mem.buffer);
const f32 = view.getFloat32(ptr);
...
if (needed) {
  mem.grow(page_increment);         // grow() detaches the old ArrayBuffer,
  view = new DataView(mem.buffer);  // so views must be re-created
}
...
although built for zero-copy wasm transport, it's just a buffer with an auto-growing API that gets the wasm win-win if and when you need it. Buffers offer a lot of potential perf wins in vanilla JS land: this single growable buffer pool makes TypedArrays more flexible and gives a dynamic escape from garbage-collection issues. Data kept in buffers (if ergonomically possible) will always conserve CPU & RAM compared to parsing & allocating nested JSON. With a little effort you have a more natural foundation for tapping the GPU and in-memory columnar data fun (https://github.com/jpmorganchase/perspective/tree/master/packages/perspective). And the wasm world is home-growing an impressive OSS toolset as well. Basically, it opens up EMS to a larger, perf-oriented ecosystem.
as a general principle of efficiency, I prefer keeping as much data as possible as raw as possible. Replacing traditional JS allocations with pointer-like buffer indices minimizes V8 deopts by making functions intrinsically more monomorphic; a toy example is sketched below.
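a toy sketch of that idea in plain JS (nothing EMS-specific; all names here are made up):

```js
// Fixed-width records packed into one pre-allocated typed array; a record is
// identified by its base index rather than by a heap-allocated object.
const FIELDS = 3;                                // x, y, mass
const pool = new Float64Array(FIELDS * 100000);  // no per-record GC pressure

function setParticle(i, x, y, mass) {
  const base = i * FIELDS;
  pool[base] = x;
  pool[base + 1] = y;
  pool[base + 2] = mass;
}

// Always called with a small integer and only ever reads doubles, so the
// function stays monomorphic and never allocates a result object.
function massOf(i) {
  return pool[i * FIELDS + 2];
}

setParticle(0, 1.5, -2.0, 10.0);
console.log(massOf(0));  // 10
```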
don't get me wrong, I'm 100% picking up what you're putting down; you clearly demonstrated a better future than workers + shared array buffers. Is there something intrinsically incompatible between EMS and buffers? If not, then we should add them, and do you have any advice to help me patch the existing codebase?
BTW, I have ems built & working for:
electron: "4.0.0-nightly.20181010"
chrome: "69.0.3497.106"
modules: "64"
napi: "3"
node: "10.11.0"
v8: "6.9.427.24"
it's super impressive, you're the man. I'm shocked this is possible and not the de facto approach... should I submit a PR?
From the description of your use case it doesn't sound like EMS brings anything to the table, and of course encoding binary data as a base64 string would be unwanted overhead. EMS presently exposes JSON data types, for which copy-in/copy-out from the JS runtime is, by definition, unavoidable. The EMS implementation already stores arbitrary byte vectors, so adding the EMS_TYPE_BINARY you describe would be straightforward and nearly identical to the EMS_TYPE_STRING implementation.
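A rough sketch of how such a type might look from JS once added; this is hypothetical (no binary type exists today), reusing the existing creation options and read/write-style calls:

```js
const fs = require('fs');
const ems = require('ems')(1);

// Hypothetical: an EMS array whose values are raw byte vectors (EMS_TYPE_BINARY).
const frames = ems.new({
  dimensions: [1024],
  heapSize: 256 * 1024 * 1024,
  useMap: true,
  filename: '/tmp/frames.ems'
});

const png = fs.readFileSync('frame0001.png');   // Buffer
frames.writeXF('frame0001', png);               // bytes copied into the EMS heap as-is
const copy = frames.readFF('frame0001');        // returned as a Buffer: no base64, no JSON
```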
Resizing a TypedArray buffer is another matter, as it would need to be done by EMS, not JS, in order to be parallel-safe. Beyond that, JS is free to relocate TypedArray buffers just like any other heap data, meaning they are not safe for parallel access. SharedArrayBuffers are documented as unstable, and progress on defining a parallel memory model has been stalled for years.
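For contrast, the most JS offers natively is a SharedArrayBuffer plus Atomics, and that only spans worker threads inside a single process, not separate processes or restarts; a sketch, assuming Node's worker_threads module:

```js
// Sharing within one process only: a SharedArrayBuffer handed to a worker thread.
const { Worker } = require('worker_threads');

const sab = new SharedArrayBuffer(4 * Int32Array.BYTES_PER_ELEMENT);
const counters = new Int32Array(sab);

const worker = new Worker(
  `const { workerData } = require('worker_threads');
   const view = new Int32Array(workerData);
   Atomics.add(view, 0, 1);                 // race-free increment`,
  { eval: true, workerData: sab }
);

worker.on('exit', () => {
  console.log(Atomics.load(counters, 0));   // 1
});
```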
If you're "shocked" EMS is not part of Node Workers, imagine how I feel about Node Workers' authors deleting my recommendations from their request for comments.
If you submit a PR that converts EMS to NAPI, and/or a PR that adds EMS_TYPE_BINARY, I would be happy to merge it.
wondering if it's within reasonable effort to read/write TypedArrays/Buffers? I can see everything goes through JSON stringify & parse; is this a hard constraint or just a convention?
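what I mean concretely (assuming `arr` is an EMS array created with `useMap: true`; illustrative only):

```js
// Every object value appears to round-trip through JSON: serialized on write,
// re-parsed into a fresh allocation on every read.
const original = { id: 7, pixels: [255, 0, 128] };
arr.write('obj', original);              // JSON.stringify on the way in

const copy = arr.read('obj');            // JSON.parse on the way out
console.log(copy === original);                                   // false: a new object each read
console.log(JSON.stringify(copy) === JSON.stringify(original));   // true: structurally equal
```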
@mogill, are you implying here https://github.com/SyntheticSemantics/ems/issues/20 that there's something inherently wrong with buffers in Node, or just with SharedArrayBuffers? EMS far outshines anything possible with master-worker shared memory; would love to hear a few more of your thoughts on all this...