Closed bmaranville closed 2 months ago
I did test this code out, and it works: I am able to load a local file and read values, keys, attributes, etc. with h5wasm, without reading the whole file contents into memory.
I don't know how to bundle a worker script including all dependencies (it seems kind of tricky), which is why it's written in plain JS with importScripts above.
NOTE: The esm build of h5wasm (dist/esm/hdf5_hl.js) already has WORKERFS support built in. (The iife version used in the worker above is built by further bundling the esm build.)
Thanks for opening an issue to track this! :100:
I don't know how to bundle a worker script including all dependencies (it seems kind of tricky), which is why it's written in plain JS with importScripts above.
Yeah last time I tried, this is where I hit a wall. Vite has improved a lot since then, so I'll try again asap.
Yes, I think things have improved! I replaced the importScripts() above with regular imports, and I was able to build a functioning worker bundle with this esbuild invocation:
esbuild --format=esm --bundle worker.ts > worker_bundle.js
// h5wasm_worker.ts
import * as Comlink from "comlink";
import h5wasm from "h5wasm";
After a whole morning of hair pulling: when using import instead of importScripts inside the worker, the promises returned by the Comlink-proxied functions never resolve... This is completely silent; Vite's dev server doesn't show any error and there are no errors in the browser console. I found a way around the issue by passing { type: "module" } to the Worker constructor. Vite's documentation on Web Workers seems to say that import can be used inside workers regardless of { type: "module" }, so I'm not sure what's going on. The problem is that { type: "module" } remains in the build output and Firefox added support for it only in v114, which is quite recent.
Yes, I think I found that { type: 'module' } was important as well when using a worker with import statements in it.
On the other hand, once the worker is bundled with esbuild or other as above, it no longer contains any import statements, and is usable in any context I would think.
For electron-like apps (VS Code?) I imagine you can use { type: 'module' }, and maybe use a bundled worker for more general browser contexts (for now?)
I have been playing around with this... would it be useful to include a special build of h5wasm that uses a worker in the main h5wasm package?
Here is a setup that works:
// lib_worker.ts
import * as h5wasm from 'h5wasm';

const WORKERFS_MOUNT = '/workerfs';

// Expose a user-selected File inside the WORKERFS mount without copying its contents.
async function save_to_workerfs(file: File) {
  const { FS, WORKERFS, mount } = await workerfs_promise;
  const { name: filename } = file;
  const output_path = `${WORKERFS_MOUNT}/${filename}`;
  if (FS.analyzePath(output_path).exists) {
    console.warn(`File ${output_path} already exists. Overwriting...`);
  }
  // File.lastModifiedDate is deprecated; build the Date from lastModified instead.
  WORKERFS.createNode(mount, filename, WORKERFS.FILE_MODE, 0, file, new Date(file.lastModified));
  return output_path;
}

// Create the mount point (if needed) and mount an empty WORKERFS on it.
async function _mount_workerfs() {
  const { FS } = await h5wasm.ready;
  const { filesystems: { WORKERFS } } = FS;
  if (!FS.analyzePath(WORKERFS_MOUNT).exists) {
    FS.mkdir(WORKERFS_MOUNT);
  }
  const mount = FS.mount(WORKERFS, {}, WORKERFS_MOUNT);
  return { FS, WORKERFS, mount };
}

// Mount once, when the worker starts; save_to_workerfs awaits this promise.
const workerfs_promise = _mount_workerfs();

export const api = {
  ready: h5wasm.ready,
  save_to_workerfs,
  H5WasmFile: h5wasm.File,
  Dataset: h5wasm.Dataset,
  Group: h5wasm.Group,
  Datatype: h5wasm.Datatype,
  BrokenSoftLink: h5wasm.BrokenSoftLink,
}
// worker.ts
import * as Comlink from 'comlink';
import { api } from './lib_worker';
Comlink.expose(api);
// worker_proxy.ts
import * as Comlink from 'comlink';
import type { api } from './lib_worker.ts';
import { ACCESS_MODES } from './hdf5_hl.ts';
import type { File as H5WasmFile, Group, Dataset, Datatype, BrokenSoftLink } from './hdf5_hl.ts';
export type { H5WasmFile, Group, Dataset, Datatype, BrokenSoftLink };

type ACCESS_MODESTRING = keyof typeof ACCESS_MODES;

const worker = new Worker('./worker.js');
const remote = Comlink.wrap(worker) as Comlink.Remote<typeof api>;

export class GroupProxy {
  proxy: Comlink.Remote<Group>;
  file_id: bigint;

  constructor(proxy: Comlink.Remote<Group>, file_id: bigint) {
    this.proxy = proxy;
    this.file_id = file_id;
  }

  async keys() {
    return await this.proxy.keys();
  }

  async paths() {
    return await this.proxy.paths();
  }

  async get(name: string = "/") {
    const dumb_obj = await this.proxy.get(name);
    // convert to a proxy of the object:
    if (dumb_obj?.type === "Group") {
      const new_group_proxy = await new remote.Group(dumb_obj.file_id, dumb_obj.path);
      return new GroupProxy(new_group_proxy, this.file_id);
    }
    else if (dumb_obj?.type === "Dataset") {
      return new remote.Dataset(dumb_obj.file_id, dumb_obj.path);
    }
    else if (dumb_obj?.type === "Datatype") {
      return new remote.Datatype();
    }
    else if (dumb_obj?.type === "BrokenSoftLink") {
      return new remote.BrokenSoftLink(dumb_obj?.target);
    }
    return
  }
}

export class FileProxy extends GroupProxy {
  filename: string;
  mode: ACCESS_MODESTRING;

  constructor(proxy: Comlink.Remote<H5WasmFile>, file_id: bigint, filename: string, mode: ACCESS_MODESTRING = 'r') {
    super(proxy, file_id);
    this.filename = filename;
    this.mode = mode;
  }
}

export async function get_file_proxy(filename: string, mode: ACCESS_MODESTRING = 'r') {
  const file_proxy = await new remote.H5WasmFile(filename, mode);
  const file_id = await file_proxy.file_id;
  return new FileProxy(file_proxy, file_id, filename, mode);
}
export async function save_file(file: File) {
  const { name, size } = file;
  // Log the size in bytes (the original interpolated lastModified here by mistake).
  console.log(`Saving file ${name} of size ${size} to workerfs...`);
  return await remote.save_to_workerfs(file);
}
Which is then built with these two esbuild commands:
npx esbuild --format=esm --bundle worker.ts > worker.js;
npx esbuild --format=esm --bundle worker_proxy.ts > worker_proxy.mjs;
The resulting library can be used by importing { save_file, get_file_proxy } from worker_proxy.mjs, then calling save_file with a user-selected File object from a file input (providing random access to that file without reading it first), and getting a FileProxy object with get_file_proxy(filename). Async get (retrieve proxy to Dataset, another Group, etc.) works as expected, and once you have e.g. a Dataset proxy, await dset_proxy.value returns the value.
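As a rough illustration of the workflow just described, the sketch below chains the two exported functions. The WorkerApi interface is a hand-written stand-in for the shape of worker_proxy.mjs as described in this thread, list_root_keys is a hypothetical helper, and passing the path returned by save_file on to get_file_proxy is an assumption based on save_to_workerfs returning the WORKERFS path.

```typescript
// Minimal structural types standing in for the exports of worker_proxy.mjs (assumed shape).
interface FileProxyLike {
  keys(): Promise<string[]>;
}
interface WorkerApi<F> {
  save_file(file: F): Promise<string>; // resolves to the path inside WORKERFS
  get_file_proxy(path: string): Promise<FileProxyLike>;
}

// Hypothetical helper: expose a user-selected file to the worker and list its root keys.
async function list_root_keys<F>(api: WorkerApi<F>, file: F): Promise<string[]> {
  const path = await api.save_file(file); // registers the file; its contents are not read here
  const h5file = await api.get_file_proxy(path);
  return h5file.keys();
}
```

In a page, api would be the imported module namespace and file would come from the files list of a file input's change event.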
The reason there are three files instead of two is that it's difficult to build lib_worker.ts into a worker directly with export { api } at the bottom, but you really want that export so you can use the types in worker_proxy.ts, so then worker.ts is just a thin wrapper that exposes api.
Wow, this is brilliant! Definitely a nice approach.
I'll try to get my head around it a bit more to understand how this will fit into the existing H5Web provider code (notably with loading compression plugins) but either way, we can iterate. I'm planning on providing a separate provider, maybe H5WasmLocalFileProvider, to give us time to experiment and make for a smoother transition. We can then imagine another H5WasmRemoteFileProvider that would perform range HTTP requests with a similar set-up.
I moved these files to https://github.com/usnistgov/h5wasm/pull/70 and I added a method for writing bytes to a MEMFS file, which I used for loading a plugin in the example code there. It's still a bit awkward, and I'm realizing there's really no use case for interacting with the filesystems within the worker except to load files and plugins, so I might rejigger my API so that e.g. a save function returns an H5WasmFile proxy instead of just a file path on success, and it would make sense to create a few API functions for loading plugin files and maybe listing the contents of the plugin folder.
FYI, this is still on my mind. Last time I played with h5wasm-worker and tried to create an H5WasmLocalFileApi, I ran into a wall: async calls to the comlink-wrapped web worker were not resolving; the promises would remain in pending state; no errors, nothing... I need another approach, I think: maybe starting from your code examples and building up.
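Because a stalled Comlink call neither resolves nor rejects, one way to at least surface the hang while debugging is to race the call against a timer. The helper below is a generic sketch and not something used in this thread; with_timeout and its parameters are hypothetical names.

```typescript
// Hypothetical debugging aid: reject if `promise` has not settled within `ms` milliseconds.
function with_timeout<T>(promise: Promise<T>, ms: number, label = "worker call"): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} still pending after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Wrapping a proxied call, for example with_timeout(remote.save_to_workerfs(file), 5000, "save_to_workerfs"), turns a silent hang into a rejected promise that names the call. It does not explain why the worker never answered.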
Is your code in a branch somewhere? I'd be happy to help debug.
With some fresh eyes, I think it may be caused by h5wasm being bundled into h5wasm-worker. I had moved h5wasm to peerDependencies in h5wasm-worker but forgot to also configure ESBuild to mark it as external. Can't believe this is not done automatically :sob: I'll report back.
Hmm but it has to be bundled into h5wasm-worker, that's the whole point. Maybe somehow I'm using h5wasm directly in H5WasmLocalFileApi instead of via h5wasm-worker...
I've opened a test branch on the h5wasm-worker repo with a basic index.html and index.js file. As you'll see in my comments, execution doesn't go past the first await statement. To reproduce: npm install, npm run build and npx serve. Am I missing something?
The problem seems to come from esbuild-plugin-inline-worker. If I compile and load the worker "manually", it works fine (cf. branch test-2).
I've made progress back in this repo by embracing new Worker(new URL(...), { type: "module" }). With this syntax, the promises of the wrapped TS worker resolve as expected. :tada: I guess I'll just have to add some sort of feature detection to make sure H5Web consumers can easily fall back to the MEMFS implementation.
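The feature detection mentioned here could look something like the sketch below. It is not taken from H5Web: it relies on a commonly cited trick of passing an options object whose type getter records whether the Worker constructor reads it. The name supports_module_workers and the injectable constructor parameter are assumptions made for illustration and testing, and the result has not been verified against specific browser versions, so treat it as a hint.

```typescript
// Structural stand-ins so the sketch does not depend on DOM typings.
type WorkerLike = { terminate(): void };
type WorkerCtor = new (url: string, options?: { type?: string }) => WorkerLike;

// Hypothetical helper: reports whether the Worker constructor looks at `options.type`.
function supports_module_workers(
  WorkerImpl: WorkerCtor | undefined = (globalThis as any).Worker,
): boolean {
  if (typeof WorkerImpl !== "function") {
    return false; // no Worker support at all
  }
  let type_was_read = false;
  const options = {
    get type() {
      type_was_read = true; // set only if the constructor reads the property
      return "module";
    },
  };
  try {
    // An empty data: URL avoids a network request; the worker is terminated immediately.
    new WorkerImpl("data:,", options).terminate();
  } catch {
    // The constructor may throw for this URL; what matters is whether the getter ran.
  }
  return type_was_read;
}
```

A provider could call this once at start-up and choose the MEMFS code path when it returns false.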
I'll push forward, taking inspiration from what you've done in h5wasm-worker. I have a feeling that typing strictness is going to be a challenge with the proposed API, but I'll report back once I've made more progress.
Yes - you have found the issue. I was about to write back to you. In fact, I think we can remove the reference to import.meta in the h5wasm module which will remove the need for {type: "module"}, using a compile flag for emscripten.
Can you try now, without the {type: "module"} option? I published a new version of h5wasm and updated the dependency in h5wasm-worker.
I rebased the test branch in h5wasm-worker and it worked. :tada: Unfortunately, I still ran into pending promises when I tried h5wasm-worker in H5Web...
Then, I implemented a very dumb worker in H5Web with new Worker(new URL(...), { type: "module" }). Even with h5wasm@0.7.3, I still seem to need { type: 'module' } :shrug:
With this observation in hand, I tried once again to turn the existing H5WasmApi into an async H5WasmLocalFileApi of sorts, with a worker implementation similar to the one in h5wasm-worker... but I still ran into pending promises... :disappointed:
To make debugging easier, I decided to remove as many layers of abstraction as I could and started implementing a worker from scratch that works directly, and solely, with the H5Module object returned by await ready. I was making good progress until I once again ran into pending promises. Turns out this was caused by a dumb utility import from the @h5web/app package in the worker file. Everything worked fine after I moved the utility to the @h5web/shared package! :tada:
I'm not 100% sure I understand why, and I find it mind-boggling that Vite was not warning me about this import somehow. Anyway, I think the approach of using the low-level H5Module methods in the worker is sound, so I will push forward with it and keep you updated.
After a few fixes (#1615 #1614), I can now confirm that H5WasmLocalFileProvider works as expected in myHDF5. I was able to instantly open a 5 GB file without problem ... even in FF 78 ESR! So my fear that the worker wouldn't work in FF < 114 was fortunately unfounded. :tada:
I'll try to do the releases and upgrades asap.
Is your feature request related to a problem?
Currently, for the h5wasm provider, the entire file must be loaded into memory before use (it is written to the MEMFS virtual file system provided by Emscripten).
This puts an upper limit of 2 GB (?) on the size of files that can be used with the h5wasm provider, and can cause memory issues for users (entire file in memory).
The only advantage of this system is that file access (once loaded) is very very fast.
Requested solution or feature
Use the WORKERFS Emscripten file system and a web-worker-based h5wasm provider, which allows random access to files on the user's computer without loading the entire thing into memory.
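For reference, Emscripten's WORKERFS accepts File objects at mount time through the files mount option. The sketch below assumes an FS object like the one h5wasm.ready resolves with; the FSLike interface, the mount_local_files name and the /work mount point are illustrative assumptions, and WORKERFS itself is only usable inside a web worker.

```typescript
// Minimal structural view of the Emscripten FS calls used here (assumed, not h5wasm's own typings).
interface FSLike {
  filesystems: { WORKERFS: unknown };
  analyzePath(path: string): { exists: boolean };
  mkdir(path: string): unknown;
  mount(type: unknown, opts: { files?: unknown[] }, mountpoint: string): unknown;
}

// Hypothetical helper: mount user-selected files under `mount_point` and
// return the paths at which they can then be opened.
function mount_local_files(
  FS: FSLike,
  files: Array<{ name: string }>,
  mount_point = "/work",
): string[] {
  if (!FS.analyzePath(mount_point).exists) {
    FS.mkdir(mount_point);
  }
  // WORKERFS reads from the File objects on demand instead of copying them into memory.
  FS.mount(FS.filesystems.WORKERFS, { files }, mount_point);
  return files.map((file) => `${mount_point}/${file.name}`);
}
```

Each returned path could then be passed to the h5wasm File constructor in read mode inside the same worker.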
Alternatives you've considered
The new File System Access API could also solve this problem, where users could mount a local folder for working with and have random access to the files in that folder. This API is only fully implemented on Chrome-based browsers, however.
Additional context
Here is an example worker:
and here is example client code for interacting with the worker: