silx-kit / h5web

React components for data visualization and exploration
https://h5web.panosc.eu/
MIT License

Webworker h5wasm provider (for random access to local files) #1582

Closed bmaranville closed 2 months ago

bmaranville commented 4 months ago

Is your feature request related to a problem?

Currently, with the h5wasm provider, the entire file must be loaded into memory before use (it is written to the MEMFS virtual file system provided by Emscripten).

This puts an upper limit of 2GB (?) on the size of files that can be used with the h5wasm provider, and can cause memory issues for users (entire file in memory).

The only advantage of this system is that file access (once loaded) is very very fast.

Requested solution or feature

Use the WORKERFS Emscripten file system and a webworker-based h5wasm provider, which allows random access to files on the user's computer without loading the entire file into memory.
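To illustrate what "random access" buys here, the following is a minimal conceptual sketch and not Emscripten's actual code. As far as I understand, WORKERFS serves each read by slicing the underlying File/Blob, so only the requested byte range is touched. The `ByteSource` interface, `readRange` and `CountingSource` are hypothetical names made up for this example, and an in-memory array stands in for the user's file.

```typescript
// Hypothetical minimal shape of a sliceable source (a File/Blob plays this role in a browser).
interface ByteSource {
  readonly size: number;
  slice(start: number, end: number): Uint8Array;
}

// Read up to `length` bytes starting at `position`, clamped to the end of the source.
function readRange(source: ByteSource, position: number, length: number): Uint8Array {
  if (position >= source.size) {
    return new Uint8Array(0);
  }
  const end = Math.min(position + length, source.size);
  return source.slice(position, end);
}

// In-memory stand-in that records how many bytes were actually requested.
class CountingSource implements ByteSource {
  bytesRequested = 0;
  private readonly data: Uint8Array;

  constructor(data: Uint8Array) {
    this.data = data;
  }

  get size(): number {
    return this.data.length;
  }

  slice(start: number, end: number): Uint8Array {
    this.bytesRequested += end - start;
    return this.data.subarray(start, end);
  }
}

// A 1000-byte "file" whose byte at index i is i % 256.
const source = new CountingSource(new Uint8Array(1000).map((_, i) => i % 256));

// Ask for 64 bytes near the end; only the 10 bytes that exist are read.
const chunk = readRange(source, 990, 64);
console.log(chunk.length, source.bytesRequested);
```

With MEMFS, by contrast, all 1000 bytes would have to be copied into memory before the first read.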

Alternatives you've considered

The new File System Access API could also solve this problem: users could mount a local folder to work with and have random access to the files in that folder. This API is only fully implemented on Chrome-based browsers, however.
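Because support is browser-dependent, any use of this alternative would need a feature check. A minimal sketch, assuming detection via the presence of `showDirectoryPicker` on the global scope; `hasDirectoryPicker`, `chooseLocalAccess` and the two strategy labels are hypothetical names for this example only.

```typescript
// Only the property we probe for; in a browser, `globalThis` would be passed in.
interface GlobalLike {
  showDirectoryPicker?: unknown;
}

// True when the directory picker of the File System Access API is exposed.
function hasDirectoryPicker(scope: GlobalLike): boolean {
  return typeof scope.showDirectoryPicker === "function";
}

// Hypothetical strategy labels: mount a folder when possible, otherwise use a file input.
function chooseLocalAccess(scope: GlobalLike): "directory-picker" | "file-input" {
  return hasDirectoryPicker(scope) ? "directory-picker" : "file-input";
}

// Fake global scopes standing in for a Chromium-based browser and any other browser.
const chromiumLikeScope: GlobalLike = { showDirectoryPicker: () => undefined };
const otherBrowserScope: GlobalLike = {};

const chromiumChoice = chooseLocalAccess(chromiumLikeScope);
const otherChoice = chooseLocalAccess(otherBrowserScope);
console.log(chromiumChoice, otherChoice);
```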

Additional context

Here is an example worker:

// h5wasm_worker.js
importScripts("https://unpkg.com/comlink/dist/umd/comlink.js");
importScripts("https://cdn.jsdelivr.net/npm/h5wasm@0.7.1/dist/iife/h5wasm.js")

const WORKING_DIRECTORY = '/working_directory';

async function load_file(file) {
  // file is of type File
  const { FS } = await h5wasm.ready;
  const { filesystems: { WORKERFS } } = FS;
  const { name: filename, size } = file;
  if (!FS.analyzePath(WORKING_DIRECTORY).exists) {
    FS.mkdir(WORKING_DIRECTORY);
  }
  if (api.file !== null) {
    api.file.close();
    // only use a single file at a time;
    // unmount the previous filesystem
    FS.unmount(WORKING_DIRECTORY);
  }

  FS.mount(WORKERFS, { files: [ file ] }, WORKING_DIRECTORY);
  const h5wasmFile = new h5wasm.File(`${WORKING_DIRECTORY}/${filename}`);
  api.file = h5wasmFile;
  return;
}

async function get_entity(path = '/') {
  if (api.file === null) {
    return null;
  }
  return api.file.get(path);
}

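async function get_type(path) {
  const entity = await get_entity(path);
  // get_entity can return null (e.g. no file loaded yet)
  if (entity === null) {
    return null;
  }
  return entity.type;
}

async function get_attrs(path) {
  const entity = await get_entity(path);
  if (entity === null) {
    return null;
  }
  const attrs = entity.attrs;
  return attrs;
}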

// Group functions:
async function get_keys(group_path) {
  const entity = await get_entity(group_path);
  if (entity === null) {
    return null;
  }
  // assert entity instanceof h5wasm.Group;
  const keys = entity.keys();
  return keys;
}

// Dataset functions:
async function get_value (dataset_path) {
  const entity = await get_entity(dataset_path);
  if (entity === null) {
    return null;
  }
  // assert entity instanceof h5wasm.Dataset;
  const value = entity.value;
  return value;
}

const api = {
  ready: h5wasm.ready,
  file: null,
  load_file,
  get_entity,
  get_type,
  get_attrs,
  get_keys,
  get_value,
}
Comlink.expose(api);

and here is example client code for interacting with the worker:

<script type="module">
  import * as Comlink from "https://unpkg.com/comlink/dist/esm/comlink.mjs";

  async function init() {
    const worker = new Worker("h5wasm_worker.js");
    const h5wasm_proxy = Comlink.wrap(worker);
    const file = document.getElementById("file");
    file.addEventListener("change", async (event) => {
      const file = event.target.files[0];
      await h5wasm_proxy.load_file(file);
      // example api call on file:
      const keys = await h5wasm_proxy.get_keys("/");
      console.log({keys});
    });
  }
  init();
</script>

bmaranville commented 4 months ago

I did test this code out, and it works - I am able to load a local file and read values, keys, attributes, etc. with h5wasm, without reading the whole file contents into memory.

I don't know how to bundle a worker script including all dependencies (it seems kind of tricky), which is why it's written in plain JS with importScripts above.

NOTE: The esm build of h5wasm (dist/esm/hdf5_hl.js) already has WORKERFS support built in. (The iife version used in the worker above is built by further bundling the esm build.)

axelboc commented 4 months ago

Thanks for opening an issue to track this! :100:

> I don't know how to bundle a worker script including all dependencies (it seems kind of tricky), which is why it's written in plain JS with importScripts above.

Yeah last time I tried, this is where I hit a wall. Vite has improved a lot since then, so I'll try again asap.

bmaranville commented 4 months ago

Yes, I think things have improved! I replaced the importScripts() above with regular imports, and I was able to build a functioning worker bundle with this esbuild invocation: esbuild --format=esm --bundle worker.ts > worker_bundle.js

// h5wasm_worker.ts
import * as Comlink from "comlink";
import h5wasm from "h5wasm";

axelboc commented 4 months ago

After a whole morning of hair pulling:

  1. I have not yet been able to import a worker written in TS. When I do so, Vite's dev server serves it with an empty mime type for some reason, and it gets blocked by the browser. Perhaps I need to configure Vite somehow to support TS workers but I can't seem to find any mention of this in the documentation, or any mention of this mime type issue anywhere else.
  2. If I use import instead of importScripts inside the worker, the promises returned by the Comlink-proxied functions never resolve... This is completely silent; Vite's dev server doesn't show any error and there are no errors in the browser console. I found a way around the issue by instantiating the Worker constructor with { type: "module" }. Vite's documentation on Web Workers seems to say that import can be used inside workers regardless of { type: "module" }, so I'm not sure what's going on. The problem is that { type: "module" } remains in the build output and Firefox added support for it only in v114, which is quite recent.

bmaranville commented 4 months ago

Yes, I think I found that { type: 'module' } was important as well when using a worker with import statements in it.

On the other hand, once the worker is bundled with esbuild or another bundler as above, it no longer contains any import statements, and should be usable in any context, I would think.

For electron-like apps (VS Code?) I imagine you can use { type: 'module' }, and maybe use a bundled worker for more general browser contexts (for now?)

bmaranville commented 4 months ago

I have been playing around with this... would it be useful to include a special build of h5wasm that uses a worker in the main h5wasm package?

Here is a setup that works:

// lib_worker.ts
import * as h5wasm from 'h5wasm';

const WORKERFS_MOUNT = '/workerfs';

async function save_to_workerfs(file) {
  const { FS, WORKERFS, mount } = await workerfs_promise;
  const { name: filename } = file;
  const output_path = `${WORKERFS_MOUNT}/${filename}`;
  if (FS.analyzePath(output_path).exists) {
    console.warn(`File ${output_path} already exists. Overwriting...`);
  }
  // File.lastModifiedDate is deprecated and not available in all browsers;
  // build the Date from File.lastModified instead
  WORKERFS.createNode(mount, filename, WORKERFS.FILE_MODE, 0, file, new Date(file.lastModified));
  return output_path;
}

async function _mount_workerfs() {
  const { FS } = await h5wasm.ready;
  const { filesystems: { WORKERFS } } = FS;
  if (!FS.analyzePath(WORKERFS_MOUNT).exists) {
    FS.mkdir(WORKERFS_MOUNT);
  }
  const mount = FS.mount(WORKERFS, {}, WORKERFS_MOUNT);
  return { FS, WORKERFS, mount };
}

const workerfs_promise = _mount_workerfs();

export const api = {
  ready: h5wasm.ready,
  save_to_workerfs,
  H5WasmFile: h5wasm.File,
  Dataset: h5wasm.Dataset,
  Group: h5wasm.Group,
  Datatype: h5wasm.Datatype,
  BrokenSoftLink: h5wasm.BrokenSoftLink,
}

// worker.ts
import * as Comlink from 'comlink'; 
import { api } from './lib_worker';

Comlink.expose(api);

// worker_proxy.ts
import * as Comlink from 'comlink';
import type { api } from './lib_worker.ts';

import { ACCESS_MODES } from './hdf5_hl.ts';
import type { File as H5WasmFile, Group, Dataset, Datatype, BrokenSoftLink } from './hdf5_hl.ts';
export type { H5WasmFile, Group, Dataset, Datatype, BrokenSoftLink };

type ACCESS_MODESTRING = keyof typeof ACCESS_MODES;

const worker = new Worker('./worker.js');
const remote = Comlink.wrap(worker) as Comlink.Remote<typeof api>;

export class GroupProxy {
  proxy: Comlink.Remote<Group>;
  file_id: bigint;
  constructor(proxy: Comlink.Remote<Group>, file_id: bigint) {
    this.proxy = proxy;
    this.file_id = file_id;
  }

  async keys() {
    return await this.proxy.keys();
  }

  async paths() {
    return await this.proxy.paths();
  } 

  async get(name: string = "/") {
    const dumb_obj = await this.proxy.get(name);
    // convert to a proxy of the object:
    if (dumb_obj?.type === "Group") {
      const new_group_proxy = await new remote.Group(dumb_obj.file_id, dumb_obj.path);
      return new GroupProxy(new_group_proxy, this.file_id);
    }
    else if (dumb_obj?.type === "Dataset") {
      return new remote.Dataset(dumb_obj.file_id, dumb_obj.path);
    }
    else if (dumb_obj?.type === "Datatype") {
      return new remote.Datatype();
    }
    else if (dumb_obj?.type === "BrokenSoftLink") {
      return new remote.BrokenSoftLink(dumb_obj?.target);
    }
    return 
  }
}

export class FileProxy extends GroupProxy {
  filename: string;
  mode: ACCESS_MODESTRING;
  constructor(proxy: Comlink.Remote<H5WasmFile>, file_id: bigint, filename: string, mode: ACCESS_MODESTRING = 'r') {
    super(proxy, file_id);
    this.filename = filename;
    this.mode = mode;
  }
}

export async function get_file_proxy(filename: string, mode: ACCESS_MODESTRING = 'r') {
  const file_proxy = await new remote.H5WasmFile(filename, mode);
  const file_id = await file_proxy.file_id;
  return new FileProxy(file_proxy, file_id, filename, mode);
}

export async function save_file(file: File) {
  const { name, size } = file;
  console.log(`Saving file ${name} of size ${size} to workerfs...`);
  return await remote.save_to_workerfs(file);
}

Which is then built with these two esbuild commands:

npx esbuild --format=esm --bundle worker.ts > worker.js;
npx esbuild --format=esm --bundle worker_proxy.ts > worker_proxy.mjs;

The resulting library is used by importing { save_file, get_file_proxy } from worker_proxy.mjs. Call save_file with a user-selected File object from a file input (this provides random access to that file without reading it first), then get a FileProxy object with get_file_proxy(filename). Async get (which retrieves a proxy to a Dataset, another Group, etc.) works as expected, and once you have e.g. a Dataset proxy, await dset_proxy.value returns the value.

The reason there are three files instead of two is that it's difficult to build lib_worker.ts into a worker directly with export { api } at the bottom, but you really want that export so you can use the types in worker_proxy.ts, so then worker.ts is just a thin wrapper that exposes api.
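To show why the exported type matters, here is a minimal sketch of how a `typeof api` export can drive the typing of the proxy side. It is a simplified stand-in and not Comlink's real implementation: `RemoteLike` and `wrapLocally` are hypothetical names, no worker is involved, and calls are merely deferred through a resolved promise to mimic the async boundary.

```typescript
// Hypothetical, simplified stand-in for Comlink.Remote<T>: every method becomes async.
// Non-method members are out of scope for this sketch.
type RemoteLike<T> = {
  [K in keyof T]: T[K] extends (...args: infer A) => infer R
    ? (...args: A) => Promise<Awaited<R>>
    : never;
};

// Hypothetical stand-in for Comlink.wrap(): forwards each call to `target` asynchronously.
function wrapLocally<T extends object>(target: T): RemoteLike<T> {
  return new Proxy(target, {
    get(obj, prop) {
      const member = Reflect.get(obj, prop);
      if (typeof member !== "function") {
        return undefined;
      }
      return (...args: unknown[]) =>
        Promise.resolve().then(() => member.apply(obj, args));
    },
  }) as unknown as RemoteLike<T>;
}

// Plays the role of the object exported from lib_worker.ts.
const api = {
  keys(path: string): string[] {
    return path === "/" ? ["entry"] : [];
  },
};

// The proxy side only needs `typeof api` to get full type checking.
const remote: RemoteLike<typeof api> = wrapLocally(api);

// Typed as Promise<string[]>; passing a number here would be a compile-time error.
const pendingKeys = remote.keys("/");
pendingKeys.then((keys) => console.log(keys));
```

Since only the type of `api` is needed on the proxy side, a type-only import is enough there, and the worker code itself stays out of the proxy bundle.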

axelboc commented 4 months ago

Wow, this is brilliant! Definitely a nice approach.

I'll try to get my head around it a bit more to understand how this will fit into the existing H5Web provider code (notably with loading compression plugins) but either way, we can iterate. I'm planning on providing a separate provider, maybe H5WasmLocalFileProvider, to give us time to experiment and make for a smoother transition. We can then imagine another H5WasmRemoteFileProvider that would perform range HTTP requests with a similar set-up.

bmaranville commented 4 months ago

I moved these files to https://github.com/usnistgov/h5wasm/pull/70 and added a method for writing bytes to a MEMFS file, which I used for loading a plugin in the example code there. It's still a bit awkward, and I'm realizing there's really no use case for interacting with the filesystems within the worker except to load files and plugins. So I might rejigger my API so that e.g. a save function returns an H5WasmFile proxy instead of just a file path on success. It would also make sense to create a few API functions for loading plugin files and maybe listing the contents of the plugin folder.

axelboc commented 3 months ago

FYI, this is still on my mind. Last time I played with h5wasm-worker and tried to create an H5WasmLocalFileApi, I ran into a wall: async calls to the comlink-wrapped web worker were not resolving; the promises would remain in pending state; no errors, nothing... I need another approach, I think: maybe starting from your code examples and building up.

bmaranville commented 3 months ago

Is your code in a branch somewhere? I'd be happy to help debug.

axelboc commented 3 months ago

With some fresh eyes, I think it may be caused by h5wasm being bundled into h5wasm-worker. I had moved h5wasm to peerDependencies in h5wasm-worker but forgot to also configure ESBuild to mark it as external. Can't believe this is not done automatically :sob: — I'll report back.


Hmm but it has to be bundled into h5wasm-worker, that's the whole point. Maybe somehow I'm using h5wasm directly in H5WasmLocalFileApi instead of via h5wasm-worker...

axelboc commented 3 months ago

I've opened a test branch on the h5wasm-worker repo with a basic index.html and index.js file. As you'll see in my comments, execution doesn't go past the first await statement. To reproduce: npm install, npm run build and npx serve. Am I missing something?


The problem seems to come from esbuild-plugin-inline-worker. If I compile and load the worker "manually", it works fine — cf. branch test-2.

axelboc commented 3 months ago

I've made progress back in this repo by embracing new Worker(new URL(...), { type: "module" }). With this syntax, the promises of the wrapped TS worker resolve as expected. :tada: I guess I'll just have to add some sort of feature detection to make sure H5Web consumers can easily fall back to the MEMFS implementation.
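One possible shape for such feature detection, as a hedged sketch only: a commonly used trick is to pass the Worker constructor an options object whose `type` property is a getter, and to check whether the browser reads it. `supportsModuleWorkers`, `WorkerLikeCtor` and the two fake constructors are hypothetical names for this example. In a browser, the real `Worker` constructor would be passed in, and a `false` result would mean falling back to the MEMFS implementation.

```typescript
// Minimal constructor shape; the real `Worker` would be passed in a browser.
type WorkerLikeCtor = new (
  url: string,
  options?: { type?: string },
) => { terminate(): void };

// Returns true when the constructor reads `options.type`,
// i.e. when the environment knows about the `type` option.
function supportsModuleWorkers(WorkerCtor: WorkerLikeCtor | undefined): boolean {
  if (typeof WorkerCtor !== "function") {
    return false;
  }
  let typeWasRead = false;
  const probe = {
    get type(): string {
      typeWasRead = true;
      return "module";
    },
  };
  try {
    // The dummy URL never loads; only the option lookup matters.
    new WorkerCtor("blob://", probe).terminate();
  } catch {
    // Construction may throw in some environments; the getter result is what counts.
  }
  return typeWasRead;
}

// Fake constructors standing in for an older and a newer browser.
class LegacyWorker {
  constructor(_url: string) {}
  terminate(): void {}
}

class ModernWorker {
  constructor(_url: string, options?: { type?: string }) {
    void options?.type; // a modern browser reads the option
  }
  terminate(): void {}
}

const legacySupport = supportsModuleWorkers(LegacyWorker);
const modernSupport = supportsModuleWorkers(ModernWorker);
const missingSupport = supportsModuleWorkers(undefined);
console.log({ legacySupport, modernSupport, missingSupport });
```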

I'll push forward, taking inspiration from what you've done in h5wasm-worker. I have a feeling that typing strictness is going to be a challenge with the proposed API, but I'll report back once I've made more progress.

bmaranville commented 3 months ago

Yes - you have found the issue. I was about to write back to you. In fact, I think we can remove the reference to import.meta in the h5wasm module which will remove the need for {type: "module"}, using a compile flag for emscripten.

bmaranville commented 3 months ago

Can you try now, without the {type: "module"} option? I published a new version of h5wasm and updated the dependency in h5wasm-worker

axelboc commented 3 months ago

I rebased the test branch in h5wasm-worker and it worked. :tada: Unfortunately, I still ran into pending promises when I tried h5wasm-worker in H5Web...

Then, I implemented a very dumb worker in H5Web with new Worker(new URL(...), { type: "module" }). Even with h5wasm@0.7.3, I still seem to need { type: 'module' } :shrug:

With this observation in hand, I tried once again to turn the existing H5WasmApi into an async H5WasmLocalFileApi of sorts, with a worker implementation similar to the one in h5wasm-worker... but I still ran into pending promises... :disappointed:

To make debugging easier, I decided to remove as many layers of abstraction as I could and started implementing a worker from scratch that works directly, and solely, with the H5Module object returned by await ready. I was making good progress until I once again ran into pending promises. Turns out this was caused by a dumb utility import from the @h5web/app package in the worker file. Everything worked fine after I moved the utility to the @h5web/shared package! :tada:

I'm not 100% sure I understand why, and I find it mind boggling that Vite was not warning me about this import somehow. Anyway, I think the approach of using the low-level H5Module methods in the worker is sound, so I will push forward with it and keep you updated.

axelboc commented 2 months ago

After a few fixes (#1615 #1614), I can now confirm that H5WasmLocalFileProvider works as expected in myHDF5. I was able to instantly open a 5 GB file without problem ... even in FF 78 ESR! So my fear that the worker wouldn't work in FF < 114 was fortunately unfounded. :tada:

I'll try to do the releases and upgrades asap.