microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Web] WebGPU and WASM Backends Unavailable within Service Worker #20876

Open ggaabe opened 1 month ago

ggaabe commented 1 month ago

Describe the issue

I'm running into issues trying to use the WebGPU or WASM backends inside a service worker (in a Chrome extension). More specifically, I'm attempting to use Phi-3 with transformers.js v3.

Every time I attempt this, I get the following error:

Uncaught (in promise) Error: no available backend found. ERR: [webgpu] 
TypeError: import() is disallowed on ServiceWorkerGlobalScope by the HTML specification. 
See https://github.com/w3c/ServiceWorker/issues/1356.

This is originating in the InferenceSession class in js/common/lib/inference-session-impl.ts.

More specifically, it's happening in this call:

const [backend, optionsWithValidatedEPs] = await resolveBackendAndExecutionProviders(options);

The implementation is in js/common/lib/backend-impl.ts, where tryResolveAndInitializeBackend fails to initialize any of the execution providers.

WebGPU is now supported in service workers, though; it is a recent change, so this should be feasible. (See the Chrome release notes.)

Additionally, here is an example browser extension from the mlc-ai/web-llm framework that implements WebGPU usage in service workers successfully: https://github.com/mlc-ai/web-llm/tree/main/examples/chrome-extension-webgpu-service-worker

Here is some further discussion on this new support from Google itself: https://groups.google.com/a/chromium.org/g/chromium-extensions/c/ZEcSLsjCw84/m/WkQa5LAHAQAJ

So technically I think this should be possible to support now, unless I'm doing something else glaringly wrong. Is it possible to add support for this?
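For what it's worth, a quick check along these lines (a minimal sketch; the function name and logging are just illustrative) can confirm that the WebGPU API itself is reachable from the extension's service worker:

// Minimal sketch: check whether WebGPU is exposed inside the extension's
// service worker (requires a Chrome version that enables WebGPU in workers).
async function checkWebGpuInWorker() {
  if (!("gpu" in navigator)) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}

checkWebGpuInWorker().then((ok) =>
  console.log("WebGPU available in service worker:", ok)
);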

To reproduce

Download and set up the transformers.js extension example and put this into the background.js file:

// background.js - Handles requests from the UI, runs the model, then sends back a response

import {
  pipeline,
  env,
  AutoModelForCausalLM,
  AutoTokenizer,
  TextStreamer,
  StoppingCriteria,
} from "@xenova/transformers";

// Skip initial check for local models, since we are not loading any local models.
env.allowLocalModels = false;

// Due to a bug in onnxruntime-web, we must disable multithreading for now.
// See https://github.com/microsoft/onnxruntime/issues/14445 for more information.
env.backends.onnx.wasm.numThreads = 1;

class CallbackTextStreamer extends TextStreamer {
  constructor(tokenizer, cb) {
    super(tokenizer, {
      skip_prompt: true,
      skip_special_tokens: true,
    });
    this.cb = cb;
  }

  on_finalized_text(text) {
    this.cb(text);
  }
}

class InterruptableStoppingCriteria extends StoppingCriteria {
  constructor() {
    super();
    this.interrupted = false;
  }

  interrupt() {
    this.interrupted = true;
  }

  reset() {
    this.interrupted = false;
  }

  _call(input_ids, scores) {
    return new Array(input_ids.length).fill(this.interrupted);
  }
}

const stopping_criteria = new InterruptableStoppingCriteria();

async function hasFp16() {
  try {
    const adapter = await navigator.gpu.requestAdapter();
    return adapter.features.has("shader-f16");
  } catch (e) {
    return false;
  }
}

class PipelineSingleton {
  static task = "feature-extraction";
  static model_id = "Xenova/Phi-3-mini-4k-instruct_fp16";
  static model = null;
  static instance = null;

  static async getInstance(progress_callback = null) {
    this.model_id ??= (await hasFp16())
      ? "Xenova/Phi-3-mini-4k-instruct_fp16"
      : "Xenova/Phi-3-mini-4k-instruct";

    this.tokenizer ??= AutoTokenizer.from_pretrained(this.model_id, {
      legacy: true,
      progress_callback,
    });

    this.model ??= AutoModelForCausalLM.from_pretrained(this.model_id, {
      dtype: "q4",
      device: "webgpu",
      use_external_data_format: true,
      progress_callback,
    });

    return Promise.all([this.tokenizer, this.model]);
  }
}

// Create generic classify function, which will be reused for the different types of events.
const classify = async (text) => {
  // Get the pipeline instance. This will load and build the model when run for the first time.
  const [tokenizer, model] = await PipelineSingleton.getInstance((data) => {
    // You can track the progress of the pipeline creation here.
    // e.g., you can send `data` back to the UI to indicate a progress bar
    console.log("progress", data);
    // `data` logs like this:
    /**
     * {
     *   "status": "progress",
     *   "name": "Xenova/Phi-3-mini-4k-instruct_fp16",
     *   "file": "onnx/model_q4.onnx",
     *   "progress": 99.80381792394503,
     *   "loaded": 836435968,
     *   "total": 838080131
     * }
     *
     * When complete, the last status will be 'done'.
     */
  });
  /////////////
  const inputs = tokenizer.apply_chat_template(text, {
    add_generation_prompt: true,
    return_dict: true,
  });

  let startTime;
  let numTokens = 0;
  const cb = (output) => {
    startTime ??= performance.now();

    let tps;
    if (numTokens++ > 0) {
      tps = (numTokens / (performance.now() - startTime)) * 1000;
    }
    self.postMessage({
      status: "update",
      output,
      tps,
      numTokens,
    });
  };

  const streamer = new CallbackTextStreamer(tokenizer, cb);

  // Tell the main thread we are starting
  self.postMessage({ status: "start" });

  const outputs = await model.generate({
    ...inputs,
    max_new_tokens: 512,
    streamer,
    stopping_criteria,
  });
  const outputText = tokenizer.batch_decode(outputs, {
    skip_special_tokens: false,
  });

  // Send the output back to the main thread
  self.postMessage({
    status: "complete",
    output: outputText,
  });
  ///////////////

  // Actually run the model on the input text
  // let result = await model(text);
  // return result;
};

////////////////////// 1. Context Menus //////////////////////
//
// Add a listener to create the initial context menu items,
// context menu items only need to be created at runtime.onInstalled
chrome.runtime.onInstalled.addListener(function () {
  // Register a context menu item that will only show up for selection text.
  chrome.contextMenus.create({
    id: "classify-selection",
    title: 'Classify "%s"',
    contexts: ["selection"],
  });
});

// Perform inference when the user clicks a context menu
chrome.contextMenus.onClicked.addListener(async (info, tab) => {
  // Ignore context menu clicks that are not for classifications (or when there is no input)
  if (info.menuItemId !== "classify-selection" || !info.selectionText) return;

  // Perform classification on the selected text
  let result = await classify(info.selectionText);

  // Do something with the result
  chrome.scripting.executeScript({
    target: { tabId: tab.id }, // Run in the tab that the user clicked in
    args: [result], // The arguments to pass to the function
    function: (result) => {
      // The function to run
      // NOTE: This function is run in the context of the web page, meaning that `document` is available.
      console.log("result", result);
      console.log("document", document);
    },
  });
});
//////////////////////////////////////////////////////////////

////////////////////// 2. Message Events /////////////////////
//
// Listen for messages from the UI, process it, and send the result back.
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
  console.log("sender", sender);
  if (message.action !== "classify") return; // Ignore messages that are not meant for classification.

  // Run model prediction asynchronously
  (async function () {
    // Perform classification
    let result = await classify(message.text);

    // Send response back to UI
    sendResponse(result);
  })();

  // return true to indicate we will send a response asynchronously
  // see https://stackoverflow.com/a/46628145 for more information
  return true;
});

Urgency

This would help enable a new ecosystem to build up around locally intelligent browser extensions and tooling.

It's urgent for me because it would be fun to build, I want to build it, and it would be more fun to be building it than not to be building it.

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.19.0-dev.20240509-69cfcba38a

Execution Provider

'webgpu' (WebGPU)

fs-eire commented 1 month ago

Thank you for reporting this issue. I will try to figure out how to fix this problem.

fs-eire commented 1 month ago

So it turns out that dynamic import (i.e. import()) and top-level await are not supported in the current service worker environment. I was not expecting import() to be banned in service workers.

Currently, the WebAssembly factory (wasm-factory.ts) uses dynamic import to load the JS glue, and this does not work in a service worker. A few potential alternative solutions turned out to be unavailable as well.

I am now trying to make a JS bundle that does not use dynamic import, specifically for service worker usage. Still working on it.
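To illustrate the constraint (a simplified sketch, not the actual wasm-factory.ts code; the glue file name is just a placeholder):

// Simplified sketch, not the actual wasm-factory.ts code: loading the
// Emscripten JS glue via dynamic import works in pages and dedicated
// workers, but the HTML spec disallows import() in a ServiceWorkerGlobalScope.
async function loadWasmGlue(glueUrl = "./ort-wasm-glue.mjs" /* placeholder name */) {
  // This is the line that throws inside a service worker:
  const glueModule = await import(glueUrl);
  return glueModule.default; // the Emscripten module factory
}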

ggaabe commented 1 month ago

Thanks, I appreciate your efforts around this. It does seem like some special-case bundle will need to be built after all; you might need IIFE or UMD for the bundler output format.
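For example, something like this with esbuild (just a sketch of the idea; the entry point and file names are illustrative, not the actual onnxruntime-web build script):

// Sketch only: produce a single-file, non-ESM bundle that avoids import().
// Entry point and output names are illustrative.
require("esbuild").build({
  entryPoints: ["./lib/index.ts"],
  bundle: true,
  format: "iife",              // or "cjs"; esbuild has no direct UMD output
  globalName: "ort",
  outfile: "dist/ort.webgpu.sw.min.js",
  minify: true,
});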

fs-eire commented 4 weeks ago

> Thanks, I appreciate your efforts around this. It does seem like some special-case bundle will need to be built after all; you might need IIFE or UMD for the bundler output format.

I have considered this option. However, Emscripten does not offer an option to output both UMD (IIFE+CJS) and ESM for the JS glue (https://github.com/emscripten-core/emscripten/issues/21899); I have to choose one or the other. I chose the ES6 output format for the JS glue because of a couple of problems with importing UMD from ESM, and because import() is the standard way to import ESM from both ESM and UMD (until this issue showed me that it does not work in service workers).

I found a way to make ORT Web work. Yes, this needs the build script to do some special handling, and it will only work for ESM, because the JS glue is ESM and there seems to be no way to import ESM from UMD in a service worker.

fs-eire commented 3 weeks ago

@ggaabe Could you please try import * as ort from "./ort.webgpu.bundle.min.js" with version 1.19.0-dev.20240604-3dd6fcc089?
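The intended usage inside the service worker would look roughly like this (a sketch; the model path and session options are placeholders):

// Sketch: consume the new bundle from the extension's service worker.
// Model path and options are placeholders.
import * as ort from "./ort.webgpu.bundle.min.js";

// No top-level await in a service worker, so wrap session creation.
async function createSession() {
  return ort.InferenceSession.create("./model.onnx", {
    executionProviders: ["webgpu"],
  });
}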

ggaabe commented 3 weeks ago

@fs-eire my project depends on transformers.js, which imports the onnxruntime-web WebGPU backend like this:

https://github.com/xenova/transformers.js/blob/v3/src/backends/onnx.js#L24

Is this the right usage? In my project I've added this to my package.json to resolve onnxruntime-web to the new version, though the issue is still occurring:

  "overrides": {
    "onnxruntime-web": "1.19.0-dev.20240604-3dd6fcc089"
  }

ggaabe commented 3 weeks ago

Maybe also important: the same error is still occurring at the same spot in InferenceSession in the ONNX package, not from transformers.js. Do I need to add a resolver for onnxruntime-common as well?

fs-eire commented 3 weeks ago

#20991 makes the default ESM import use non-dynamic import; I hope this change may fix the problem. The PR is still in progress.

ggaabe commented 2 weeks ago

Hi @fs-eire, is the newly-merged fix in a released build I can try?

fs-eire commented 2 weeks ago

Please try 1.19.0-dev.20240612-94aa21c3dd

ggaabe commented 2 weeks ago

@fs-eire EDIT: Never mind the comment I just deleted; that error was because I hadn't set the webpack target to webworker.
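For anyone else hitting that, the relevant bit of my webpack config is roughly this (an excerpt; entry and output names are illustrative, everything else omitted):

// webpack.config.js (excerpt, roughly): tell webpack the bundle runs in a
// worker context rather than a page.
module.exports = {
  target: "webworker",
  entry: "./src/background.js",
  output: { filename: "background.js" },
  // ...loaders, plugins, etc. omitted
};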

However, I'm getting a new error now (progress!):

Error: no available backend found. ERR: [webgpu] RuntimeError: null function or function signature mismatch

ggaabe commented 2 weeks ago

Update: I found that the error is happening here: https://github.com/microsoft/onnxruntime/blob/fff68c3151b774d8a2e9290e96b9f707cd950216/js/common/lib/backend-impl.ts#L83-L86

For some reason the webgpu backend.init promise is rejecting due to the null function or function signature mismatch error. This is much further along than we were before, though.
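Roughly what that region appears to do, paraphrased rather than quoted from the source: each requested EP's backend init is awaited, and a rejection there ends up in the aggregated error.

// Rough paraphrase of the backend resolution flow (not the actual source):
// the first backend whose init() resolves wins; rejections are collected into
// the "no available backend found" error seen earlier.
async function resolveBackend(requestedExecutionProviders, registeredBackends) {
  const errors = [];
  for (const name of requestedExecutionProviders) {
    const backend = registeredBackends.get(name); // hypothetical lookup
    try {
      // This is roughly where "null function or function signature mismatch" surfaces:
      await backend.init(name);
      return backend;
    } catch (err) {
      errors.push(`[${name}] ${err}`);
    }
  }
  throw new Error("no available backend found. ERR: " + errors.join(", "));
}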

fs-eire commented 2 weeks ago

> Update: I found that the error is happening here:
>
> https://github.com/microsoft/onnxruntime/blob/fff68c3151b774d8a2e9290e96b9f707cd950216/js/common/lib/backend-impl.ts#L83-L86
>
> For some reason the webgpu backend.init promise is rejecting due to the null function or function signature mismatch error. This is much further along than we were before, though.

Could you share the repro steps with me?

ggaabe commented 2 weeks ago

@fs-eire You'll need to run the WebGPU setup in a Chrome extension.

  1. You can use my code I just published here: https://github.com/ggaabe/extension

  2. run npm install

  3. run npm run build

  4. open Chrome's "Manage Extensions" page

  5. click "Load unpacked"

  6. select the build folder from the repo.

  7. open the "AI WebGPU Extension" extension

  8. type some text in the text input; it will load Phi-3 mini, and after loading finishes this error will occur

  9. if you view the extension in the extension manager and select the "Inspect views service worker" link before opening the extension, it will bring up an inspection window where you can view the errors as they occur. A little "Errors" bubble link also shows up there after they occur.

  10. You will need to click the "Refresh" button on the extension in the extension manager to rerun the error, because it does not attempt to reload the model after the first attempt until another refresh

fs-eire commented 2 weeks ago

@ggaabe I did some debugging on my box and made some fixes:

  1. Changes to ONNX Runtime Web:

    #21073 was created to make sure the WebAssembly file can be loaded correctly when env.wasm.wasmPaths is not specified.

  2. Changes to https://github.com/ggaabe/extension

    https://github.com/ggaabe/extension/pull/1 needs to be made to the extension example to make it load the model correctly. Please note:

    • The onnxruntime-web version needs to be updated to consume the changes from (1) (after they are merged and published to the dev channel).
    • There are still errors in background.js, which look like incorrect params being passed to tokenizer.apply_chat_template(). However, WebAssembly is initialized and the model loads successfully.
  3. Other issues:

    • Transformers.js overrides env.wasm.wasmPaths to a CDN URL internally. At least for this example, we don't want this behavior, so we need to reset it to undefined to keep the default behavior (see the snippet after this list).
    • The multi-threaded CPU EP is not supported because Worker is not accessible in a service worker. Issue tracking: https://github.com/whatwg/html/issues/8362
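
For this example, the reset looks roughly like this (a sketch; assumes transformers.js v3's env object and the single-threaded workaround already used in background.js):

// Sketch (assumes transformers.js v3's `env`): undo the internal CDN override so
// onnxruntime-web resolves its WebAssembly files relative to the bundle instead.
import { env } from "@xenova/transformers";

env.backends.onnx.wasm.wasmPaths = undefined;
// Keep single-threaded CPU fallback, since Worker is unavailable in service workers.
env.backends.onnx.wasm.numThreads = 1;
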
ggaabe commented 1 week ago

Awesome, thank you for your thoroughness in explaining this and tackling this head on. Is there a dev channel version I can test out?

fs-eire commented 1 week ago

Not yet. Will update here once it is ready.

ggaabe commented 1 week ago

Sorry to bug you; is there a dev build number yet? I wasn't sure how often releases run.

fs-eire commented 1 week ago

> Sorry to bug you; is there a dev build number yet? I wasn't sure how often releases run.

Please try 1.19.0-dev.20240621-69d522f4e9

ggaabe commented 1 week ago

@fs-eire I'm getting one new error:

ort.webgpu.bundle.min.mjs:6 Uncaught (in promise) Error: The data is not on CPU. Use `getData()` to download GPU data to CPU, or use `texture` or `gpuBuffer` property to access the GPU data directly.
    at get data (ort.webgpu.bundle.min.mjs:6:13062)
    at get data (tensor.js:62:1)

I pushed the code changes to my repo and fixed the call to the tokenizer. To reproduce, just type one letter in the Chrome extension's text input and wait.

nickl1234567 commented 1 week ago

Hey, I also need this. I am struggling with importing this version. So far I have been importing ONNX Runtime using import * as ort from "https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/esm/ort.webgpu.min.js". However, when I change to import * as ort from "https://cdn.jsdelivr.net/npm/onnxruntime-web@1.19.0-dev.20240621-69d522f4e9/dist/esm/ort.webgpu.min.js", there doesn't seem to be an .../esm/ folder. Do you know why that is, and how to import it then?

fs-eire commented 1 week ago

> Hey, I also need this. I am struggling with importing this version. So far I have been importing ONNX Runtime using import * as ort from "https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/esm/ort.webgpu.min.js". However, when I change to import * as ort from "https://cdn.jsdelivr.net/npm/onnxruntime-web@1.19.0-dev.20240621-69d522f4e9/dist/esm/ort.webgpu.min.js", there doesn't seem to be an .../esm/ folder. Do you know why that is, and how to import it then?

Just replace .../esm/ort.webgpu.min.js with .../ort.webgpu.min.mjs and it should work. If you are also using a service worker, use ort.webgpu.bundle.min.mjs instead of ort.webgpu.min.mjs.
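That is, roughly:

// From a normal page or a dedicated worker:
import * as ort from "https://cdn.jsdelivr.net/npm/onnxruntime-web@1.19.0-dev.20240621-69d522f4e9/dist/ort.webgpu.min.mjs";

// From a service worker, use the bundled build instead:
// import * as ort from "https://cdn.jsdelivr.net/npm/onnxruntime-web@1.19.0-dev.20240621-69d522f4e9/dist/ort.webgpu.bundle.min.mjs";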

fs-eire commented 1 week ago

> @fs-eire I'm getting one new error:
>
> ort.webgpu.bundle.min.mjs:6 Uncaught (in promise) Error: The data is not on CPU. Use `getData()` to download GPU data to CPU, or use `texture` or `gpuBuffer` property to access the GPU data directly.
>     at get data (ort.webgpu.bundle.min.mjs:6:13062)
>     at get data (tensor.js:62:1)
>
> I pushed the code changes to my repo and fixed the call to the tokenizer. To reproduce, just type one letter in the Chrome extension's text input and wait.

This may be a problem in transformers.js. Could you check whether this problem happens in a normal page? If so, please report the issue to transformers.js. If it's only happening in the service worker, I can take a closer look.
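If it is on the transformers.js side, the workaround the error message itself suggests looks roughly like this (a sketch; outputTensor stands for whichever output tensor's data is being read):

// Sketch of the workaround suggested by the error message: when a tensor's data
// is still on the GPU, download it explicitly instead of reading `.data`.
async function readOutput(outputTensor) {
  const cpuData = await outputTensor.getData(); // copies the GPU buffer back to CPU
  return cpuData;
}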