huggingface / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js

The data is not on CPU. Use getData() to download GPU data to CPU, or use texture or gpuBuffer property to access the GPU data directly. #824

Open raodaqi opened 4 months ago

raodaqi commented 4 months ago

System Info

vue

Environment/Platform

Description

This came up when using webgpu from a SharedWorker. The pipeline is created with:

pipeline(this.task, this.model, {
  dtype: {
    encoder_model: 'fp32',
    decoder_model_merged: 'q4', // or 'fp32' ('fp16' is broken)
  },
  device: 'webgpu',
  progress_callback,
});

and fails with (screenshot attached):

"Error: The data is not on CPU. Use getData() to download GPU data to CPU, or use texture or gpuBuffer property to access the GPU data directly."

Reproduction

worker.js:

pipeline(this.task, this.model, {
  dtype: {
    encoder_model: 'fp32',
    decoder_model_merged: 'q4', // or 'fp32' ('fp16' is broken)
  },
  device: 'webgpu',
  progress_callback,
});

app.vue
new SharedWorker(new URL('./worker.js', import.meta.url), { type: 'module', });
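
For completeness, a minimal sketch of how the SharedWorker side could be wired up (the onconnect/port handling, task, and message shape here are assumptions for illustration, not taken from the original report):

import { pipeline } from '@xenova/transformers';

let pipe = null;

// SharedWorkers receive one MessagePort per connecting page.
self.onconnect = (event) => {
  const port = event.ports[0];
  port.onmessage = async (e) => {
    if (!pipe) {
      // same options as in the reproduction above; task/model/input are placeholders
      pipe = await pipeline(e.data.task, e.data.model, {
        dtype: { encoder_model: 'fp32', decoder_model_merged: 'q4' },
        device: 'webgpu',
      });
    }
    const output = await pipe(e.data.input);
    port.postMessage(output);
  };
};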

xenova commented 4 months ago

Does this also happen when using a normal web worker? 👀

raodaqi commented 4 months ago

> Does this also happen when using a normal web worker? 👀

Normal workers do not have this problem, but SharedWorkers do.

xenova commented 4 months ago

Interesting - thanks for the information! @guschmue any idea what's going wrong?

guschmue commented 4 months ago

Not sure about SharedWorkers, let me look at it. Without looking, I'd assume SharedWorkers behave similarly to proxy = true.
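
A minimal sketch of what testing that hypothesis could look like (assuming the wasm proxy setting is indeed the relevant knob; this is speculative, not a confirmed fix):

import { env } from '@xenova/transformers';

// Force the non-proxied code path so inference runs directly in the
// (Shared)Worker thread instead of in an extra proxy worker.
env.backends.onnx.wasm.proxy = false;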

kyr0 commented 3 months ago

@xenova While debugging the codebase, I spotted one glitch after another.

I'm running my code in a web worker (specifically in a web extension) and was getting issues with dynamic imports and the webgpu backend not being available. So I checked for issues in the onnxruntime-web package, since this seemed to be an upstream problem. I found https://github.com/microsoft/onnxruntime/issues/20876 and switched to onnxruntime-web@1.19.0-dev.20240621-69d522f4e9 as suggested by the developer.

After that, I ran into plenty of issues with the dynamic imports that transformers.js attempts in env.js, so I got a bit tired and simply forked it, removed the code that tries to use Node.js functionality, and got rid of all the auto-loading. I ended up reverse-engineering the pipeline I needed for intfloat/multilingual-e5-small and got the following code running fine; the session initializes just fine, all good. It runs without issues until await model(modelInputs); is invoked.

  // load tokenizer config
  const tokenizerConfig = mlModel.tokenizerConfig;
  const tokenizerJSON = JSON.parse(
    new TextDecoder("utf-8").decode(await mlModel.tokenizer.arrayBuffer()),
  );

  console.log("tokenizerConfig", tokenizerConfig);
  console.log("tokenizer", tokenizerJSON);

  // create tokenizer
  const tokenizer = new XLMRobertaTokenizer(tokenizerJSON, tokenizerConfig);

  console.log("tokenizer", tokenizer);

  // tokenize input
  const modelInputs = tokenizer(["foo", "bar"], {
    padding: true,
    truncation: true,
  });

  console.log("modelInputs", modelInputs);

  // https://huggingface.co/Xenova/multilingual-e5-small in ORT format
  const mlBinaryModelBuffer = await mlModel.blob.arrayBuffer();

  const modelSession = await ONNX_WEBGPU.InferenceSession.create(
    mlBinaryModelBuffer,
    {
      executionProviders: ["webgpu"],
    },
  );
  console.log("Created model session", modelSession);

  const modelConfig = mlModel.config;
  console.log("modelConfig", modelConfig);

  const model = new BertModel(modelConfig, modelSession);
  console.log("model", model);

  const outputs = await model(modelInputs);

  let result =
    outputs.last_hidden_state ?? outputs.logits ?? outputs.token_embeddings;
  console.log("result", result);

  result = mean_pooling(result, modelInputs.attention_mask);
  console.log("meanPooling result", result);

  // normalize embeddings
  result = result.normalize(2, -1);

  console.log("normalized result", result);

When I run the model, which calls encoderForward(), the first issue occurred: setting token_type_ids to a zeroed Tensor didn't work, because apparently model_inputs.input_ids.data was undefined.

So, why was it undefined? I noticed that the proxied Tensor instance's token_type_ids now has a property called dataLocation (not location), and it was set to "cpu". Also, the data property no longer exists; instead there is a cpuData property storing the data:

(screenshot)

I tried my luck with this patch:

(screenshot)

And at least creating encoderFeeds.token_type_ids worked.
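
(The actual patch is only in the screenshot above; the following is just a sketch of the kind of fallback I mean, reading cpuData/dataLocation when data/location are missing. The getCpuData helper is hypothetical.)

// Hypothetical helper: read tensor data whether the ORT build exposes the old
// `data`/`location` properties or the newer `cpuData`/`dataLocation` ones observed above.
function getCpuData(tensor) {
  const location = tensor.location ?? tensor.dataLocation;
  if (location && location !== 'cpu') {
    throw new Error('Tensor data is on the GPU; call `await tensor.getData()` first.');
  }
  return tensor.data ?? tensor.cpuData;
}

// Example: build a zeroed token_type_ids feed with the same shape as input_ids.
const inputIds = model_inputs.input_ids;
encoderFeeds.token_type_ids = new ONNX_WEBGPU.Tensor(
  'int64',
  new BigInt64Array(getCpuData(inputIds).length), // all zeros
  inputIds.dims,
);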

Checking the other comments on that issue (https://github.com/microsoft/onnxruntime/issues/20876#issuecomment-2185339554), I realized I'm not the only one running into this, and judging by the reports here, I think it points in the same direction. That user also hit an issue right after tokenization, when invoking the model, it seems...

Next step: onnxruntime-web really doesn't like its own data structure:

(screenshots)

So I tried my luck with another nasty hack...

(screenshot)

But yeah, it doesn't help... the data structure simply seems to have changed in an incompatible way, because after all of that monkey-patching we get...

(screenshots)

As seen in the earlier screenshot, the code accesses e.data instead of cpuData again, which could lead to a ".byteLength of undefined" error. So I tried:

(screenshot)

But it did not help...

And here I had enough debugging fun for today... good night xD

guschmue commented 3 months ago

Location is not intended to be set directly, because just setting it would not move the data to the right place. The only way to set the location is indirectly, via new ort.Tensor() or ort.Tensor.fromGpuBuffer().

I'm not sure how input_ids could ever end up not on the CPU, because the only ways to get a tensor off the CPU are to call ort.Tensor.fromGpuBuffer() or to list an output in session_options.preferredOutputLocation. transformers.js doesn't call the former, and input_ids would never be an output.
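
For reference, a small sketch of the two mechanisms mentioned above (standard onnxruntime-web API; the buffer and output names are made up for illustration):

// 1) Wrapping an existing WebGPU buffer is the only direct way to create a non-CPU tensor.
const gpuTensor = ort.Tensor.fromGpuBuffer(myGpuBuffer, {
  dataType: 'float32',
  dims: [1, 384],
});

// 2) Asking the session to keep a named output on the GPU instead of downloading it.
const session = await ort.InferenceSession.create(modelBuffer, {
  executionProviders: ['webgpu'],
  preferredOutputLocation: { last_hidden_state: 'gpu-buffer' },
});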

Possibly related to the transformers.js Tensor class: when we introduced GPU buffers, the transformers.js Tensor was changed to keep the original ORT tensor instead of wrapping it (the code here: https://github.com/xenova/transformers.js/blob/v3/src/utils/tensor.js#L43).

Let me try your example.

kyr0 commented 3 months ago

Thank you for your response @guschmue. That makes sense. You can try my example by cloning https://github.com/kyr0/redaktool and running bun install && bun run dev. After that, you can load the extension in Chrome or Edge by visiting chrome://extensions/, enabling developer mode and clicking "Load unpacked extension". After selecting the folder of the cloned extension code (the folder that contains the manifest.json), it will load. Simply open a new tab of your choice or reload an existing one. Opening the service worker from the chrome://extensions/ tab will show the log. Up to modelInputs it works fine. If you uncomment all the code down to https://github.com/kyr0/redaktool/blob/main/src/lib/worker/embedding/model.ts#L69, run bun run dev again, re-install the extension code, reload the tab and re-open the service worker log, you'll find that the issues start to occur. You can find my monkey-patching attempts here: https://github.com/kyr0/redaktool/blob/main/src/lib/worker/embedding/transformers/models.js#L503 and there: https://github.com/kyr0/redaktool/blob/main/src/lib/worker/embedding/transformers/models.js#L197 (you may want to comment out the latter to restore normal behaviour).

The rest of the transformers.js code is from yesterday's revision of this codebase: a simple copy & paste with some imports removed, as they collide with Worker limitations.

Thank you in advance for taking a look!

guschmue commented 3 months ago

Oh, this looks cool. Let me try to get it to work. We'll get it sorted out :)

kyr0 commented 3 months ago

@guschmue Great, thank you very much! :) 💯 I'm also available via Discord 👍 https://discord.gg/4wR9t7cdWc

kyr0 commented 3 months ago

@guschmue Is there any way I can help? Please let me know :) I could spend some time this weekend debugging and fixing things =)

guschmue commented 3 months ago

I tested chrome extensions with webgpu and that works fine. Getting to your code soon.

guschmue commented 3 months ago

a slightly modified version of your code works for me:

import { env, pipeline, AutoModel, AutoTokenizer } from '@xenova/transformers';

// resolve models from the local extension bundle instead of the Hub
env.localModelPath = 'models/';
env.allowRemoteModels = false;
env.allowLocalModels = true;
// serve the ORT WASM/JSEP artifacts from /public/ and skip the proxy worker
env.backends.onnx.wasm.wasmPaths = "/public/";
env.backends.onnx.wasm.proxy = false;

const model_name = 'Xenova/multilingual-e5-small';
const tokenizer = await AutoTokenizer.from_pretrained(model_name);
let model = undefined;

// load the model on the webgpu device and signal readiness to the page
AutoModel.from_pretrained(model_name, { device: 'webgpu' }).then((a) => {
    model = a;
    self.postMessage({
        status: 'ready'
    });
});

async function run(input_text) {
    const tokens = tokenizer(input_text);
    const output = await model(tokens);
    console.log(output);
    return "done";
}

self.addEventListener('message', async (event) => {
    const data = event.data;
    const text = data.text;

    run(text).then((result) => {
        self.postMessage({
            status: 'resp',
            text: result
        });
    });
});

and returns:

ot {cpuData: Float32Array(88320), dataLocation: 'cpu', type: 'float32', dims: Array(3), size: 88320}

kyr0 commented 3 months ago

Thank you @guschmue, I'll try to reproduce on my side and will get back to you soon.

ChTiSh commented 3 months ago

Ha, running into the same issue and found my place here, thank you again @kyr0 <3

kyr0 commented 3 months ago

@guschmue Hmm.. are you sure it worked for you with the webgpu backend and didn't silently fall back to the WASM backend (downstream in Transformers.js)? Because it seems that @xenova/transformers has webgpu disabled: https://github.com/xenova/transformers.js/blob/main/src/backends/onnx.js#L28

(That's one of the reasons why I forked Transformers.js, patched the code, debugged my way through the call stack and implemented the pipeline step by step manually, to the point where I ended up at forwardEncode on the ONNX runtime and ran into the issues described above.)

Well, currently, with your code, I'm ending up with a "no available backend found." error.

(screenshot)

By the way, from_pretrained isn't processing the device property either, if I'm not mistaken: https://github.com/xenova/transformers.js/blob/main/src/processors.js#L2208

I was wondering a bit, checked for the device symbol in the whole repo and found only docs-related code: https://github.com/search?q=repo%3Axenova%2Ftransformers.js%20device&type=code

Also, it would be interesting to know how the postMessage code and the fetch for the WASM runtime resolve inside a Worker in my setup, with the public dir set to /public/.

I've changed the build system in my project... so I think from now on I can use a fork of the current main revision and track down the issue more easily...

@ChTiSh Haha, welcome to the "stuck club" ;)

xenova commented 3 months ago

@kyr0 GitHub search doesn't index branches other than main, so you would need to inspect the code directly. For example, the device is set here:

https://github.com/xenova/transformers.js/blob/1b4d2428225ef8f63be94bfa38a0d7fd81ac7c0c/src/models.js#L149-L162

kyr0 commented 3 months ago

@xenova Right.. however, the import in the code here was from @xenova/transformers, so I assumed the latest published version, i.e. @xenova/transformers@2.17.2, was meant.

But he probably has the package locally linked to a build of the v3 branch? I'll re-verify with v3 locally. Sorry for the confusion..

ChTiSh commented 3 months ago

I think v3 is linked against onnxruntime-web 1.18 for webgpu.


kyr0 commented 3 months ago

@ChTiSh Right. I forked it, checked it out locally, pinned it to 1.19.0-dev.20240621-69d522f4e9, built it, linked it locally and used the code from above. But it attempts to import() dynamically, and that's not going to work in a worker extension context:

(screenshot)

If only there were an option to import the runtime from user space and also pass the WASM runtime module in as a Blob. I have it all available.. but both libraries (onnxruntime-web and transformers.js) try hard to fetch()/import() dynamically.

Maybe I can monkey-patch it to provide IoC.. I had a hook working with the main branch where the internal call to importWasmModule would be passed up to userland code, so that I could implement env.backends.onnx.importWasmModule like this:

// copied over from `onnxruntime-web`
import getModule from "./transformers/ort-wasm-simd-threaded.jsep";

env.backends.onnx.importWasmModule = async (
  mjsPathOverride: string,
  wasmPrefixOverride: string,
  threading: boolean,
) => {
  return [
    undefined,
    async (moduleArgs = {}) => {
      console.log("moduleArgs", moduleArgs); // got called, continued well...
      return await getModule(moduleArgs);
    },
  ];
};

But then the emscripten-generated WASM runtime wrapper JS would still attempt to fetch() the actual WASM file. I'm still looking for a way to pass the Blob or an object URL in via moduleArgs, or to trick the backend state into mocking importWasmModule completely and assigning the internal WASM module reference some other way.
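
One thing I still want to try: emscripten factories generally accept a wasmBinary override in their module arguments, which skips the internal fetch() of the .wasm file. Whether ORT's importWasmModule hook forwards it untouched is an assumption on my part; sketch below (preloadedWasmBlob is a placeholder for the pre-loaded binary):

// Hypothetical: hand the pre-loaded WASM bytes straight to the emscripten factory so the
// generated wrapper never needs to fetch() the .wasm file itself.
const wasmBytes = await preloadedWasmBlob.arrayBuffer(); // preloadedWasmBlob is assumed to exist

env.backends.onnx.importWasmModule = async () => {
  return [
    undefined,
    async (moduleArgs = {}) => {
      return await getModule({
        ...moduleArgs,
        wasmBinary: wasmBytes, // standard emscripten Module option
      });
    },
  ];
};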

kyr0 commented 3 months ago

Maybe it should be highlighted again that this is probably not a problem with "simple" web extensions and their content scripts. I'm talking about running it in the service worker of a web extension (the background script).

// excerpt from manifest.json
"background": {
    "service_worker": "src/worker.ts", // <- Transformers.js is imported here.
    "type": "module"
  },

ChTiSh commented 3 months ago

This might sound insane, and I might be completely hallucinating, but I went through the whole process of force-overriding the onnx runtime to 1.19 and then changing the default to resolve the conflict. At the very end I reached exactly the same outcome with nothing ONNX-related in the service worker except the simple thread configuration, and literally just one line, device: 'webgpu', in the instance.

kyr0 commented 3 months ago

Yeah, the "funny" part is: if you debug it all the way through, does Transformers.js/ORT internally actually use the GPU? Because here I am, running the code like that and... have fun checking the screenshots :)

Code:

(screenshot)

When I debug from_pretrained(), we can clearly see that the session is configured to use webgpu:

(screenshot)

It's constructing a session...

(screenshot)

But here the fun begins... why is the instance a WebAssembly one?

(screenshots)

Result:

(screenshot)

I guess I'm hallucinating too xD Well, I haven't checked the ORT implementation... maybe the WASM backend calls through to WebGPU and returns the data via the HEAP, which is then passed back by the runtime and deserialized as an ONNX Tensor reporting its location as cpu... but it's late, 4:30am again, I'm going to sleep... :)

guschmue commented 3 months ago

Catching up... yes, I'm sure I'm using webgpu. My package.json points to a local repo with the transformers.js v3 branch, and I can see webgpu logs in the service worker console, which you can enable with:

env.backends.onnx.logLevel = "verbose";
env.backends.onnx.debug = true;

> @guschmue Hmm.. are you sure it worked for you with the webgpu backend and didn't silently fall back to the WASM backend? [...]

guschmue commented 3 months ago

q8 isn't going to work with webgpu - webgpu itself doesn't support it yet (but it might come). We fall back to the wasm op. env.backends.onnx.logLevel = "verbose"; will tell you on which device each op landed.

(screenshot)
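
For reference, one way to avoid the q8 fallback is to request fp32 (or fp16) weights explicitly; a sketch assuming the v3 from_pretrained options, with exact dtype support depending on the model and ORT version:

import { AutoModel } from '@xenova/transformers';

// Requesting fp32 weights avoids the q8 ops that currently land on the wasm kernels.
const model = await AutoModel.from_pretrained('Xenova/multilingual-e5-small', {
  device: 'webgpu',
  dtype: 'fp32',
});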

kyr0 commented 3 months ago

Thank you @guschmue ! That explains the different runtime behaviour.

Well, off-topic: limited core qint8 functionality is growing and is, to some extent, available at least in recent versions of Chrome. You can check out my code to verify: https://github.com/kyr0/fast-dotproduct/blob/main/experiments/dot4U8Packed.js#L29

But yeah, there is no generalized shader-u8 or anything, that's right. There's only shader-f16 for float16 data types: https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/wasm/jsep/backend-webgpu.ts#L231
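
(For anyone who wants to check this locally, feature detection boils down to something like the following; this is plain WebGPU API, nothing ORT-specific:)

// Check whether the adapter exposes the shader-f16 feature used for float16 kernels.
const adapter = await navigator.gpu.requestAdapter();
if (adapter?.features.has('shader-f16')) {
  const device = await adapter.requestDevice({ requiredFeatures: ['shader-f16'] });
  console.log('shader-f16 available', device);
} else {
  console.log('shader-f16 not supported on this adapter');
}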

I should have thought about that. Thanks for the heads-up. Now that I think about it, it's obvious.

And man, there is so much potential for optimization in this backend implementation.. somebody should probably rewrite all the looping over data structures in WebAssembly, or at least unroll the loops to be JIT-optimizer friendly... https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/wasm/jsep/backend-webgpu.ts#L532

I demonstrated the gains from using performance-optimized code here: https://github.com/kyr0/fast-dotproduct. Analyzing this repo's code, I realized that especially https://github.com/xenova/transformers.js/blob/v3/src/utils/tensor.js and https://github.com/xenova/transformers.js/blob/v3/src/utils/maths.js could also massively benefit from JIT optimization and a WebAssembly-based implementation.

The fast-dotproduct repo also demonstrates how, using emscripten, one can inline the emscripten-generated WASM binary in the runtime file, and the runtime file in the library file, so that there is no need to load anything dynamically; it's available instantly. WebAssembly nowadays is absolutely evergreen, with > 97% support per https://caniuse.com/wasm -- I don't think there's even a need to check whether the constructor is available or not :)

Just a few ideas..

P.S.: Once good test coverage sets the baseline for how each algorithm should work exactly, it would be safe to implement a WebAssembly alternative without many breaking changes. Currently the coverage isn't great, but I guess I understand why.. still, for an attempt at writing an alternative set of implementations, it would really make pragmatic sense to have it, to prevent regressions :)

I'd be willing to start working on optimizing mean pooling / normalization, as I need that to be fast for my in-browser vector db, in case there is a consensus on that being a good idea :) (I'm normalizing my locally inferred text embeddings so that a simple dot product yields a cosine similarity score, since the magnitudes are already 1; so "insert speed" currently has a bottleneck in the Transformers.js normalization and pooling algos.)
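
To make the idea concrete, here is a plain-JS baseline of the three steps I mean (mean pooling, L2 normalization, then a dot product that equals cosine similarity once magnitudes are 1). This is just a reference sketch, not the Transformers.js implementation:

// Mean-pool token embeddings (flattened [seqLen x hiddenSize]) into one vector, honoring the attention mask.
function meanPool(tokenEmbeddings, attentionMask, hiddenSize) {
  const pooled = new Float32Array(hiddenSize);
  let count = 0;
  for (let t = 0; t < attentionMask.length; t++) {
    if (!attentionMask[t]) continue;
    count++;
    for (let h = 0; h < hiddenSize; h++) {
      pooled[h] += tokenEmbeddings[t * hiddenSize + h];
    }
  }
  for (let h = 0; h < hiddenSize; h++) pooled[h] /= count || 1;
  return pooled;
}

// L2-normalize in place so that dot(a, b) equals cosine(a, b).
function l2Normalize(vec) {
  let norm = 0;
  for (let i = 0; i < vec.length; i++) norm += vec[i] * vec[i];
  norm = Math.sqrt(norm) || 1;
  for (let i = 0; i < vec.length; i++) vec[i] /= norm;
  return vec;
}

// On normalized vectors this is the cosine similarity.
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}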