raodaqi opened this issue 4 months ago (Open)
Does this also happen when using a normal web worker? 👀
Normal workers do not have this problem, but SharedWorkers do.
Interesting - thanks for the information! @guschmue any idea what's going wrong?
Not sure about SharedWorkers, let me look at it. Without looking, I'd assume SharedWorkers will behave similarly to proxy = true.
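For reference, that flag can be toggled from userland; a minimal sketch, assuming the env API used later in this thread:

// sketch: check whether a plain worker shows the same symptoms as a SharedWorker
// once the onnx wasm backend runs behind a proxy worker (assumption: env API as below)
import { env } from "@xenova/transformers";
env.backends.onnx.wasm.proxy = true;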
@xenova Debugging the codebase, I spotted one glitch after another.

I'm running my code in a web worker (specifically in a web extension) and was getting issues with dynamic imports and the webgpu backend not being available. So I checked for issues with the onnxruntime-web package, as this seemed to be an upstream issue. I found https://github.com/microsoft/onnxruntime/issues/20876 and switched to onnxruntime-web@1.19.0-dev.20240621-69d522f4e9 as suggested by the developer.

After that, I ran into plenty of issues with the dynamic imports transformers.js tries to do in env.js, so I got a bit tired and simply forked it, removed the code that would try to use Node.js functionality, and got rid of all the auto-loading. I ended up reverse engineering the pipeline that I needed for intfloat/multilingual-e5-small and got the following code running fine; the session is initialized just fine, all good. It runs without issues until await model(modelInputs); is invoked.
// load tokenizer config
const tokenizerConfig = mlModel.tokenizerConfig;
const tokenizerJSON = JSON.parse(
  new TextDecoder("utf-8").decode(await mlModel.tokenizer.arrayBuffer()),
);
console.log("tokenizerConfig", tokenizerConfig);
console.log("tokenizer", tokenizerJSON);

// create tokenizer
const tokenizer = new XLMRobertaTokenizer(tokenizerJSON, tokenizerConfig);
console.log("tokenizer", tokenizer);

// tokenize input
const modelInputs = tokenizer(["foo", "bar"], {
  padding: true,
  truncation: true,
});
console.log("modelInputs", modelInputs);

// https://huggingface.co/Xenova/multilingual-e5-small in ORT format
const mlBinaryModelBuffer = await mlModel.blob.arrayBuffer();
const modelSession = await ONNX_WEBGPU.InferenceSession.create(
  mlBinaryModelBuffer,
  {
    executionProviders: ["webgpu"],
  },
);
console.log("Created model session", modelSession);

const modelConfig = mlModel.config;
console.log("modelConfig", modelConfig);

const model = new BertModel(modelConfig, modelSession);
console.log("model", model);

const outputs = await model(modelInputs);
let result =
  outputs.last_hidden_state ?? outputs.logits ?? outputs.token_embeddings;
console.log("result", result);

result = mean_pooling(result, modelInputs.attention_mask);
console.log("meanPooling result", result);

// normalize embeddings
result = result.normalize(2, -1);
console.log("normalized result", result);
When I run the model, which calls encoderForward(), the first issue occurred: setting token_type_ids to a zeroed Tensor didn't work, because apparently model_inputs.input_ids.data was undefined.

So, why was it undefined? I noticed that the proxied Tensor instance now has a property called dataLocation (not location), and it was set to "cpu". Also, the data property no longer exists; instead there is now a cpuData property storing the data.

I tried my luck with a patch, and at least creating encoderFeeds.token_type_ids worked.
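A hypothetical sketch of that kind of patch (not the exact one), falling back to cpuData when data is missing while building the zeroed token_type_ids feed:

// hypothetical sketch inside encoderForward(): fall back to cpuData when the
// ort tensor no longer exposes .data
const inputIds = model_inputs.input_ids;
const backingData = inputIds.data ?? inputIds.cpuData; // newer ort builds keep the buffer in cpuData
encoderFeeds.token_type_ids = new Tensor(
  "int64",
  new BigInt64Array(backingData.length), // zero-initialized, same element count as input_ids
  inputIds.dims,
);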
Checking the other comments on that issue (https://github.com/microsoft/onnxruntime/issues/20876#issuecomment-2185339554), I realized that I'm not the only one running into this, and I think it could point in the same direction: that user also hit an issue right after tokenization, when invoking the model, it seems.
Next step: onnxruntime-web really doesn't like its own data structure:

So I tried my luck with another nasty hack...

But yeah, it doesn't help... the data structure simply seems to have changed in an incompatible way, because after all of that monkey patching of data structures, we get...

As we could see in the screenshot before, the code would access e.data instead of cpuData again, and this could potentially lead to a .byteLength of undefined. So I tried a shim for that as well, but it did not help...
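For illustration, such a shim would essentially re-expose .data on the feed tensors (a hypothetical sketch, not the exact hack):

// hypothetical sketch: re-expose .data on feed tensors whose backing buffer
// moved to .cpuData, so downstream .byteLength reads don't hit undefined
for (const tensor of Object.values(encoderFeeds)) {
  if (tensor.data === undefined && tensor.cpuData !== undefined) {
    Object.defineProperty(tensor, "data", { get: () => tensor.cpuData });
  }
}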
And here I had enough debugging fun for today... good night xD
Location is not intended to be set, because just setting it would not move the data to the right place. The only way to set the location is indirectly, via new ort.Tensor() or ort.Tensor.fromGpuBuffer().
I'm not sure how input_ids could ever not be on cpu, because the only ways to get a tensor off the cpu are to call ort.Tensor.fromGpuBuffer() or to list an output in session_options.preferredOutputLocation. transformers.js is not calling the first, and input_ids would never be an output.
Possibly related to the transformers.js Tensor class. When we introduced GPU buffers, the transformers.js Tensor was changed to keep the original ort tensor instead of wrapping it (the code here: https://github.com/xenova/transformers.js/blob/v3/src/utils/tensor.js#L43).
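Roughly, the difference being described (illustrative names, not the actual classes):

// before: the transformers.js Tensor copied the ort tensor's fields onto itself
class CopyingTensor {
  constructor(ortTensor) {
    this.data = ortTensor.data; // a direct reference to the CPU buffer
    this.dims = ortTensor.dims;
    this.type = ortTensor.type;
  }
}

// after gpu-buffer support: the original ort tensor is kept and read lazily, so
// whatever fields the ort build exposes (cpuData, dataLocation, ...) surface directly
class KeepingTensor {
  constructor(ortTensor) {
    this.ort_tensor = ortTensor;
  }
  get data() { return this.ort_tensor.data; }
  get dims() { return this.ort_tensor.dims; }
}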
Let me try your example.
Thank you for your response @guschmue. That makes sense. You can try my example by cloning https://github.com/kyr0/redaktool and running bun install && bun run dev. After that, you can load the extension in Chrome or Edge by visiting chrome://extensions/, enabling developer mode and choosing "Load unpacked extension". After selecting the folder of the cloned extension code (the folder that contains the manifest.json), it will load. You can simply open a new tab of choice or reload an existing one. Opening the service worker from the chrome://extensions/ tab will show the log. Up to modelInputs it works fine. If you uncomment all the code down to https://github.com/kyr0/redaktool/blob/main/src/lib/worker/embedding/model.ts#L69, run bun run dev again, re-install the extension code, reload the tab and re-open the service worker log, you'll find that the issues start to occur. You can find my monkey patching attempts here: https://github.com/kyr0/redaktool/blob/main/src/lib/worker/embedding/transformers/models.js#L503 and here: https://github.com/kyr0/redaktool/blob/main/src/lib/worker/embedding/transformers/models.js#L197 (you may want to comment out the latter to restore normal behaviour).
The rest of the transformers.js code is from yesterday's revision of this code base: a simple copy & paste with some imports removed, as they collide with Worker limitations.
Thank you in advance for taking a look!
oh, this looks cool. Let me try to get it to work. We'll get it sorted :)
@guschmue Great, thank you very much! :) 💯 I'm also available via Discord 👍 https://discord.gg/4wR9t7cdWc
@guschmue Is there any way I can help? Please let me know :) I could spend some time this weekend debugging and fixing things =)
I tested chrome extensions with webgpu and that works fine. Getting to your code soon.
a slightly modified version of your code works for me:
import { env, pipeline, AutoModel, AutoTokenizer } from '@xenova/transformers';

env.localModelPath = 'models/';
env.allowRemoteModels = false;
env.allowLocalModels = true;
env.backends.onnx.wasm.wasmPaths = "/public/";
env.backends.onnx.wasm.proxy = false;

const model_name = 'Xenova/multilingual-e5-small';
const tokenizer = await AutoTokenizer.from_pretrained(model_name);

let model = undefined;
AutoModel.from_pretrained(model_name, { device: 'webgpu' }).then((a) => {
  model = a;
  self.postMessage({
    status: 'ready'
  });
});

async function run(input_text) {
  const tokens = tokenizer(input_text);
  const output = await model(tokens);
  console.log(output);
  return "done";
}

self.addEventListener('message', async (event) => {
  const data = event.data;
  const text = data.text;
  run(text).then((result) => {
    self.postMessage({
      status: 'resp',
      text: result
    });
  });
});
and returns:
ot {cpuData: Float32Array(88320), dataLocation: 'cpu', type: 'float32', dims: Array(3), size: 88320}
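For completeness, the page side driving that worker might look roughly like this (a sketch; the file name is a placeholder, and the message shapes follow the worker code above):

// sketch of the page-side counterpart of the worker above
const worker = new Worker(new URL('./worker.js', import.meta.url), { type: 'module' });
worker.addEventListener('message', (event) => {
  if (event.data.status === 'ready') {
    worker.postMessage({ text: 'query: how big is London?' });
  } else if (event.data.status === 'resp') {
    console.log('worker result:', event.data.text);
  }
});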
Thank you @guschmue, I'll try to reproduce on my side and will get back to you soon.
Ha, running into the same issue and found my place here, thank you again @kyr0 <3
@guschmue Hmm.. are you sure that it worked for you with the webgpu backend and didn't silently fall back to the WASM backend (downstream in Transformers.js)? Because it seems that @xenova/transformers has webgpu disabled: https://github.com/xenova/transformers.js/blob/main/src/backends/onnx.js#L28

(One of the reasons why I forked Transformers.js and patched the code, debugged my way through the call stack and implemented it step by step manually, to the point where I ended up at forwardEncode on the ONNX runtime, running into the several issues described here.)

Well, currently, with your code, I'm ending up with a no available backend found. error.

Btw., from_pretrained isn't processing the device property either, if I'm not mistaken:
https://github.com/xenova/transformers.js/blob/main/src/processors.js#L2208

I was wondering a bit, and checked for the device symbol in the whole repo and found only docs-related code:
https://github.com/search?q=repo%3Axenova%2Ftransformers.js%20device&type=code

Also, it would be interesting for me to know how the postMessage code worked in my code, and how the fetch would resolve the WASM runtime inside a Worker in my code, with the public dir set to /public/.

I've changed the build system in my project... so I think from now on I can use a fork of the current main revision and track down the issue more easily...

@ChTiSh Haha, welcome to the "stuck club" ;)
@kyr0 GitHub search doesn't index branches other than main, so you would need to inspect the code directly. For example, the device is set here:
@xenova Right.. however, the import in the example code was from @xenova/transformers, so I was assuming the latest published version, i.e. @xenova/transformers@2.17.2, was meant.

But he probably has the package locally linked to a build of the v3 branch? I'll re-verify with v3 locally. Sorry for the confusion..
I think V3 is linked with 1.18 onnx webgpu.
@ChTiSh Right. I forked it, checked it out locally, pinned it to 1.19.0-dev.20240621-69d522f4e9, built it, linked it locally and used the code from above. But it attempts to import() dynamically, and that's not going to work in a worker extension context.

If only there was an option to import the runtime from user space and also pass the WASM runtime Module down as a Blob. I have it all... but both libraries (onnxruntime and transformers.js) try hard to fetch()/import() dynamically.

Maybe I can monkey patch it to provide IoC.. I had a hook working with the main branch where the internal call to importWasmModule would be passed up to userland code, so that I could implement env.backends.onnx.importWasmModule like this:
// copied over from the `onnxruntime-web` package
import getModule from "./transformers/ort-wasm-simd-threaded.jsep";

env.backends.onnx.importWasmModule = async (
  mjsPathOverride: string,
  wasmPrefixOverride: string,
  threading: boolean,
) => {
  return [
    undefined,
    async (moduleArgs = {}) => {
      console.log("moduleArgs", moduleArgs); // got called, continued well...
      return await getModule(moduleArgs);
    },
  ];
};
But then, the emscripten-generated WASM runtime wrapper JS code would still attempt to fetch() the actual WASM file. I'm still looking for a way to pass the Blob or an object URL in via moduleArgs, or to trick the backend state into mocking importWasmModule completely and assigning the internal WASM module reference some other way.
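One idea for that remaining fetch(): emscripten-generated wrappers usually honor the standard locateFile / instantiateWasm Module hooks, so the bundled binary could perhaps be supplied through moduleArgs. A sketch under that assumption, meant to live inside the importWasmModule hook above (wasmBlob is a placeholder for the bundled .wasm as a Blob):

// sketch: supplying a pre-bundled .wasm via standard emscripten Module hooks
const wasmObjectUrl = URL.createObjectURL(wasmBlob);

const ortWasmModule = await getModule({
  ...moduleArgs,
  // point the wrapper's internal .wasm lookup at the object URL instead of a remote fetch
  locateFile: (path, prefix) => (path.endsWith(".wasm") ? wasmObjectUrl : prefix + path),
  // or skip fetching entirely and instantiate from bytes already in memory
  instantiateWasm: (imports, onSuccess) => {
    wasmBlob
      .arrayBuffer()
      .then((bytes) => WebAssembly.instantiate(bytes, imports))
      .then(({ instance, module }) => onSuccess(instance, module));
    return {}; // async path: exports are delivered through onSuccess
  },
});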
Maybe it should be highlighted again that this is probably not a problem with "simple" web extensions and their content scripts. I'm talking about running it in the service worker of a web extension (background script).
// excerpt from manifest.json
"background": {
  "service_worker": "src/worker.ts", // <- Transformers.js is imported here
  "type": "module"
},
This might sound insane, and I might be completely hallucinating, but I went through the whole process of force-overriding the onnx runtime to 1.19 and then changing the default to resolve the conflict, and in the end I reached exactly the same outcome by putting nothing ONNX-related in the service worker except the simple thread configuration, and literally just having one line, device: 'webgpu', in the instance.
Yeah, the "funny" part is, if you debug it through, will Tensorflow.js/ORT internally actually use the GPU? Because here I am, running the code like that and... have fun checking the screenshots :)
Code:
When I debug from_pretrained(), we can clearly see that the session is configured to use webgpu:

It's constructing a session...

But here the fun begins... why is the instance a WebAssembly one?

Result:

I guess I'm hallucinating too xD Well, I haven't checked the ORT implementation... maybe the WASM calls through to WebGPU and returns the data via the HEAP, which is then passed back by the runtime and deserialized as an ONNX Tensor reporting its location as cpu... but it's late, 4:30am again, I'm going to sleep... :)
catching up ... yes, I'm sure I'm using webgpu. My package.json points to a local repo with the transformers.js v3 branch. And I can see webgpu logs in the service worker console, which you can enable with
env.backends.onnx.logLevel = "verbose";
env.backends.onnx.debug = true;
q8 isn't going to work with webgpu - webgpu itself doesn't support it yet (but that might come). We'd fall back to the wasm op. env.backends.onnx.logLevel = "verbose"; would tell you on which device each op landed.
Thank you @guschmue ! That explains the different runtime behaviour.
Well, off-topic: limited qint8 core functionality is growing and is, to some extent, available at least in recent versions of Chrome. You can check out my code to verify: https://github.com/kyr0/fast-dotproduct/blob/main/experiments/dot4U8Packed.js#L29

But yeah, there is no generalized shader-u8 or anything, that's right. There's only shader-f16 for float16 data types:
https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/wasm/jsep/backend-webgpu.ts#L231
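A quick way to check for that in the browser console (standard WebGPU API, nothing Transformers.js-specific):

// check whether the adapter exposes f16 shader support
const adapter = await navigator.gpu.requestAdapter();
console.log('shader-f16 supported:', adapter?.features.has('shader-f16'));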
I should have thought about that. Thanks for the heads-up. Now that I think about it, it's obvious.
And man, there is so much potential for optimization in this backend implementation.. somebody should probably rewrite all the looping over data structures in WebAssembly, or at least unroll the loops to be JIT-optimizer friendly... https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/wasm/jsep/backend-webgpu.ts#L532

I demonstrated the gains of performance-optimized code here: https://github.com/kyr0/fast-dotproduct Analyzing this repo's code, I realized that especially https://github.com/xenova/transformers.js/blob/v3/src/utils/tensor.js and https://github.com/xenova/transformers.js/blob/v3/src/utils/maths.js could also massively benefit from JIT optimization and a WebAssembly-based implementation.

The fast-dotproduct repo also demonstrates how, using emscripten, one can inline the emscripten-generated WASM binary in the runtime file, and the runtime file in the library file, so that nothing needs to be loaded dynamically; it's available instantly. WebAssembly nowadays is absolutely evergreen, with > 97% support according to https://caniuse.com/wasm -- I think there isn't even a need to check whether the constructor is available :)

Just a few ideas..

ps.: Once good test coverage sets the baseline for how exactly each algorithm should behave, it would be safe to implement a WebAssembly alternative without many breaking changes. Currently the coverage isn't exactly great, and I guess I understand why.. but for such an attempt at writing an alternative set of implementations, it would really make sense, pragmatically, to prevent regressions :)

I'd be willing to start working on optimizing mean pooling / normalization, as I need that to be fast for my in-browser vector DB, just in case there is a consensus on that being a good idea :) (I'm normalizing my locally inferred text embeddings so that a simple dot product yields a cosine similarity score, since the magnitudes are already 1; so "insert speed" currently has a bottleneck, which is the Transformers.js normalization and pooling algorithms.)
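For reference, a plain-JS sketch of the two steps I mean (mean pooling over the attention mask, then L2 normalization, so a dot product of two normalized rows equals their cosine similarity). It assumes row-major [batch, seq, hidden] data and is not the Transformers.js implementation:

// sketch: mean pooling + L2 normalization over a row-major [batch, seq, hidden] buffer
// (assumption: mask entries are 1 for real tokens and 0 for padding)
function meanPoolAndNormalize(hiddenStates, attentionMask, batch, seq, hidden) {
  const out = new Float32Array(batch * hidden);
  for (let b = 0; b < batch; ++b) {
    let tokenCount = 0;
    for (let s = 0; s < seq; ++s) {
      if (Number(attentionMask[b * seq + s]) === 0) continue; // skip padding
      tokenCount++;
      const base = (b * seq + s) * hidden;
      for (let h = 0; h < hidden; ++h) {
        out[b * hidden + h] += hiddenStates[base + h];
      }
    }
    // average over the real tokens, then scale the row to unit length
    let sumSquares = 0;
    for (let h = 0; h < hidden; ++h) {
      out[b * hidden + h] /= tokenCount || 1;
      sumSquares += out[b * hidden + h] * out[b * hidden + h];
    }
    const norm = Math.sqrt(sumSquares) || 1;
    for (let h = 0; h < hidden; ++h) {
      out[b * hidden + h] /= norm;
    }
    // with unit-length rows, dot(rowA, rowB) is the cosine similarity
  }
  return out;
}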
System Info
vue
Environment/Platform
Description
When using webgpu in a SharedWorker with

pipeline(this.task, this.model, {
  dtype: {
    encoder_model: 'fp32',
    decoder_model_merged: 'q4', // or 'fp32' ('fp16' is broken)
  },
  device: 'webgpu',
  progress_callback,
});

I ran into this error: "Error: The data is not on CPU. Use getData() to download GPU data to CPU, or use texture or gpuBuffer property to access the GPU data directly."
Reproduction
worker.js

pipeline(this.task, this.model, {
  dtype: {
    encoder_model: 'fp32',
    decoder_model_merged: 'q4', // or 'fp32' ('fp16' is broken)
  },
  device: 'webgpu',
  progress_callback,
});
app.vue
new SharedWorker(new URL('./worker.js', import.meta.url), {
  type: 'module',
});
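For what it's worth, the error message itself hints at a possible workaround on the ORT side: when a tensor's data lives on the GPU, ort's Tensor.getData() downloads it to the CPU. A hedged sketch of that idea (the output name is hypothetical, and this is not verified against the SharedWorker setup):

// sketch: if a tensor reports a non-cpu location, download it before reading on the CPU
const output = results.last_hidden_state; // hypothetical output name
const data = output.location === 'cpu'
  ? output.data
  : await output.getData(); // copies the gpu-buffer contents into a typed array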