huggingface / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

Excessive Memory consumption. #759

Open iSuslov opened 5 months ago

iSuslov commented 5 months ago

Description

For the latest Phi-3 demo, the Chrome browser uses 5.31 GB for the Renderer process and 4.16 GB for the GPU process, totaling almost 10 GB while running a ~2 GB model.

After the first inference, memory consumption jumps above 12 GB. That can't be normal.

Reproduction

  1. Open the demo page https://huggingface.co/spaces/Xenova/experimental-phi3-webgpu
  2. Click on "Load model"
  3. Check memory consumption (a scripted version is sketched below).
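For reference, a minimal scripted version of these steps, assuming a Transformers.js build with WebGPU support (the package name, model ID, and `device` option are assumptions inferred from the demo and the file paths quoted later in this thread). Note that `performance.memory` is Chrome-only, non-standard, and only measures the JS heap of the Renderer process, not WebGPU buffers in the GPU process:

```js
// Hedged sketch: load the demo's model and log Chrome's JS-heap usage.
import { pipeline } from '@huggingface/transformers';

function logHeap(label) {
  // performance.memory is Chrome-only; it excludes GPU-process memory.
  const m = performance.memory;
  if (m) console.log(label, (m.usedJSHeapSize / 2 ** 20).toFixed(0), 'MiB JS heap');
}

logHeap('before load');
const generator = await pipeline(
  'text-generation',
  'Xenova/Phi-3-mini-4k-instruct_fp16', // model ID assumed from the file paths below
  { device: 'webgpu' },
);
logHeap('after load');

await generator('What is WebGPU?', { max_new_tokens: 16 });
logHeap('after first inference');
```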
xenova commented 5 months ago

5.31 GB for the Renderer process

Can you confirm you do not have any other tabs open? I can't see how this could be related to the application (not much is being rendered).

4.16 GB for the GPU process

This makes sense since it's (most likely) running in fp16 mode.
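A back-of-envelope check makes the magnitude plausible: if the q4 weights are (even partially) dequantized to fp16 or fp32 at load time, multiple GB of GPU-process memory follow directly from the parameter count. The ~3.8B figure below is an assumption taken from the Phi-3-mini model card; KV cache, activations, and staging buffers come on top:

```js
// Rough weight-memory estimate per dtype (assumes ~3.8B parameters).
const params = 3.8e9;
const bytesPerParam = { q4: 0.5, fp16: 2, fp32: 4 };
for (const [dtype, bytes] of Object.entries(bytesPerParam)) {
  console.log(dtype, ((params * bytes) / 2 ** 30).toFixed(1), 'GiB');
}
// Prints roughly: q4 1.8 GiB, fp16 7.1 GiB, fp32 14.2 GiB
```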

iSuslov commented 5 months ago

@xenova I can confirm:

One empty tab open:

[Screenshot: Chrome memory usage with one empty tab, 2024-05-12 9:52:24 AM]

Navigated to https://huggingface.co/spaces/Xenova/experimental-phi3-webgpu and loaded the model:

[Screenshot: Chrome memory usage after loading the model, 2024-05-12 9:53:13 AM]

This makes sense since it's (most likely) running in fp16 mode.

Can we make it run at lower precision, given that we're running q4 quantization? (We can't; ONNX doesn't support q4 yet.)

To avoid any confusion, this is the downloaded model:

  - /Xenova/Phi-3-mini-4k-instruct_fp16/resolve/main/onnx/model_q4.onnx (838 MB)
  - /Xenova/Phi-3-mini-4k-instruct_fp16/resolve/main/onnx/model_q4.onnx_data (1454 MB)
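Those sizes can be double-checked without re-downloading, e.g. with HEAD requests against the same paths (a sketch; the Content-Length may be served from a CDN after a redirect):

```js
// Sketch: confirm the on-disk sizes of the q4 files via HTTP HEAD requests.
const base = 'https://huggingface.co/Xenova/Phi-3-mini-4k-instruct_fp16/resolve/main/onnx/';
for (const file of ['model_q4.onnx', 'model_q4.onnx_data']) {
  const res = await fetch(base + file, { method: 'HEAD' });
  console.log(file, res.headers.get('content-length'), 'bytes');
}
```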

flatsiedatsie commented 5 months ago

WebGPU currently only supports 16- and 32-bit floats.
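This can be checked directly in the browser console: f32 is always available, while 16-bit shader arithmetic is an optional adapter feature ('shader-f16'), and there is no 4-bit dtype on the GPU side. A minimal sketch:

```js
// Sketch: query the WebGPU adapter for optional 16-bit shader support.
const adapter = await navigator.gpu?.requestAdapter();
if (!adapter) {
  console.log('WebGPU is not available in this browser');
} else {
  console.log('shader-f16 supported:', adapter.features.has('shader-f16'));
}
```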

everythinginjs commented 5 months ago

same here