michaelfeil / infinity

Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of text-embedding models and frameworks.
https://michaelfeil.github.io/infinity/
MIT License

Torch + Cuda + Bert crashes abruptly on startup #115

Closed by semoal 3 months ago

semoal commented 4 months ago

We're trying to run this jina embed model with Infinity:

embeddings_args = EngineArgs(
    model_name_or_path="jinaai/jina-embeddings-v2-base-es",
    engine=InferenceEngine.torch,
    device=Device.auto,
    trust_remote_code=True,
)

This is our nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off | 00000000:00:06.0 Off |                    0 |
| N/A   31C    P0              70W / 500W |   1885MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       439      C   /usr/bin/python3.10                        1872MiB |
+---------------------------------------------------------------------------------------+

Crash:


hidden_states = torch._nested_tensor_from_mask(hidden_states, ~attention_mask)
ERROR 2024-02-26 20:42:13,177 infinity_emb ERROR: Failed running call_function <built-in function log2>(*(s0**3,), **{}): [batch_handler.py:320]
must be real number, not SymFloat

from user code:
  File "/data/hf/modules/transformers_modules/jinaai/jina-bert-v2-qk-devlin-norm-1e-2/a0ba9b2e7e2613a74d8cba43f2bbd420699db17c/modeling_bert.py", line 728, in resume_in__get_alibi_head_slopes
    get_slopes_power_of_2(closest_power_of_2)
  File "/data/hf/modules/transformers_modules/jinaai/jina-bert-v2-qk-devlin-norm-1e-2/a0ba9b2e7e2613a74d8cba43f2bbd420699db17c/modeling_bert.py", line 715, in get_slopes_power_of_2
    start = 2 ** (-(2 ** -(math.log2(n) - 3)))

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1

Any suggestions, @michaelfeil?

michaelfeil commented 4 months ago

@semoal This seems to be an error in jinaai/jina-bert-v2-qk-devlin-norm-1e-2 in combination with torch.compile(model, dynamic=True).

Two things you should do now:

  1. Set the environment variable: export INFINITY_DISABLE_COMPILE=True
  2. Open an issue with Jina, letting them know that torch.compile fails for their custom modeling code.
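
For setups that build the engine from Python rather than a shell, the same switch can be set in-process. This is a minimal sketch, assuming the variable is read when infinity_emb is imported or the engine is created, so it has to be set before that point. The helper name disable_torch_compile is hypothetical.

```python
import os


def disable_torch_compile(env=os.environ):
    """Set the flag named in this thread (assumed to be read at import/startup)."""
    # "True" mirrors `export INFINITY_DISABLE_COMPILE=True` from the shell.
    env["INFINITY_DISABLE_COMPILE"] = "True"
    return env


disable_torch_compile()
# Import infinity_emb only after this point so that it sees the flag.
```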

semoal commented 4 months ago

Confirmed that disabling compile works, Michael. Thanks for the quick feedback. I created an issue on the Jina HF board: https://huggingface.co/jinaai/jina-embeddings-v2-base-es/discussions/6

bwanglzu commented 4 months ago

Thanks @michaelfeil and @semoal, we're looking into it!

michaelfeil commented 4 months ago

@bwanglzu A first pointer: torch inductor may not like the pure-Python implementation of start = 2 ** (-(2 ** -(math.log2(n) - 3))). Perhaps it can be rewritten with torch ops without sacrificing performance: https://huggingface.co/jinaai/jina-bert-implementation/blob/f3ec4cf7de7e561007f27c9efc7148b0bd713f81/modeling_bert.py#L720 - compiling without dynamic=True might also be an idea. torch.compile gives a decent +15% throughput, so it would be a shame to drop it.
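
One possible direction, sketched here in plain Python and not necessarily what Jina ended up shipping: since 2 ** -(log2(n) - 3) simplifies to 8 / n, the start value can be written as 2 ** (-8 / n), which avoids the math.log2 call entirely. Whether this form traces cleanly under torch.compile(dynamic=True) is an untested assumption, and both function names below are hypothetical.

```python
import math


def start_with_log2(n):
    """The expression quoted from modeling_bert.py, using math.log2."""
    return 2 ** (-(2 ** -(math.log2(n) - 3)))


def start_closed_form(n):
    """Same value without log2, because 2 ** -(log2(n) - 3) == 8 / n."""
    return 2.0 ** (-8.0 / n)


# Compare both forms for a few head counts, including non-powers of two.
for n in (1, 2, 3, 4, 8, 12, 16):
    assert math.isclose(start_with_log2(n), start_closed_form(n), rel_tol=1e-9)

print(start_closed_form(8))  # 0.5
```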

@semoal I might start consolidating the inference engine options and introduce additional arguments, so that you don't have to deal with environment variables and can use CLI arguments instead. Would that be helpful for you?

bwanglzu commented 4 months ago

But you're right, let me rewrite it in torch format :)

bwanglzu commented 4 months ago

Hi @semoal and @michaelfeil, the error should be fixed now, please give it a try. (Note: please remove the Hugging Face cache in ~/.cache/huggingface/hub and ~/.cache/huggingface/modules.)
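
The cache cleanup can be scripted. This is a minimal sketch, assuming the default cache root ~/.cache/huggingface; setups that relocate the cache (the /data/hf/modules path in the traceback earlier in this thread suggests one) would pass that root instead. The function name clear_hf_cache is hypothetical, and the demo runs on a throwaway directory so nothing real is deleted.

```python
import shutil
import tempfile
from pathlib import Path


def clear_hf_cache(root=None):
    """Delete the 'hub' and 'modules' subfolders of a Hugging Face cache root.

    Returns the list of directories that were actually removed.
    """
    if root is None:
        root = Path("~/.cache/huggingface").expanduser()
    removed = []
    for name in ("hub", "modules"):
        target = Path(root) / name
        if target.is_dir():
            shutil.rmtree(target)  # weights and remote code are re-downloaded later
            removed.append(target)
    return removed


# Demo on a temporary directory instead of the real cache.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "hub").mkdir()
    (Path(tmp) / "modules").mkdir()
    print([p.name for p in clear_hf_cache(tmp)])  # ['hub', 'modules']
```

Calling clear_hf_cache() with no argument targets the default location, so it is worth double-checking the path before running it on a machine with many cached models.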

My script:

import asyncio
from infinity_emb import AsyncEmbeddingEngine, EngineArgs
from infinity_emb.transformer.utils import InferenceEngine
from infinity_emb.primitives import Device

sentences = ["Embed this sentence via Infinity.", "Paris is in France."]
embeddings_args = EngineArgs(
    model_name_or_path="jinaai/jina-embeddings-v2-base-es",
    engine=InferenceEngine.torch,
    device=Device.auto,
    trust_remote_code=True,
)
engine = AsyncEmbeddingEngine.from_args(embeddings_args)

async def main():
    # Entering the context starts the engine (engine.astart()).
    async with engine:
        embeddings, usage = await engine.embed(sentences=sentences)
        print(embeddings)

asyncio.run(main())

semoal commented 4 months ago

Just tried it. It no longer crashes when receiving requests and generates the embeddings correctly, but I see a warning/error when the model is initialized:

2024-02-27T17:13:25.816 app[17816011be4689] ord [info] INFO 2024-02-27 17:13:25,814 infinity_emb INFO: Adding optimizations via Huggingface optimum. Disable by setting the env var `INFINITY_DISABLE_OPTIMUM` [acceleration.py:20]
2024-02-27T17:13:25.825 app[17816011be4689] ord [info] ERROR 2024-02-27 17:13:25,818 infinity_emb ERROR: BetterTransformer failed with The transformation of the model JinaBertModel to BetterTransformer failed while it should not. Please fill a bug report or open a PR to support this model at https://github.com/huggingface/optimum/ [acceleration.py:27]
2024-02-27T17:13:25.825 app[17816011be4689] ord [info] Traceback (most recent call last):
2024-02-27T17:13:25.825 app[17816011be4689] ord [info]   File "/usr/local/lib/python3.10/dist-packages/infinity_emb/transformer/acceleration.py", line 25, in to_bettertransformer
2024-02-27T17:13:25.825 app[17816011be4689] ord [info]     model = BetterTransformer.transform(model)
2024-02-27T17:13:25.825 app[17816011be4689] ord [info]   File "/usr/lib/python3.10/contextlib.py", line 79, in inner
2024-02-27T17:13:25.825 app[17816011be4689] ord [info]     return func(*args, **kwds)
2024-02-27T17:13:25.825 app[17816011be4689] ord [info]   File "/usr/local/lib/python3.10/dist-packages/optimum/bettertransformer/transformation.py", line 270, in transform
2024-02-27T17:13:25.825 app[17816011be4689] ord [info]     set_last_layer(model_fast)
2024-02-27T17:13:25.825 app[17816011be4689] ord [info]   File "/usr/local/lib/python3.10/dist-packages/optimum/bettertransformer/transformation.py", line 166, in set_last_layer
2024-02-27T17:13:25.825 app[17816011be4689] ord [info]     raise Exception(
2024-02-27T17:13:25.825 app[17816011be4689] ord [info] Exception: The transformation of the model JinaBertModel to BetterTransformer failed while it should not. Please fill a bug report or open a PR to support this model at https://github.com/huggingface/optimum/
2024-02-27T17:13:25.827 app[17816011be4689] ord [info] INFO 2024-02-27 17:13:25,825 infinity_emb INFO: Switching to half() precision (cuda: fp16). Disable by the setting the env var `INFINITY_DISABLE_HALF` [sentence_transformer.py:67]
2024-02-27T17:13:25.853 app[17816011be4689] ord [info] INFO 2024-02-27 17:13:25,852 infinity_emb INFO: using torch.compile() [sentence_transformer.py:73]
2024-02-27T17:13:27.823 app[17816011be4689] ord [info] INFO 2024-02-27 17:13:27,821 infinity_emb INFO: creating batching engine [batch_handler.py:385]
2024-02-27T17:13:27.825 app[17816011be4689] ord [info] INFO 2024-02-27 17:13:27,823 infinity_emb INFO: ready to batch requests. [batch_handler.py:242]
2024-02-27T17:13:27.828 app[17816011be4689] ord [info] INFO 2024-02-27 17:13:27,825 infinity_emb INFO: [server.py:49]
2024-02-27T17:13:27.828 app[17816011be4689] ord [info] ♾️ Infinity - Embedding Inference Server
2024-02-27T17:13:27.828 app[17816011be4689] ord [info] MIT License; Copyright (c) 2023 Michael Feil
2024-02-27T17:13:27.828 app[17816011be4689] ord [info] Version 0.0.25
2024-02-27T17:13:27.828 app[17816011be4689] ord [info] Open the Docs via Swagger UI:
2024-02-27T17:13:27.828 app[17816011be4689] ord [info] http://localhost:8000/docs
2024-02-27T17:13:27.828 app[17816011be4689] ord [info] Access model via 'GET':
2024-02-27T17:13:27.828 app[17816011be4689] ord [info] curl http://localhost:8000/models
2024-02-27T17:13:27.829 app[17816011be4689] ord [info] INFO: Application startup complete.
2024-02-27T17:13:27.830 app[17816011be4689] ord [info] INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

michaelfeil commented 4 months ago

@semoal Can you confirm that it works when you set, via os.environ, INFINITY_DISABLE_OPTIMUM="TRUE" (for BetterTransformer) and INFINITY_DISABLE_COMPILE="TRUE"?

semoal commented 4 months ago

Yes, disabling both works perfectly. I also noticed one curious thing: with the optimizations disabled, the first request to /embeddings is much faster (roughly 10-20 s less) than with the optimizations enabled. That is probably because I'm not warming up the model; should I? All of this was tested on a fly.io machine with an L40.

michaelfeil commented 4 months ago

Yes, that's the JIT nature of torch.compile. Please enable the warmup flag for that.
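
Why a warmup helps can be illustrated without torch. The sketch below is a toy stand-in, not Infinity's or torch.compile's implementation: a hypothetical LazyCompiled wrapper does its one-time setup on the first call, much as a JIT compiles on first use, and a warmup call at startup moves that cost away from the first real request.

```python
class LazyCompiled:
    """Toy model of a JIT: the one-time setup runs on the first call only."""

    def __init__(self, fn):
        self.fn = fn
        self.compiled = False
        self.compile_count = 0

    def __call__(self, *args):
        if not self.compiled:
            # Stand-in for compilation; in a real JIT this is the slow part.
            self.compile_count += 1
            self.compiled = True
        return self.fn(*args)


def embed(text):
    # Placeholder "model": returns the text length instead of a vector.
    return len(text)


model = LazyCompiled(embed)
model("warmup")                       # startup: the one-time cost is paid here
first = model("Paris is in France.")  # first real request: already compiled
```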