Closed: @semoal closed this issue 7 months ago.
@semoal Seems to be an error with `jinaai/jina-bert-v2-qk-devlin-norm-1e-2` in combination with `torch.compile(model, dynamic=True)`.
For now, you can work around it with:

```shell
export INFINITY_DISABLE_COMPILE=True
```
Confirmed that disabling compile works. Thanks for the quick feedback, Michael. Created an issue on the Jina HF board: https://huggingface.co/jinaai/jina-embeddings-v2-base-es/discussions/6
thanks @michaelfeil and @semoal , we're looking into it!
@bwanglzu A first pointer: torch inductor does not seem to like the pythonic implementation of `start = 2 ** (-(2 ** -(math.log2(n) - 3)))` (https://huggingface.co/jinaai/jina-bert-implementation/blob/f3ec4cf7de7e561007f27c9efc7148b0bd713f81/modeling_bert.py#L720). Perhaps it can be torchified without a performance sacrifice. Running without `dynamic=True` might also be an idea. torch.compile gives a decent +15% throughput, so it would be a shame to drop it.
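For illustration only, a sketch of what "torchifying" that expression could look like: the same formula expressed with `torch.exp2`/`torch.log2` on a tensor, so there are no Python-float power ops for the inductor to trace around. This is an assumption about a possible fix, not the change that actually landed in the Jina repo:

```python
import math

import torch


def start_slope_python(n: int) -> float:
    # the original pythonic implementation from modeling_bert.py
    return 2 ** (-(2 ** -(math.log2(n) - 3)))


def start_slope_torch(n: int) -> torch.Tensor:
    # the same formula written with torch ops on a tensor,
    # which torch.compile can trace as part of the graph
    n_t = torch.tensor(float(n))
    return torch.exp2(-torch.exp2(-(torch.log2(n_t) - 3.0)))


# sanity check: both variants agree for power-of-two head counts
for n in (8, 16, 32):
    assert abs(start_slope_python(n) - start_slope_torch(n).item()) < 1e-6
```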
@semoal I might start to consolidate the inference engine options and introduce additional arguments, so that you don't have to deal with env variables (but CLI arguments instead). Would that be helpful for you?
but you're right, let me rewrite with torch format :)
hi @semoal and @michaelfeil, the error should be fixed now, please give it a try. (Note: please remove the Hugging Face cache in `~/.cache/huggingface/hub` and `~/.cache/huggingface/modules`.)
my script:

```python
import asyncio

from infinity_emb import AsyncEmbeddingEngine, EngineArgs
from infinity_emb.primitives import Device
from infinity_emb.transformer.utils import InferenceEngine

sentences = ["Embed this is sentence via Infinity.", "Paris is in France."]

embeddings_args = EngineArgs(
    model_name_or_path="jinaai/jina-embeddings-v2-base-es",
    engine=InferenceEngine.torch,
    device=Device.auto,
    trust_remote_code=True,
)

engine = AsyncEmbeddingEngine.from_args(embeddings_args)


async def main():
    async with engine:  # engine starts with engine.astart()
        embeddings, usage = await engine.embed(sentences=sentences)
        print(embeddings)


asyncio.run(main())
```
Just tried it: it no longer crashes when receiving requests and correctly generates the embeddings, but I see a warning/error when initializing the model:
```
INFO     2024-02-27 17:13:25,814 infinity_emb INFO: Adding optimizations via Huggingface optimum. Disable by setting the env var `INFINITY_DISABLE_OPTIMUM`                acceleration.py:20
ERROR    2024-02-27 17:13:25,818 infinity_emb ERROR: BetterTransformer failed with The transformation of the model JinaBertModel to BetterTransformer failed while it should not. Please fill a bug report or open a PR to support this model at https://github.com/huggingface/optimum/   acceleration.py:27
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/infinity_emb/transformer/acceleration.py", line 25, in to_bettertransformer
    model = BetterTransformer.transform(model)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/optimum/bettertransformer/transformation.py", line 270, in transform
    set_last_layer(model_fast)
  File "/usr/local/lib/python3.10/dist-packages/optimum/bettertransformer/transformation.py", line 166, in set_last_layer
    raise Exception(
Exception: The transformation of the model JinaBertModel to BetterTransformer failed while it should not. Please fill a bug report or open a PR to support this model at https://github.com/huggingface/optimum/
INFO     2024-02-27 17:13:25,825 infinity_emb INFO: Switching to half() precision (cuda: fp16). Disable by the setting the env var `INFINITY_DISABLE_HALF`                 sentence_transformer.py:67
INFO     2024-02-27 17:13:25,852 infinity_emb INFO: using torch.compile()                                                                                                 sentence_transformer.py:73
INFO     2024-02-27 17:13:27,821 infinity_emb INFO: creating batching engine                                                                                              batch_handler.py:385
INFO     2024-02-27 17:13:27,823 infinity_emb INFO: ready to batch requests.                                                                                              batch_handler.py:242
INFO     2024-02-27 17:13:27,825 infinity_emb INFO:                                                                                                                       server.py:49
         ♾️ Infinity - Embedding Inference Server
         MIT License; Copyright (c) 2023 Michael Feil
         Version 0.0.25
         Open the Docs via Swagger UI: http://localhost:8000/docs
         Access model via 'GET': curl http://localhost:8000/models
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
@semoal Can you confirm this works when setting, via `os.environ`: `INFINITY_DISABLE_OPTIMUM="TRUE"` (for BetterTransformer) and `INFINITY_DISABLE_COMPILE="TRUE"`?
NOTE: for releases < 0.0.40 the interface for environment variables has changed.
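For reference, a minimal sketch of setting both flags from Python (the variable names are taken from this thread and apply to the version discussed here; that they must be set before infinity_emb is imported/started is my assumption about when they are read):

```python
import os

# disable the optimum/BetterTransformer path and torch.compile;
# set these before importing or starting infinity_emb
os.environ["INFINITY_DISABLE_OPTIMUM"] = "TRUE"
os.environ["INFINITY_DISABLE_COMPILE"] = "TRUE"

print(os.environ["INFINITY_DISABLE_OPTIMUM"], os.environ["INFINITY_DISABLE_COMPILE"])
```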
Yes, disabling works perfectly. I even noticed one curious thing: with the optimizations disabled, the first request to /embeddings is much faster (around 10-20s less; probably because I'm not warming up the model. Should I?) than with the optimizations enabled. This was all tested on a fly.io machine with an L40.
Yes, it's the JIT nature of torch.compile; please enable the warm-up flag for that.
We're trying to run this Jina embedding model with Infinity:
This is our nvidia-smi:
Crash:
Any suggestions, @michaelfeil?