michaelfeil / infinity

Infinity is a high-throughput, low-latency REST API for serving text-embeddings, reranking models and clip
https://michaelfeil.github.io/infinity/
MIT License

Hanging after first embedding generated on MPS #206

Closed: semoal closed this issue 4 months ago

semoal commented 5 months ago

System Info

MacOS running with torch or optimum on both happens, with small or big batch size. Model: jinaai/jina-embeddings-v2-base-es

```
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/fastapi/applications.py", line 292, in __call__
    await super().__call__(scope, receive, send)
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 167, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sergiomoreno/Projects/infinity-emb/infinity/server.py", line 169, in _embeddings
    embedding, usage = await models.embedding_model.embed(data.input)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/infinity_emb/engine.py", line 123, in embed
    embeddings, usage = await self._batch_handler.embed(sentences)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/infinity_emb/inference/batch_handler.py", line 116, in embed
    embeddings, usage = await self._schedule(input_sentences)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/infinity_emb/inference/batch_handler.py", line 201, in _schedule
    result = await asyncio.gather(
             ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sergiomoreno/Projects/infinity-emb/.venv/lib/python3.11/site-packages/infinity_emb/inference/queue.py", line 92, in wait_for_response
    return await item.future
           ^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
INFO:     ::1:56858 - "POST /embeddings HTTP/1.1" 500 Internal Server Error
```

Reproduction

  1. Run this model
  2. The process is probably close to running out of memory (not certain, but this could be the cause)
  3. Try to run another query

I can privately share a repro project where this is easily reproducible.

Expected behavior

The server should not crash.

michaelfeil commented 5 months ago

Looks like the inference crashes (e.g. a CTRL-C event). How do you run the model?

From a clean installation on your macOS machine, what are the precise steps to replicate this? Are you using the Docker container?

I assume you are stopping/starting the async event loop incorrectly? Are you using `astart`/`astop` similar to the server in `server.py`?
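(The `astart`/`astop` pairing referred to here can be sketched with a hypothetical engine class, not the real infinity_emb API, just to show the intended lifecycle: start before the first `embed`, stop on the same event loop when done.)

```python
import asyncio


class DummyEngine:
    """Stand-in for an engine with an astart/astop lifecycle (hypothetical)."""

    def __init__(self):
        self.running = False

    async def astart(self):
        self.running = True   # e.g. spawn batch-handler background tasks

    async def astop(self):
        self.running = False  # e.g. cancel background tasks, free the model

    async def embed(self, sentences):
        if not self.running:
            raise RuntimeError("call astart() before embed()")
        return [[0.0] * 3 for _ in sentences]  # fake embeddings


async def main():
    engine = DummyEngine()
    await engine.astart()
    try:
        vecs = await engine.embed(["hola", "mundo"])
    finally:
        await engine.astop()  # always stop, even if embed() raised
    return len(vecs)


print(asyncio.run(main()))  # 2
```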

semoal commented 4 months ago

Yes, stopping the event loop while it was generating caused the server to hang until I force-killed the terminal. Fixed by catching the cancellation in the lifespan context manager:

```python
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    instrumentator.expose(app)
    # Load the ML model
    await models.astart()
    logger.info(docs.startup_message(host="localhost", port="8080", prefix=""))
    try:
        yield
    except asyncio.exceptions.CancelledError:
        # swallow the cancellation so the teardown below still runs
        pass
    # Clean up the ML models and release the resources
    await models.ateardown()
```
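Why this works: when the server is interrupted mid-request, `asyncio.CancelledError` is raised at the `yield` inside the lifespan generator, and without the `try`/`except` the teardown line is never reached. A minimal stand-alone sketch of the same pattern (dummy names, no FastAPI) shows the cleanup running despite cancellation:

```python
import asyncio
from contextlib import asynccontextmanager

teardown_ran = []


@asynccontextmanager
async def lifespan():
    # startup work would go here (stands in for models.astart())
    try:
        yield
    except asyncio.CancelledError:
        # swallow the cancellation so the cleanup below still runs
        pass
    teardown_ran.append(True)  # stands in for models.ateardown()


async def serve():
    async with lifespan():
        await asyncio.sleep(3600)  # simulate a long-running server


async def main():
    task = asyncio.create_task(serve())
    await asyncio.sleep(0)  # let serve() start and suspend
    task.cancel()           # simulate CTRL-C / shutdown
    try:
        await task
    except asyncio.CancelledError:
        pass
    return bool(teardown_ran)


print(asyncio.run(main()))  # True: teardown ran despite the cancellation
```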
michaelfeil commented 4 months ago

Awesome