replicate / cog

Containers for machine learning
https://cog.run
Apache License 2.0

Prediction failed for an unknown reason #1717

Closed. tzktok closed this issue 3 months ago.

tzktok commented 3 months ago

I have trained a YOLOv8 model with Cog, but prediction fails with the error below. @zeke @mattt

Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
{"logger": "cog.server.runner", "timestamp": "2024-06-06T07:43:10.072634Z", "exception": "Traceback (most recent call last):\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/cog/server/runner.py\", line 141, 
in handle_error\n    raise error\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/multiprocessing/pool.py\", line 125, in worker\n    result = (True, func(*args, **kwds))\n  File \"/root/.pyenv/versions/3.10.14/lib/python3
.10/site-packages/cog/server/runner.py\", line 371, in predict\n    return _predict(\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/cog/server/runner.py\", line 413, in _predict\n    for event in worker.pre
dict(input_dict, poll=0.1):\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/cog/server/worker.py\", line 139, in _wait\n    raise FatalWorkerException(\ncog.server.exceptions.FatalWorkerException: Prediction
 failed for an unknown reason. It might have run out of memory? (exitcode -11)", "severity": "ERROR", "message": "caught exception while running prediction"}
{"prediction_id": null, "logger": "uvicorn.error", "timestamp": "2024-06-06T07:43:10.075073Z", "exception": "Traceback (most recent call last):\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/uvicorn/protoco
ls/http/httptools_impl.py\", line 399, in run_asgi\n    result = await app(  # type: ignore[func-returns-value]\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py\", line 70,
 in __call__\n    return await self.app(scope, receive, send)\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/fastapi/applications.py\", line 284, in __call__\n    await super().__call__(scope, receive, send
)\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/starlette/applications.py\", line 122, in __call__\n    await self.middleware_stack(scope, receive, send)\n  File \"/root/.pyenv/versions/3.10.14/lib/python3
.10/site-packages/starlette/middleware/errors.py\", line 184, in __call__\n    raise exc\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/starlette/middleware/errors.py\", line 162, in __call__\n    await sel
f.app(scope, receive, _send)\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/starlette/middleware/exceptions.py\", line 79, in __call__\n    raise exc\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/s
ite-packages/starlette/middleware/exceptions.py\", line 68, in __call__\n    await self.app(scope, receive, sender)\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py\", lin
e 20, in __call__\n    raise e\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py\", line 17, in __call__\n    await self.app(scope, receive, send)\n  File \"/root/.pyenv/ve
rsions/3.10.14/lib/python3.10/site-packages/starlette/routing.py\", line 718, in __call__\n    await route.handle(scope, receive, send)\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/starlette/routing.py\",
 line 276, in handle\n    await self.app(scope, receive, send)\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/starlette/routing.py\", line 66, in app\n    response = await func(request)\n  File \"/root/.pye
nv/versions/3.10.14/lib/python3.10/site-packages/fastapi/routing.py\", line 241, in app\n    raw_response = await run_endpoint_function(\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/fastapi/routing.py\", 
line 167, in run_endpoint_function\n    return await dependant.call(**values)\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/cog/server/http.py\", line 292, in predict\n    return _predict(\n  File \"/root/
.pyenv/versions/3.10.14/lib/python3.10/site-packages/cog/server/http.py\", line 370, in _predict\n    response = PredictionResponse(**async_result.get().dict())\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/multiprocess
ing/pool.py\", line 774, in get\n    raise self._value\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/cog/server/runner.py\", line 141, in handle_error\n    raise error\n  File \"/root/.pyenv/versions/3.10.
14/lib/python3.10/multiprocessing/pool.py\", line 125, in worker\n    result = (True, func(*args, **kwds))\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/cog/server/runner.py\", line 371, in predict\n    re
turn _predict(\n  File \"/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/cog/server/runner.py\", line 413, in _predict\n    for event in worker.predict(input_dict, poll=0.1):\n  File \"/root/.pyenv/versions/3.10.14/li
b/python3.10/site-packages/cog/server/worker.py\", line 139, in _wait\n    raise FatalWorkerException(\ncog.server.exceptions.FatalWorkerException: Prediction failed for an unknown reason. It might have run out of memory? (exitc
ode -11)", "severity": "ERROR", "message": "Exception in ASGI application\n"}

My CUDA details are below:

Ultralytics YOLOv8.2.28 🚀 Python-3.10.14 torch-2.3.0+cu118 CUDA:0 (NVIDIA GeForce RTX 4090, 24564MiB)
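
That line is the output of the Ultralytics environment check, which can be reproduced inside the container with roughly this snippet (assuming ultralytics.checks() prints the same summary in this version):

import ultralytics

# Prints the Ultralytics / Python / torch / CUDA / GPU summary shown above
ultralytics.checks()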

My cog.yaml file:

build:
  gpu: True
  cuda: "11.8"
  python_version: "3.10"
  system_packages:
    - "cmake"
    - "ffmpeg"
    - "libsm6"
    - "libxext6"
    - "libgl1-mesa-glx"
    - "libglib2.0-0"
  python_packages:
    - "ultralytics==8.2.28"
    - "torch==2.3.0"
    - "requests==2.32.3"
    - "tqdm==4.66.4"
predict: "predict.py:Predictor"
mattt commented 3 months ago

Hi @tzktok. From those error logs:

Prediction failed for an unknown reason. It might have run out of memory?

It sounds like your app crashed due to an OOM error. Please check that your GPU has enough VRAM to run your model.
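
One quick way to check from inside the container is a snippet along these lines (torch is already in your image; mem_get_info reports free/total bytes on the current CUDA device):

import torch

if torch.cuda.is_available():
    # (free, total) VRAM in bytes on the default CUDA device
    free, total = torch.cuda.mem_get_info()
    print(f"GPU memory: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")
else:
    print("CUDA is not available inside the container")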

This doesn't seem to be an issue with Cog, so I'm going to close this for now. Happy to reopen if you think there's something we could be doing differently here.