YoungjaeDev opened this issue 5 months ago
Hi,
Thanks for raising the issue. You are likely right about the source of the bug. I will take a look next week to see if downgrading helps; if it does, we will fix the problem, provided the fix does not create another.
@PawelPeczek-Roboflow
I think the version of the transformers
package pinned in the GPU Dockerfile of roboflow inference is the offending one. I'd like to lower it and give it a try, but can you check that first?
OK, I checked that this fix works on my end: https://github.com/roboflow/inference/pull/363
We need to ship it with the next release, but for the time being you can build the Docker image on your end:
git clone git@github.com:roboflow/inference.git
cd inference
docker build --build-arg="TARGETPLATFORM=linux/amd64" -t roboflow/roboflow-inference-server-gpu:dev -f docker/dockerfiles/Dockerfile.onnx.gpu .
To run the server:
docker run --gpus all roboflow/roboflow-inference-server-gpu:dev
I have the same issue as @YoungjaeDev and @PawelPeczek-Roboflow suggestion above got me the same output, but with an additional error about WithFixedSizeCache.
bchip@brad-sff:~/inference$ docker run -p 9001:9001 --gpus all roboflow/roboflow-inference-server-gpu:dev
INFO: Started server process [7]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:9001 (Press CTRL+C to quit)
A new version of the following files was downloaded from https://huggingface.co/THUDM/cogvlm-chat-hf:
- configuration_cogvlm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/THUDM/cogvlm-chat-hf:
- visual.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/THUDM/cogvlm-chat-hf:
- modeling_cogvlm.py
- visual.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Downloading shards: 100%|██████████| 8/8 [06:40<00:00, 50.02s/it]
Loading checkpoint shards: 0%| | 0/8 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/app/inference/core/interfaces/http/http_api.py", line 179, in wrapped_route
return await route(*args, **kwargs)
File "/app/inference/core/interfaces/http/http_api.py", line 1266, in cog_vlm
cog_model_id = load_cogvlm_model(inference_request, api_key=api_key)
File "/app/inference/core/interfaces/http/http_api.py", line 476, in load_core_model
self.model_manager.add_model(core_model_id, inference_request.api_key)
File "/app/inference/core/managers/decorators/fixed_size_cache.py", line 61, in add_model
raise error
File "/app/inference/core/managers/decorators/fixed_size_cache.py", line 55, in add_model
return super().add_model(model_id, api_key, model_id_alias=model_id_alias)
File "/app/inference/core/managers/decorators/base.py", line 62, in add_model
self.model_manager.add_model(model_id, api_key, model_id_alias=model_id_alias)
File "/app/inference/core/managers/base.py", line 61, in add_model
model = self.model_registry.get_model(resolved_identifier, api_key)(
File "/app/inference/models/cogvlm/cogvlm.py", line 39, in __init__
self.model = AutoModelForCausalLM.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 556, in from_pretrained
return model_class.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
) = cls._load_pretrained_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3926, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 802, in _load_state_dict_into_meta_model
or (not hf_quantizer.check_quantized_param(model, param, param_name, state_dict))
File "/usr/local/lib/python3.10/dist-packages/transformers/quantizers/quantizer_bnb_4bit.py", line 124, in check_quantized_param
if isinstance(module._parameters[tensor_name], bnb.nn.Params4bit):
KeyError: 'inv_freq'
INFO: 172.17.0.1:59332 - "POST /llm/cogvlm HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
+ Exception Group Traceback (most recent call last):
| File "/usr/local/lib/python3.10/dist-packages/starlette/_utils.py", line 87, in collapse_excgroups
| yield
| File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 190, in __call__
| async with anyio.create_task_group() as task_group:
| File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
| raise BaseExceptionGroup(
| exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 435, in run_asgi
| result = await app( # type: ignore[func-returns-value]
| File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
| return await self.app(scope, receive, send)
| File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
| await super().__call__(scope, receive, send)
| File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
| raise exc
| File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
| await self.app(scope, receive, _send)
| File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 189, in __call__
| with collapse_excgroups():
| File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
| self.gen.throw(typ, value, traceback)
| File "/usr/local/lib/python3.10/dist-packages/starlette/_utils.py", line 93, in collapse_excgroups
| raise exc
| File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 191, in __call__
| response = await self.dispatch_func(request, call_next)
| File "/app/inference/core/interfaces/http/http_api.py", line 403, in count_errors
| self.model_manager.num_errors += 1
| AttributeError: 'WithFixedSizeCache' object has no attribute 'num_errors'
+------------------------------------
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 435, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 189, in __call__
with collapse_excgroups():
File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/usr/local/lib/python3.10/dist-packages/starlette/_utils.py", line 93, in collapse_excgroups
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 191, in __call__
response = await self.dispatch_func(request, call_next)
File "/app/inference/core/interfaces/http/http_api.py", line 403, in count_errors
self.model_manager.num_errors += 1
AttributeError: 'WithFixedSizeCache' object has no attribute 'num_errors'
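The AttributeError in the middleware above occurs because the `count_errors` hook assumes `model_manager` exposes a `num_errors` attribute, while the `WithFixedSizeCache` decorator apparently does not define (or forward) it. As a rough sketch only (the class below is a stand-in, not the real `WithFixedSizeCache`), one defensive pattern is to read the counter with a default instead of assuming it exists:

```python
# Minimal stand-in for a model-manager decorator that, like
# WithFixedSizeCache in the traceback above, never defined num_errors.
class ManagerStub:
    pass

manager = ManagerStub()

# Defensive variant of the failing line in count_errors:
# initialize the counter lazily on first use.
manager.num_errors = getattr(manager, "num_errors", 0) + 1
manager.num_errors = getattr(manager, "num_errors", 0) + 1

print(manager.num_errors)  # → 2
```

Whether the real fix belongs in the middleware or in `WithFixedSizeCache` itself is a design decision for the maintainers; this only illustrates why the attribute access fails.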
Hi @BChip, thank you for running the test. I have pushed a small change to this PR and then followed your test steps; I see no error, and I can additionally confirm the transformers version is now bound to 4.37.2. I will close this issue if no further problems concerning the transformers version are reported.
@grzegorz-roboflow Awesome, just tried it and it works! When will this fix land in the main release?
Thank you!!!!!
@BChip - that will be shipped to Docker Hub with the next release, which I believe will happen as soon as we close and test this PR: https://github.com/roboflow/inference/pull/343 - it is consuming a big part of our time and capacity right now. I would expect the release somewhere in the next two weeks. If you need a temporary solution, we may try to push a special tag with the build @grzegorz-roboflow made for you. And by the way - it seems the error described here: https://github.com/roboflow/inference/issues/355#issuecomment-2084469048 reports a bug of a separate kind that we also need to look at, so thanks a lot for reporting.
Search before asking
Bug
I'm encountering an issue while attempting to deploy the cogvlm model on my own GPU server using the Roboflow inference code. The server setup seems to be correct, but when I try to run the model, I run into the following error:
Upon further investigation and based on this GitHub issue (https://github.com/THUDM/CogVLM/issues/396), it's recommended to downgrade the transformers library to version 4.37 due to compatibility issues. However, the current deployment is using version 4.38. Could you please confirm if the transformers version could be the source of this issue and if downgrading would be appropriate? Any other insights or suggestions would also be greatly appreciated. Thank you!
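For reference, the compatibility window being discussed (transformers 4.37.x works, 4.38 breaks CogVLM loading) can be expressed as a simple version check. This is just an illustration of the constraint from the thread, not code from the repository:

```python
def parse_version(v: str) -> tuple:
    # Convert "4.37.2" -> (4, 37, 2); pre-release suffixes are ignored.
    return tuple(int(part) for part in v.split(".") if part.isdigit())

def cogvlm_compatible(transformers_version: str) -> bool:
    # Constraint discussed in this issue: >= 4.37 and < 4.38
    v = parse_version(transformers_version)
    return (4, 37) <= v < (4, 38)

print(cogvlm_compatible("4.37.2"))  # → True
print(cogvlm_compatible("4.38.0"))  # → False
```

The fix in PR 363 pins the version inside this window rather than leaving it open-ended, which is why rebuilding the image resolves the loading error.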
Environment
inference 0.9.20
inference-cli 0.9.20
inference-gpu 0.9.20
inference-sdk 0.9.20
x86-gpu(rtx3090)
Minimal Reproducible Example
Additional
No response
Are you willing to submit a PR?