jina-ai / dalle-flow

🌊 A Human-in-the-Loop workflow for creating HD images from text
grpcs://dalle-flow.dev.jina.ai

jina does not pass the right GPU to clipseg #135

Open · mchaker opened this issue 1 year ago

mchaker commented 1 year ago

Describe the bug

Does not work:

- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml

Works:

- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "6"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml

Describe how you solve it

I use the numeric GPU ID (sad)
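
For reference, the two identifier forms above can be cross-checked with nvidia-smi. A minimal sketch, assuming nvidia-smi is on PATH (the helper itself is illustrative, not part of the report):

import subprocess

# List each GPU's numeric index next to its UUID, e.g.
# "GPU 6: <product name> (UUID: GPU-87ddc7ee-...)".
out = subprocess.run(['nvidia-smi', '-L'], capture_output=True, text=True)
print(out.stdout)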


Environment

- jina 3.8.3
- docarray 0.16.2
- jcloud 0.0.35
- jina-hubble-sdk 0.18.0
- jina-proto 0.1.13
- protobuf 3.20.1
- proto-backend cpp
- grpcio 1.47.0
- pyyaml 6.0
- python 3.8.10
- platform Linux
- platform-release 5.15.0-52-generic
- platform-version #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022
- architecture x86_64
- processor x86_64
- uid 2485377892357
- session-id fcbedcc8-5d43-11ed-9251-0242ac110005
- uptime 2022-11-05T19:56:49.977485
- ci-vendor (unset)
* JINA_DEFAULT_HOST (unset)
* JINA_DEFAULT_TIMEOUT_CTRL (unset)
* JINA_DEPLOYMENT_NAME (unset)
* JINA_DISABLE_UVLOOP (unset)
* JINA_EARLY_STOP (unset)
* JINA_FULL_CLI (unset)
* JINA_GATEWAY_IMAGE (unset)
* JINA_GRPC_RECV_BYTES (unset)
* JINA_GRPC_SEND_BYTES (unset)
* JINA_HUB_NO_IMAGE_REBUILD (unset)
* JINA_LOG_CONFIG (unset)
* JINA_LOG_LEVEL (unset)
* JINA_LOG_NO_COLOR (unset)
* JINA_MP_START_METHOD (unset)
* JINA_OPTOUT_TELEMETRY (unset)
* JINA_RANDOM_PORT_MAX (unset)
* JINA_RANDOM_PORT_MIN (unset)

Screenshots

N/A

JoanFM commented 1 year ago

Hey @mchaker ,

What backend are you using? What does clipseg do? It seems that the DL backend does not understand the UUID.

JoanFM commented 1 year ago

Hey @mchaker ,

Are you sure your CUDA version supports MIG access?

https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#cuda-baremetal

That documentation lists the driver versions that support this feature, plus the syntax to use.

JoanFM commented 1 year ago

Can you try changing your YAML to:

- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "MIG-GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml

or

- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "MIG-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml

?

mchaker commented 1 year ago

My NVIDIA driver version is 515, so it supports MIG. However, I do not use MIG on my cards. I just use the main card UUID from nvidia-smi -L.

I'll try the MIG prefix and report back.

clipseg is an executor set up for Jina. I use the UUID GPU specification method with other executors, and Jina passes the right GPU to them. For some reason it does not pass the right GPU to the clipseg executor. :(

JoanFM commented 1 year ago

This is weird. Do you have the source code of clipseg? Can you check what the value is in the Executor when you do:

os.environ['CUDA_VISIBLE_DEVICES']?

What Jina does is simply set the env vars for each Executor process, so whether or not this is respected by the Executor is an Executor or upstream problem.
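
A minimal probe sketch for that check, assuming the jina 3.x Executor API (the EnvProbe name is illustrative):

import os
from jina import Executor


class EnvProbe(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Jina sets the per-executor `env:` entries before this process
        # starts, so this should print the UUID string from the Flow YAML.
        print('CUDA_VISIBLE_DEVICES =', os.environ.get('CUDA_VISIBLE_DEVICES'))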

mchaker commented 1 year ago

I see - will check the os.environ value and report back.

JoanFM commented 1 year ago

Hey @mchaker , any news about it?

mchaker commented 1 year ago

@JoanFM yes - CUDA_VISIBLE_DEVICES is GPU-87d2c7e5-c3eb-1181-1857-368f4c2bbbbb in the container (proper GPU ID)

However, Jina crashes with:

⠋ Waiting stablemulti clipseg upscalerp40 realesrgan... ━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/6 0:00:18
CRITI… clipseg/rep-0@61 can not load the executor from executors/clipseg/config.yml                          [11/11/22 14:54:57]
ERROR  clipseg/rep-0@61 RuntimeError('Attempting to deserialize object on CUDA device 0 but                  [11/11/22 14:54:57]
       torch.cuda.device_count() is 0. Please use torch.load with map_location to map your storages to an
       existing device.') during <class 'jina.serve.runtimes.worker.WorkerRuntime'> initialization
        add "--quiet-error" to suppress the exception details
       Traceback (most recent call last):
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/orchestrate/pods/__init__.py", line
       74, in run
           runtime = runtime_cls(
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/worker/__init__.py",
       line 36, in __init__
           super().__init__(args, **kwargs)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/asyncio.py", line 80,
       in __init__
           self._loop.run_until_complete(self.async_setup())
         File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
           return future.result()
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/worker/__init__.py",
       line 101, in async_setup
           self._data_request_handler = DataRequestHandler(
         File
       "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/request_handlers/data_reques…
       line 49, in __init__
           self._load_executor(
         File
       "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/request_handlers/data_reques…
       line 139, in _load_executor
           self._executor: BaseExecutor = BaseExecutor.load_config(
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/__init__.py", line 760, in
       load_config
           obj = JAML.load(tag_yml, substitute=False, runtime_args=runtime_args)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/__init__.py", line 174, in load
           r = yaml.load(stream, Loader=get_jina_loader_with_runtime(runtime_args))
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/__init__.py", line 81, in load
           return loader.get_single_data()
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/constructor.py", line 51, in
       get_single_data
           return self.construct_document(node)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/constructor.py", line 55, in
       construct_document
           data = self.construct_object(node)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/constructor.py", line 100, in
       construct_object
           data = constructor(self, node)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/__init__.py", line 582, in
       _from_yaml
           return get_parser(cls, version=data.get('version', None)).parse(
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/parsers/executor/legacy.py",
       line 45, in parse
           obj = cls(
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/executors/decorators.py", line
       63, in arg_wrapper
           f = func(self, *args, **kwargs)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/helper.py", line 71, in
       arg_wrapper
           f = func(self, *args, **kwargs)
         File "/dalle/dalle-flow/executors/clipseg/executor.py", line 71, in __init__
           torch.load(
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 789, in load
           return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1131, in
       _load
           result = unpickler.load()
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1101, in
       persistent_load
           load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1083, in
       load_tensor
           wrap_storage=restore_location(storage, location),
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1055, in
       restore_location
           return default_restore_location(storage, str(map_location))
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 215, in
       default_restore_location
           result = fn(storage, location)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 182, in
       _cuda_deserialize
           device = validate_cuda_device(location)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 173, in
       validate_cuda_device
           raise RuntimeError('Attempting to deserialize object on CUDA device '
       RuntimeError: Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0.
       Please use torch.load with map_location to map your storages to an existing device.
DEBUG  clipseg/rep-0@61 process terminated

JoanFM commented 1 year ago

Hey @mchaker ,

This problem is in the Executor and how it loads onto the GPU. Where are you getting it from? Maybe we can open an issue on that repo and fix it there?

mchaker commented 1 year ago

I see - let me check with the developer and see where they are getting the executor from. Maybe it is custom.

JoanFM commented 1 year ago

I believe the issue may come from how the model was stored, or something like this. In this case, Jina has made sure that your CUDA_VISIBLE_DEVICES env var is passed correctly to the Executor.
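
One way to test the storage hypothesis is to re-serialize the checkpoint with CPU storages, so that nothing in the file references a concrete CUDA device at unpickling time. A sketch, with placeholder paths:

import torch

# Placeholder paths: point these at the actual clipseg weight file.
state = torch.load('weights.pth', map_location='cpu')
torch.save(state, 'weights-cpu.pth')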

mchaker commented 1 year ago

I see -- I'll follow up with the executor authors and dig into the executor source. Thanks for your help!

mchaker commented 1 year ago

@JoanFM actually it looks like the executor is from Jina: https://github.com/jina-ai/dalle-flow/blob/main/executors/clipseg/executor.py

AmericanPresidentJimmyCarter commented 1 year ago

The device for the model is simply mapped with:

        model.load_state_dict(
            torch.load(
                f'{cache_path}/{WEIGHT_FOLDER_NAME}/rd64-uni.pth',
                map_location=torch.device('cuda'),
            ),
            strict=False,
        )

In this case it appears that torch is unable to map the location. @mchaker, before these lines in executors/clipseg/executor.py you can add print(os.environ.get('CUDA_VISIBLE_DEVICES')) to see what the environment actually is.
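
A slightly fuller version of that check, with torch-side probes added as a sketch:

import os
import torch

# Place just before the torch.load(...) call in
# executors/clipseg/executor.py to see what the process observes.
print('CUDA_VISIBLE_DEVICES =', os.environ.get('CUDA_VISIBLE_DEVICES'))
print('torch.cuda.is_available() =', torch.cuda.is_available())
print('torch.cuda.device_count() =', torch.cuda.device_count())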

JoanFM commented 1 year ago

Hey @AmericanPresidentJimmyCarter, do you know why it cannot be loaded with that CUDA_VISIBLE_DEVICES setting?

AmericanPresidentJimmyCarter commented 1 year ago

@JoanFM No, I will try to get you debug output from the env. This appears to be a strange one.

JoanFM commented 1 year ago

I am transferring the issue to dalle-flow because it is specific to the Executor in this project.

mchaker commented 1 year ago

@AmericanPresidentJimmyCarter what do you need from the env?

JoanFM commented 1 year ago

Hey @mchaker , @AmericanPresidentJimmyCarter , any progress on this?

AmericanPresidentJimmyCarter commented 1 year ago

I still do not know why it happens -- it's only this one specific executor that has the problem. We can update to the latest jina and see if it persists.

mchaker commented 1 year ago

I updated jina using pip install -U jina and the error still happens:

RuntimeError: Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0.
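
A common PyTorch-side workaround for this exact error, not confirmed as the fix in this thread, is to deserialize on CPU and move the model afterwards, so torch.load never has to resolve a CUDA device at unpickling time. A sketch reusing cache_path, WEIGHT_FOLDER_NAME, and model from the executor excerpt above:

import torch

# Map all storages to CPU at load time, then move the model to whatever
# device is actually visible in this process.
state = torch.load(
    f'{cache_path}/{WEIGHT_FOLDER_NAME}/rd64-uni.pth',
    map_location='cpu',
)
model.load_state_dict(state, strict=False)
model.to('cuda' if torch.cuda.is_available() else 'cpu')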