Hey @ruanrz,
can you give us the output of this command?
nvidia-smi -L
Also, what is the value of the CUDA_TOTAL_DEVICES environment variable?
Can you run this in your Executor and tell us what is printed?
class ZRExecutor(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        import os
        print(f' ENVIRONMENT VARIABLE CUDA_VISIBLE_DEVICES: {os.environ["CUDA_VISIBLE_DEVICES"]}')
@JoanFM
nvidia-smi -L
GPU 0: NVIDIA RTX A5000 (UUID: GPU-9fcbfc3e-c4c8-ecfc-9a2c-ec9dc1d2bdc1)
GPU 1: NVIDIA RTX A5000 (UUID: GPU-076c3ca9-0750-1f41-e8ea-5f0433ae2cc2)
GPU 2: NVIDIA RTX A5000 (UUID: GPU-570317ab-d3de-f466-da1e-cfccb6a5b75f)
and the CUDA_TOTAL_DEVICES environment variable is not defined.
ENVIRONMENT VARIABLE CUDA_VISIBLE_DEVICES: 1
torch.cuda.device_count() 3
torch.cuda.current_device() 0
ENVIRONMENT VARIABLE CUDA_VISIBLE_DEVICES: 2
torch.cuda.device_count() 3
torch.cuda.current_device() 0
ENVIRONMENT VARIABLE CUDA_VISIBLE_DEVICES: 0
torch.cuda.device_count() 3
torch.cuda.current_device() 0
And all models are still running on GPU 0, judging by the VRAM usage shown in nvidia-smi.
It is strange that current_device does not respect the environment variable in that specific Executor; there seems to be something off with the torch usage. We will check on our end as well.
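For context, a minimal standalone sketch (not taken from this thread): CUDA_VISIBLE_DEVICES is only honoured if it is set before CUDA is initialized in a process, which is consistent with a replica printing CUDA_VISIBLE_DEVICES: 1 yet still reporting three devices.

import os

# The mask must be in place before anything initializes CUDA in this process;
# a value set (or changed) afterwards is silently ignored.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

import torch

print(torch.cuda.device_count())    # 1 -- only the masked GPU is visible
print(torch.cuda.current_device())  # 0 -- visible devices are re-indexed from 0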
I also encountered the same problem with f = Flow(protocol='http').add(uses=MyExecutor, replicas=3, env={"CUDA_VISIBLE_DEVICES": "RR"}), but all replicas are still running on GPU 0.
Hello @ruanrz, @fqzhao-win,
I believe what happens is that torch is imported before the Executor starts, and this is why CUDA_VISIBLE_DEVICES does not take effect.
Would you try one of these two things?
1- Refactor your code to separate the Executor and the Flow, and include the Executor from a module or a file (https://docs.jina.ai/concepts/flow/add-executors/#define-executor-with-uses)
2- Keep it as it is, but move the import torch inside the method where it is needed, so that the module is not imported from the start but only inside the new Executor processes
I believe this should solve your issues; a sketch combining both suggestions follows below.
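A minimal sketch combining both suggestions (file names and the Executor name are assumptions): the Executor lives in its own module and defers the torch import to __init__, while the entry-point script that builds the Flow never imports torch, so CUDA is first initialized inside each Executor process after its CUDA_VISIBLE_DEVICES has been set.

# my_executor.py (assumed file name) -- all heavy imports stay in this module
from jina import Executor

class MyGPUExecutor(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        import torch  # deferred: CUDA is first touched inside the Executor process
        print('visible GPUs in this replica:', torch.cuda.device_count())

# app.py (assumed file name) -- builds the Flow, never imports torch itself
from jina import Flow
from my_executor import MyGPUExecutor

f = Flow().add(uses=MyGPUExecutor, replicas=3, env={'CUDA_VISIBLE_DEVICES': 'RR'})
with f:
    f.block()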
Hello @ruanrz, @fqzhao-win, have you tried the suggested alternatives?
Hi @JoanFM, sorry for the late reply. I tried the second suggestion and installed the latest version of Jina. It runs successfully. Here is my code:
from jina import Executor, requests, Flow
import time

class ZRExecutor(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        import os
        print(f' ENVIRONMENT VARIABLE CUDA_VISIBLE_DEVICES: {os.environ["CUDA_VISIBLE_DEVICES"]}')
        import torch
        print('torch.cuda.device_count()', torch.cuda.device_count())
        print('torch.cuda.current_device()', torch.cuda.current_device())
        from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler
        from diffusers import DPMSolverMultistepScheduler
        model_path = "#####"
        lms = EulerAncestralDiscreteScheduler(
            beta_start=0.00085,
            beta_end=0.012,
            beta_schedule="scaled_linear"
        )
        pipe = DiffusionPipeline.from_pretrained(
            model_path,
            cache_dir="./huggingface",
            resume_download=True,
            custom_pipeline="lpw_stable_diffusion",
            torch_dtype=torch.float16,
            scheduler=lms,
            use_auth_token="#####",
            safety_checker=None,
        )
        pipe.to("cuda")

def main():
    f = Flow().add(uses=ZRExecutor, name='testens', replicas=2, env={"CUDA_VISIBLE_DEVICES": "RR"})
    with f:
        f.block()

if __name__ == '__main__':
    main()
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A5000 On | 00000000:01:00.0 Off | Off |
| 30% 26C P8 16W / 230W | 2758MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 On | 00000000:25:00.0 Off | Off |
| 30% 26C P8 15W / 230W | 2760MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Thank you for your patient advice and guidance.
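One side note on the snippet posted above: pipe is a local variable in __init__, so request handlers cannot reach the pipeline. A hedged sketch of how one might keep it on the instance and expose an endpoint; the model id, endpoint name, and the use of the legacy Document fields .text/.tensor are assumptions, not from the original post.

import numpy as np
from jina import Executor, requests

class DiffusionExecutor(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Deferred imports, as above, so CUDA is first initialized in this process.
        import torch
        from diffusers import DiffusionPipeline
        self.pipe = DiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5",  # assumed model id
            torch_dtype=torch.float16,
        )
        self.pipe.to("cuda")

    @requests
    def generate(self, docs, **kwargs):
        # Use each Document's text as the prompt and attach the generated
        # image to the Document as an ndarray.
        for doc in docs:
            image = self.pipe(doc.text).images[0]
            doc.tensor = np.asarray(image)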
I tried to deploy multiple replicas with multiple GPUs, but CUDA_VISIBLE_DEVICES=RR and env={"CUDA_VISIBLE_DEVICES": "RR"} do not work as the documentation says.

Code

With CUDA_VISIBLE_DEVICES=RR it raises an error. If I remove CUDA_VISIBLE_DEVICES=RR and run the code, it runs successfully. However, checking the GPU usage, it seems that all models are running on GPU 0, and the script outputs torch.cuda.current_device() 0 three times.

Jina version