pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

AMD ROCM support #403

Closed boniek83 closed 3 years ago

boniek83 commented 4 years ago

Are AMD GPUs supported? How do I start serving models with ROCm?

boniek83 commented 4 years ago

I've tried compiling PyTorch with ROCm using the official instructions: https://rocmdocs.amd.com/en/latest/Deep_learning/Deep-learning.html#recommended-install-using-published-pytorch-rocm-docker-image
It works:

root@c52468e8ed5b:/# python3.6 -c 'import torch;print("GPU:",torch.cuda.is_available())'
GPU: True
root@c52468e8ed5b:/# python3.6 -c 'import torch;print("DeviceID:",str(torch.cuda.current_device()))'
DeviceID: 0
root@c52468e8ed5b:/# python3.6 -c 'import torch;print("DeviceName:",str(torch.cuda.get_device_name(torch.cuda.current_device())))'
DeviceName: Device 66af

But when trying to serve an example model with TorchServe, it does not see any GPUs:

root@c52468e8ed5b:/# torchserve --foreground --start --ncs --model-store model_store --models densenet161.mar
Removing orphan pid file.
2020-06-01 14:23:20,415 [INFO ] main org.pytorch.serve.ModelServer -
TS Home: /usr/local/lib/python3.6/dist-packages
Current directory: /
Temp directory: /tmp
Number of GPUs: 0
Number of CPUs: 4
Max heap size: 16060 M
Python executable: /usr/bin/python3.6
Config file: N/A
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Model Store: /model_store
Initial Models: densenet161.mar
Log dir: /logs
Metrics dir: /logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 4
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
2020-06-01 14:23:20,443 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: densenet161.mar
2020-06-01 14:23:22,610 [INFO ] main org.pytorch.serve.archive.ModelArchive - model folder already exists: 100ec3accb58b76e403d0a06161e3652ed2caef3
2020-06-01 14:23:22,622 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model densenet161
2020-06-01 14:23:22,622 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model densenet161
2020-06-01 14:23:22,622 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model densenet161 loaded.
2020-06-01 14:23:22,622 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: densenet161, count: 4
2020-06-01 14:23:22,637 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2020-06-01 14:23:22,746 [INFO ] W-9003-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9003
2020-06-01 14:23:22,747 [INFO ] W-9003-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]3738
2020-06-01 14:23:22,748 [DEBUG] W-9003-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - W-9003-densenet161_1.0 State change null -> WORKER_STARTED
2020-06-01 14:23:22,750 [INFO ] W-9003-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-06-01 14:23:22,750 [INFO ] W-9003-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.10
2020-06-01 14:23:22,751 [INFO ] W-9003-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9003
2020-06-01 14:23:22,764 [INFO ] W-9001-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9001
2020-06-01 14:23:22,765 [INFO ] W-9001-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]3740
2020-06-01 14:23:22,765 [INFO ] W-9001-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-06-01 14:23:22,766 [DEBUG] W-9001-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-densenet161_1.0 State change null -> WORKER_STARTED
2020-06-01 14:23:22,766 [INFO ] W-9001-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.10
2020-06-01 14:23:22,766 [INFO ] W-9001-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9001
2020-06-01 14:23:22,768 [INFO ] W-9002-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9002
2020-06-01 14:23:22,769 [INFO ] W-9002-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]3741
2020-06-01 14:23:22,769 [DEBUG] W-9002-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - W-9002-densenet161_1.0 State change null -> WORKER_STARTED
2020-06-01 14:23:22,769 [INFO ] W-9002-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-06-01 14:23:22,769 [INFO ] W-9002-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9002
2020-06-01 14:23:22,769 [INFO ] W-9002-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.10
2020-06-01 14:23:22,794 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9000
2020-06-01 14:23:22,796 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]3739
2020-06-01 14:23:22,796 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-06-01 14:23:22,796 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-densenet161_1.0 State change null -> WORKER_STARTED
2020-06-01 14:23:22,796 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.10
2020-06-01 14:23:22,796 [INFO ] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2020-06-01 14:23:22,799 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2020-06-01 14:23:22,800 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2020-06-01 14:23:22,801 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2020-06-01 14:23:22,802 [INFO ] W-9002-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9002.
2020-06-01 14:23:22,802 [INFO ] W-9001-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9001.
2020-06-01 14:23:22,803 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9000.
2020-06-01 14:23:22,803 [INFO ] W-9003-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9003.
Model server started.
2020-06-01 14:23:22,975 [INFO ] pool-2-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:c52468e8ed5b,timestamp:1591021402
2020-06-01 14:23:22,976 [INFO ] pool-2-thread-1 TS_METRICS - DiskAvailable.Gigabytes:26.363086700439453|#Level:Host|#hostname:c52468e8ed5b,timestamp:1591021402
2020-06-01 14:23:22,976 [INFO ] pool-2-thread-1 TS_METRICS - DiskUsage.Gigabytes:47.13031005859375|#Level:Host|#hostname:c52468e8ed5b,timestamp:1591021402
2020-06-01 14:23:22,976 [INFO ] pool-2-thread-1 TS_METRICS - DiskUtilization.Percent:64.1|#Level:Host|#hostname:c52468e8ed5b,timestamp:1591021402
2020-06-01 14:23:22,977 [INFO ] pool-2-thread-1 TS_METRICS - MemoryAvailable.Megabytes:60397.58203125|#Level:Host|#hostname:c52468e8ed5b,timestamp:1591021402
2020-06-01 14:23:22,977 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUsed.Megabytes:3305.55078125|#Level:Host|#hostname:c52468e8ed5b,timestamp:1591021402
2020-06-01 14:23:22,978 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUtilization.Percent:6.0|#Level:Host|#hostname:c52468e8ed5b,timestamp:1591021402
2020-06-01 14:23:25,812 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process die.
2020-06-01 14:23:25,813 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last):
2020-06-01 14:23:25,814 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/ts/torch_handler/image_classifier.py", line 84, in handle
2020-06-01 14:23:25,815 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     _service.initialize(context)
2020-06-01 14:23:25,815 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/ts/torch_handler/base_handler.py", line 32, in initialize
2020-06-01 14:23:25,815 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")
2020-06-01 14:23:25,815 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: Invalid device string: 'cuda:None'
2020-06-01 14:23:25,816 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -
2020-06-01 14:23:25,816 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - During handling of the above exception, another exception occurred:
2020-06-01 14:23:25,816 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -
2020-06-01 14:23:25,816 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last):
2020-06-01 14:23:25,817 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 163, in <module>
2020-06-01 14:23:25,817 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     worker.run_server()
2020-06-01 14:23:25,817 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 141, in run_server
2020-06-01 14:23:25,817 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     self.handle_connection(cl_socket)
2020-06-01 14:23:25,818 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 105, in handle_connection
2020-06-01 14:23:25,818 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     service, result, code = self.load_model(msg)
2020-06-01 14:23:25,818 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 83, in load_model
2020-06-01 14:23:25,819 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     service = model_loader.load(model_name, model_dir, handler, gpu, batch_size)
2020-06-01 14:23:25,819 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/ts/model_loader.py", line 106, in load
2020-06-01 14:23:25,819 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     entry_point(None, service.context)
2020-06-01 14:23:25,819 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/ts/torch_handler/image_classifier.py", line 95, in handle
2020-06-01 14:23:25,820 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     raise Exception("Please provide a custom handler in the model archive.")
2020-06-01 14:23:25,820 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Exception: Please provide a custom handler in the model archive.
2020-06-01 14:23:25,822 [INFO ] epollEventLoopGroup-4-4 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2020-06-01 14:23:25,824 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133)
        at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432)
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:128)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
2020-06-01 14:23:25,826 [WARN ] W-9000-densenet161_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: densenet161, error: Worker died.
2020-06-01 14:23:25,826 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-densenet161_1.0 State change WORKER_STARTED -> WORKER_STOPPED
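For context on the traceback: torch.cuda.is_available() is True inside the ROCm build, but the frontend only probes nvidia-smi, reports "Number of GPUs: 0", and therefore hands the worker gpu_id=None. The quoted line in base_handler.py then builds the device string "cuda:None", which raises the RuntimeError. A minimal sketch of that code path with a defensive guard (illustrative only, not the actual TorchServe source):

import torch

def pick_device(properties):
    # The frontend counts GPUs with nvidia-smi only, so on a ROCm box it
    # passes gpu_id=None even though torch.cuda.is_available() is True.
    gpu_id = properties.get("gpu_id")
    if torch.cuda.is_available() and gpu_id is not None:
        return torch.device("cuda:" + str(gpu_id))
    return torch.device("cpu")

# The handler in the traceback skips the None check, so it effectively calls
# torch.device("cuda:None") and dies with "Invalid device string: 'cuda:None'".
print(pick_device({"gpu_id": None}))  # -> cpu with the guard above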
boniek83 commented 4 years ago

It seems src/main/java/org/pytorch/serve/util/ConfigManager.java has a problematic method, getAvailableGpu(), which only supports nvidia-smi. The AMD equivalent output would be:

root@c52468e8ed5b:/# rocm-smi -i --csv

device,GPU ID
card0,0x66af

Multi-GPU:

root@d38eedaaa63e:/# rocm-smi -i --csv

device,GPU ID
card1,0x66af
card2,0x66af
card3,0x66af
card4,0x66af

Yes, the output is padded with blank lines for some reason. I've created an nvidia-smi bash script that returns what this class expects, and everything works! So it seems supporting ROCm is just a matter of adding support for rocm-smi output as described above. A sketch of the counting logic follows.
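For illustration, a rough sketch of that counting logic, written in Python for brevity (the real change would belong in ConfigManager.getAvailableGpu(); the helper name and error handling here are assumptions, not TorchServe code):

import csv
import subprocess

def available_rocm_gpus():
    # Hypothetical helper: count the devices listed by `rocm-smi -i --csv`,
    # analogous to what getAvailableGpu() currently does with nvidia-smi.
    try:
        out = subprocess.check_output(["rocm-smi", "-i", "--csv"],
                                      universal_newlines=True)
    except (OSError, subprocess.CalledProcessError):
        return 0  # rocm-smi missing or failing: report no AMD GPUs
    # rocm-smi pads the CSV with blank lines, so drop them before parsing.
    lines = [line for line in out.splitlines() if line.strip()]
    return len(list(csv.DictReader(lines)))

print(available_rocm_gpus())  # 1 for the single-GPU box above, 4 for the multi-GPU one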

riyanah commented 4 years ago

I've tried compiling PyTorch with ROCm using the official instructions: https://rocmdocs.amd.com/en/latest/Deep_learning/Deep-learning.html#recommended-install-using-published-pytorch-rocm-docker-image
It works:


root@c52468e8ed5b:/# python3.6 -c 'import torch;print("GPU:",torch.cuda.is_available())'
GPU: True
root@c52468e8ed5b:/# python3.6 -c 'import torch;print("DeviceID:",str(torch.cuda.current_device()))'
DeviceID: 0
root@c52468e8ed5b:/# python3.6 -c 'import torch;print("DeviceName:",str(torch.cuda.get_device_name(torch.cuda.current_device())))'
DeviceName: Device 66af

I am having difficulty following the official instructions to install ROCm PyTorch on my machine. Did you install ROCm v3.3 or v3.5 before pulling the Docker image? The official instructions say "A ROCm install version 3.3 is required currently," yet it's pulling rocm/pytorch:rock3.5 from Docker Hub?

boniek83 commented 4 years ago

I'm using ROCk 3.3 (ROCk is the kernel driver, ROCm is the userland) with the rocm/pytorch:rocm3.3_ubuntu16.04_py3.6_pytorch image.

msaroufim commented 3 years ago

Closing; AMD support is being tracked in https://github.com/pytorch/serve/issues/740