Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Inference speed is slower when using a GPU with ray than using CPU #36551
Open
Cascol-Chen opened 1 year ago
What happened + What you expected to happen
When using the textgen example at https://github.com/alpa-projects/alpa/blob/main/examples/llm_serving/textgen.py, to which I added extra code for Ray initialization at the beginning, I observed that alpa/125m is much slower than facebook/125m. The difference is as follows:
- using alpa/125m on a single GPU (A800)
- using facebook/125m on CPU
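The reported timings were not captured above. One way to quantify the gap is to time each backend the same way; the sketch below is illustrative only (the `generate` callable is a stand-in for whatever produces text in textgen.py, not the actual Alpa or Hugging Face API):

```python
import time

def benchmark(generate, prompt, n_runs=3):
    """Time a text-generation callable; returns mean seconds per run.

    `generate` is a hypothetical stand-in for the model call in
    textgen.py (alpa/125m or facebook/125m) -- not a real API here.
    """
    # Warm-up run so one-time costs (compilation, weight transfer,
    # GPU memory allocation) are excluded from the measurement.
    generate(prompt)
    start = time.perf_counter()
    for _ in range(n_runs):
        generate(prompt)
    return (time.perf_counter() - start) / n_runs

# Usage with a dummy generator standing in for the real model call:
dummy = lambda prompt: prompt.upper()
mean_s = benchmark(dummy, "Paris is the capital of")
print(f"{mean_s * 1e6:.1f} us/run")
```

Excluding the warm-up run matters especially on GPU, where first-call overheads (kernel compilation, host-to-device weight transfer) can dominate short generations and make the GPU path look slower than it is in steady state.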
Versions / Dependencies
ray=2.1.0, python=3.8, alpa=0.2.3, cupy-cuda111=12.1.0
Reproduction script
Here is what my run script looks like:
Issue Severity
High: It blocks me from completing my task.