ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Inference speed is slower when using a GPU with ray than using CPU #36551

Open Cascol-Chen opened 1 year ago

Cascol-Chen commented 1 year ago

What happened + What you expected to happen

When using the textgen example at https://github.com/alpa-projects/alpa/blob/main/examples/llm_serving/textgen.py, to which I added extra code for Ray initialization at the beginning (shown below), I observed that alpa/125m is much slower than facebook/125m. The difference is as follows:

if 'alpa' in args.model:
    import ray
    ray.init(namespace="alpa_serve", num_gpus=1)

using alpa/125m on a single GPU (A800)

[screenshot: timing output for alpa/125m on GPU]

using facebook/125m on CPU

[screenshot: timing output for facebook/125m on CPU]
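When comparing GPU and CPU inference times like this, a common pitfall is including one-off costs (CUDA context creation, weight transfer to the device, compilation) in the measurement. A minimal timing harness, using a hypothetical `run_inference` stand-in for the actual textgen call, might separate warm-up from steady-state like this:

```python
import time

def run_inference(prompt):
    # Stand-in for the actual model call (hypothetical); replace with
    # the textgen generate() call when benchmarking for real.
    return prompt[::-1]

def benchmark(fn, arg, warmup=3, iters=10):
    # Discard warm-up iterations so one-off costs (CUDA context
    # creation, weight transfer, JIT compilation) are excluded
    # from the reported average latency.
    for _ in range(warmup):
        fn(arg)
    start = time.perf_counter()
    for _ in range(iters):
        fn(arg)
    return (time.perf_counter() - start) / iters

avg = benchmark(run_inference, "hello world")
print(f"avg latency: {avg:.6f}s")
```

If the GPU numbers only look slow because they include the first call, the steady-state average from a harness like this would make that visible.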

Versions / Dependencies

ray=2.1.0, python=3.8, alpa=0.2.3, cupy-cuda111=12.1.0

Reproduction script

Here is what my run script looks like: [screenshot of run script]

Issue Severity

High: It blocks me from completing my task.

sihanwang41 commented 1 year ago

Hi @Cascol-SCUT, I'm not sure what the expected performance is. Do you see the GPU being faster without Ray?
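One way to answer that cleanly (a sketch, not specific to alpa) is to gate the Ray setup on an environment variable, so the exact same script can be timed with and without the Ray runtime in the loop; the `USE_RAY` toggle here is an assumption, not part of the original script:

```python
import os

# Hypothetical toggle: run with USE_RAY=1 to initialize Ray,
# and unset (or 0) to run the identical script without it.
USE_RAY = os.environ.get("USE_RAY", "0") == "1"

def setup():
    if USE_RAY:
        import ray  # imported lazily so the no-Ray path needs no Ray install
        ray.init(namespace="alpa_serve", num_gpus=1)
        return "ray"
    return "plain"

mode = setup()
print(f"running in {mode} mode")
```

Timing the same workload under both modes would show whether the slowdown comes from Ray itself or from the model/GPU path.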