ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Inference speed is slower when using a GPU with ray than using CPU #36551

Open Cascol-Chen opened 1 year ago

Cascol-Chen commented 1 year ago

What happened + What you expected to happen

When using the textgen example at https://github.com/alpa-projects/alpa/blob/main/examples/llm_serving/textgen.py, to which I added extra code for Ray initialization at the beginning (shown below), I observed that alpa/125m is much slower than facebook/125m. The difference is as follows:

if 'alpa' in args.model:
    import ray
    ray.init(namespace="alpa_serve", num_gpus=1)

using alpa/125m on a single GPU (A800)

[screenshot: timing output for alpa/125m on GPU]

using facebook/125m on CPU

[screenshot: timing output for facebook/125m on CPU]
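When comparing GPU and CPU inference times like this, a common pitfall is including one-off costs (CUDA context creation, weight transfer to the device, compilation) in the measurement. A minimal timing harness, using a hypothetical `run_inference` stand-in for the actual textgen call, might separate warm-up from steady-state like this:

```python
import time

def run_inference(prompt):
    # Stand-in for the actual model call (hypothetical); replace with
    # the textgen generate() call when benchmarking for real.
    return prompt[::-1]

def benchmark(fn, arg, warmup=3, iters=10):
    # Discard warm-up iterations so one-off costs (CUDA context
    # creation, weight transfer, JIT compilation) are excluded
    # from the reported average latency.
    for _ in range(warmup):
        fn(arg)
    start = time.perf_counter()
    for _ in range(iters):
        fn(arg)
    return (time.perf_counter() - start) / iters

avg = benchmark(run_inference, "hello world")
print(f"avg latency: {avg:.6f}s")
```

If the GPU numbers only look slow because they include the first call, the steady-state average from a harness like this would make that visible.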

Versions / Dependencies

ray=2.1.0, python=3.8, alpa=0.2.3, cupy-cuda111=12.1.0

Reproduction script

Here is what my run script looks like: [screenshot of run script]

Issue Severity

High: It blocks me from completing my task.

sihanwang41 commented 1 year ago

Hi @Cascol-SCUT, I'm not sure what the expected performance is. Do you see the GPU being faster without Ray?
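One way to answer that cleanly (a sketch, not specific to alpa) is to gate the Ray setup on an environment variable, so the exact same script can be timed with and without the Ray runtime in the loop; the `USE_RAY` toggle here is an assumption, not part of the original script:

```python
import os

# Hypothetical toggle: run with USE_RAY=1 to initialize Ray,
# and unset (or 0) to run the identical script without it.
USE_RAY = os.environ.get("USE_RAY", "0") == "1"

def setup():
    if USE_RAY:
        import ray  # imported lazily so the no-Ray path needs no Ray install
        ray.init(namespace="alpa_serve", num_gpus=1)
        return "ray"
    return "plain"

mode = setup()
print(f"running in {mode} mode")
```

Timing the same workload under both modes would show whether the slowdown comes from Ray itself or from the model/GPU path.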