microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

run-example.sh fails with urllib3.exceptions.ProtocolError: Response ended prematurely #896

Closed awan-10 closed 4 months ago

awan-10 commented 6 months ago

When I modified “run_example.sh” and changed the backend to vllm, I got the error message below. I will do some checking to see whether the error comes from the server side or the client side.

I notice this benchmark has three modes: “mii”, “vllm”, and “aml”. The mii and vllm modes target serving frameworks, while the aml mode benchmarks an API server on Azure. Is it possible to run this script to benchmark a local API server? I am thinking of running vllm serving in a separate command and using this benchmark to test the API server that vllm started, so I would have better control over how the vllm server is started and could see all the error messages from the vllm server if it fails.
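What I have in mind is roughly the following (just a sketch, assuming vllm's standalone api_server entrypoint; the model name and port below are placeholders):

# Start vllm serving by hand in one terminal, so its logs and failures stay visible
python -m vllm.entrypoints.api_server \
        --model <model-name> \
        --tensor-parallel-size 1 \
        --port 26500

# ...then, in another terminal, run this benchmark as a pure client against the
# already-running server, instead of letting run_benchmark.py deploy its own.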

(vllm) [gma@spr02 mii]$ bash ./run_vllm.sh
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Token indices sequence length is longer than the specified maximum sequence length for this model (5883 > 4096). Running this sequence through the model will result in indexing errors
warmup queue size: 37 (1070543)
Process Process-1:
Traceback (most recent call last):
  File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/requests/models.py", line 816, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/urllib3/response.py", line 1040, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/urllib3/response.py", line 1184, in read_chunked
    self._update_chunk_length()
  File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/urllib3/response.py", line 1119, in _update_chunk_length
    raise ProtocolError("Response ended prematurely") from None
urllib3.exceptions.ProtocolError: Response ended prematurely

awan-10 commented 6 months ago

@delock - FYI. I created this issue so we can track and fix it. Please work with the folks assigned to it.

lekurile commented 6 months ago

Hello @delock,

Thank you for raising this issue. I ran a local vllm benchmark with the microsoft/Phi-3-mini-4k-instruct model using the following code:

# Run benchmark
python ./run_benchmark.py \
        --model microsoft/Phi-3-mini-4k-instruct \
        --tp_size 1 \
        --num_replicas 1 \
        --max_ragged_batch_size 768 \
        --mean_prompt_length 2600 \
        --mean_max_new_tokens 60 \
        --stream \
        --backend vllm \
        --overwrite_results

### Generate the plots
python ./src/plot_th_lat.py --data_dirs results_vllm/

echo "Find figures in ./plots/ and log outputs in ./results/"

I also had to add the "--trust-remote-code" argument to the vllm_cmd here: https://github.com/microsoft/DeepSpeedExamples/blob/1be0fc77a62ef965e2dea920789f7df95a843820/benchmarks/inference/mii/src/server.py#L39
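For reference, with that change the vllm server launch ends up looking roughly like this (a sketch only, not the exact argument list that server.py builds; the port is illustrative):

# Rough shape of the launched vllm server command after adding the flag
python -m vllm.entrypoints.api_server \
        --model microsoft/Phi-3-mini-4k-instruct \
        --tensor-parallel-size 1 \
        --port 26500 \
        --trust-remote-code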

Here's the resulting plot:

Can you please provide a reproduction script for the issue you show above, so I can test on my end?

To answer your question:

Is it possible to run this script to benchmark a local API server? I am thinking of running vllm serving in a separate command and using this benchmark to test the API server that vllm started, so I would have better control over how the vllm server is started and could see all the error messages from the vllm server if it fails.

We can update the benchmarking script to accept an additional argument that provides connection information for an existing local server. In that case the script will not stand up a new server, but will instead target the existing one using the information provided.
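Usage could then look roughly like this (the argument names below are placeholders for illustration, not an implemented interface):

# Hypothetical: point the benchmark at a server you started yourself
python ./run_benchmark.py \
        --model microsoft/Phi-3-mini-4k-instruct \
        --backend vllm \
        --stream \
        --existing_server_host localhost \
        --existing_server_port 26500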

delock commented 6 months ago

@awan-10 @lekurile Thanks for starting this thread. I hit this error when I tried to run this example on a Xeon server with CPU. I suspect it is a configuration issue. For now, I plan to modify the script to run the client code only and start the server from a separate command line, so I will be able to see more error messages and get a better understanding.

delock commented 6 months ago

Hi @lekurile, I can now start the server from a separate command line and run the benchmark against it with a reduced test size (max batch 128, avg prompt 128) to start with.
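The client-side invocation is roughly the same as your script above, just with the reduced sizes (a sketch; the model name here is only an example, and the vllm server is already running in another terminal):

# Reduced-size client run against the separately started server
python ./run_benchmark.py \
        --model microsoft/Phi-3-mini-4k-instruct \
        --tp_size 1 \
        --num_replicas 1 \
        --max_ragged_batch_size 128 \
        --mean_prompt_length 128 \
        --mean_max_new_tokens 60 \
        --stream \
        --backend vllm \
        --overwrite_results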

Yet I hit the following error during post-processing; I suspect it is due to the transformers version. What transformers version are you using? Mine is transformers==4.40.1.

Traceback (most recent call last):
  File "/home/gma/DeepSpeedExamples/benchmarks/inference/mii/./run_benchmark.py", line 44, in <module>
    run_benchmark()
  File "/home/gma/DeepSpeedExamples/benchmarks/inference/mii/./run_benchmark.py", line 36, in run_benchmark
    print_summary(client_args, response_details)
  File "/home/gma/DeepSpeedExamples/benchmarks/inference/mii/src/utils.py", line 235, in print_summary
    ps = get_summary(vars(args), response_details)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gma/DeepSpeedExamples/benchmarks/inference/mii/src/postprocess_results.py", line 80, in get_summary
    [
  File "/home/gma/DeepSpeedExamples/benchmarks/inference/mii/src/postprocess_results.py", line 81, in <listcomp>
    (len(get_tokenizer().tokenize(r.prompt)) + len(get_tokenizer().tokenize(r.generated_tokens)))
                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 396, in tokenize
    return self.encode_plus(text=text, text_pair=pair, add_special_tokens=add_special_tokens, **kwargs).tokens()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3037, in encode_plus
    return self._encode_plus(
           ^^^^^^^^^^^^^^^^^^
  File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 576, in _encode_plus
    batched_output = self._batch_encode_plus(
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 504, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

lekurile commented 6 months ago

Hi @delock,

I'm using transformers==4.40.1 as well.

After https://github.com/microsoft/DeepSpeedExamples/pull/895 was committed to the repo, I'm seeing the same error on my end as well.

  File "/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 504, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

Can you please try detaching your repo HEAD to fab5d06, one commit prior, and running again? I'll look into this PR and see if we need to revert or not.
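For example, something like this (assuming your checkout lives in a DeepSpeedExamples/ directory):

# Detach HEAD at the commit just before PR #895, then rerun the benchmark
cd DeepSpeedExamples
git checkout fab5d06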

Thanks, Lev

lekurile commented 6 months ago

@delock, here's the PR fixing the tokens_per_sec metric to work for both the streaming and non-streaming cases: https://github.com/microsoft/DeepSpeedExamples/pull/897

You should be able to get past your error above with this PR, but I'm curious if you're seeing any failures still.

delock commented 6 months ago

Yes, with the latest version the benchmark can move forward. I will see whether it runs to completion.

@delock, here's the PR fixing the tokens_per_sec metric to work for both the streaming and non-streaming cases: #897

You should be able to get past your error above with this PR, but I'm curious if you're seeing any failures still.

delock commented 6 months ago

Hi @lekurile, the benchmark now proceeds but hits some other errors when running on CPU. I'll check with the vllm CPU engineers to investigate them. I also submitted a PR adding a flag that allows starting the server from a separate command line: https://github.com/microsoft/DeepSpeedExamples/pull/900

loadams commented 4 months ago

Thanks @delock - can we close this issue for now?

delock commented 4 months ago

Thanks @delock - can we close this issue for now?

Yes, this is no longer an issue now, thanks!

loadams commented 4 months ago

Thanks!