huggingface / transformers-bloom-inference

Fast Inference Solutions for BLOOM
Apache License 2.0

The generated results are different when using greedy search during generation #65

Open FrostML opened 1 year ago

FrostML commented 1 year ago

Thank you very much for your work. I got a problem when I ran BLOOM-176B on 8*A100.

I followed the README.md and executed the command below. Specifically, I set do_sample = true and top_k = 1, which I thought was equivalent to greedy search:

python -m inference_server.cli --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": true, "top_k": 1}'
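For context, those generate_kwargs correspond to a plain transformers generate() call. Below is a minimal sketch of that mapping, assuming the small bigscience/bloom-560m checkpoint and a placeholder prompt purely for illustration (the 176B model needs multiple GPUs):

```python
# Sketch only: the same generate_kwargs passed directly to transformers'
# generate(), using the small bloom-560m checkpoint instead of bloom-176B.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # assumption: small checkpoint for a quick repro
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

inputs = tokenizer("DeepSpeed is a", return_tensors="pt")  # placeholder prompt
outputs = model.generate(
    **inputs,
    min_length=100,
    max_new_tokens=100,
    do_sample=True,  # sampling enabled, but ...
    top_k=1,         # ... only the single highest-probability token is kept
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```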

However, the outputs of several generation runs with the same input were occasionally different.

Do you have any clues or ideas about this?

My env info:

CUDA 11.7
nccl 2.14.3

accelerate 0.17.1
Flask 2.2.3
Flask-API 3.0.post1
gunicorn 20.1.0
pydantic 1.10.6
huggingface-hub 0.13.2
mayank31398 commented 1 year ago

Hi, do_sample = true with top_k = 1 should be fine, but the correct way to do greedy search is simply do_sample = False. This is weird. I don't think this is a bug in the code in this repository, but I will try to give it a shot. Can you try with just do_sample = False?
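A quick way to compare the two settings is to run each repeatedly on a small checkpoint and check whether the decoded strings agree. This is only a sketch, not this repository's CLI; the model name and prompt are placeholders:

```python
# Sketch: compare repeated generations under greedy search vs. top_k=1 sampling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # placeholder small checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
inputs = tokenizer("The capital of France is", return_tensors="pt")  # placeholder prompt

def run(**gen_kwargs):
    out = model.generate(**inputs, max_new_tokens=20, **gen_kwargs)
    return tokenizer.decode(out[0], skip_special_tokens=True)

greedy = [run(do_sample=False) for _ in range(5)]
topk1 = [run(do_sample=True, top_k=1) for _ in range(5)]

print("greedy runs identical:", len(set(greedy)) == 1)
print("top_k=1 runs identical:", len(set(topk1)) == 1)
```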

FrostML commented 1 year ago

Hi @mayank31398, sorry for the late reply. With do_sample=False it was fine: the results were all the same. But I still can't figure out why sampling doesn't behave deterministically with top_k=1. Do you know who, or which repo, I could turn to for help?

richarddwang commented 1 year ago

Refer to https://huggingface.co/blog/how-to-generate. Sampling is designed to incorporate randomness into picking the next word.
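For reference, the usual top-k sampling recipe looks roughly like the sketch below (a simplified illustration, not the exact transformers implementation). With k = 1 only one token keeps non-zero probability, so in theory the draw collapses to argmax:

```python
# Simplified sketch of top-k sampling: keep the k highest logits,
# renormalize, then draw from the remaining distribution.
import torch

def sample_top_k(logits: torch.Tensor, k: int) -> torch.Tensor:
    topk_vals, _ = torch.topk(logits, k)
    # mask out everything below the k-th largest logit
    filtered = logits.masked_fill(logits < topk_vals[..., -1, None], float("-inf"))
    probs = torch.softmax(filtered, dim=-1)
    # with k == 1 only one token has non-zero probability, so the draw is
    # deterministic -- unless the logits themselves change between runs
    return torch.multinomial(probs, num_samples=1)
```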

FrostML commented 1 year ago

But k is 1, so there shouldn't be any randomness. @richarddwang