google / jetstream-pytorch

PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference
Apache License 2.0

Issues with prefill & generate #173

Open qihqi opened 3 weeks ago

qihqi commented 3 weeks ago

As reported by @tengomucho

Currently there are a few issues with the prefill / generate implementation:

  1. Prefill does not use self._sample to do sampling.
  2. Prefill already returns a token, so the first call to generate should return the second generated token; currently it returns the first token again. This is historical but quite unintuitive. (A sketch of the intended flow follows this list.)
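
A minimal sketch of the intended flow, assuming a simplified engine: only the names prefill, generate, and self._sample come from this issue; the class skeleton, the dummy `_forward` step, and the cache handling are illustrative assumptions, not the actual jetstream-pytorch API.

```python
import torch


class EngineSketch:
    """Illustrative-only engine; not the real jetstream-pytorch Engine."""

    def __init__(self, vocab_size: int = 32000):
        self.vocab_size = vocab_size

    def _sample(self, logits: torch.Tensor) -> torch.Tensor:
        # Stand-in sampling policy (greedy); the real self._sample may apply
        # temperature / top-k / top-p.
        return torch.argmax(logits, dim=-1)

    def _forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Dummy model step returning random logits so the sketch runs.
        return torch.randn(tokens.shape[0], self.vocab_size)

    def prefill(self, prompt: torch.Tensor):
        # Issue 1: the token produced at the end of prefill should go through
        # self._sample, not a separate implicit argmax.
        logits = self._forward(prompt)
        first_token = self._sample(logits)
        cache = {"prev_token": first_token}  # stand-in for the real KV cache
        return first_token, cache

    def generate(self, cache):
        # Issue 2: prefill already returned the first token, so the first
        # generate call should feed that token in and return the *second*
        # token, instead of re-emitting the first one.
        logits = self._forward(cache["prev_token"].unsqueeze(0))
        next_token = self._sample(logits)
        cache["prev_token"] = next_token
        return next_token, cache


engine = EngineSketch()
prompt = torch.tensor([[1, 2, 3, 4]])
tok0, cache = engine.prefill(prompt)   # first generated token
tok1, cache = engine.generate(cache)   # second token, not tok0 again
```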