Open not-lain opened 6 months ago
Hi @not-lain, thanks for opening a feature request!
using tokenizer.apply_chat_template then other stuff then model.generate is pretty repetitive
Could you elaborate on this a bit, e.g. with a code snippet? Is it the streaming feature when generating that you wish to be able to use?
@amyeroberts Normally, when someone wants to stream their output (example: https://huggingface.co/spaces/ysharma/Chat_with_Meta_llama3_8b), they need to write all of that code themselves. This has been quite a repetitive process for AI demos, and I thought we could implement it within the transformers library.
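For context, here is a minimal sketch of the boilerplate I mean, using transformers' TextIteratorStreamer (the checkpoint name is just an example):

```python
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Tell me a joke."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The streamer yields decoded text as generate() produces tokens,
# so generate() has to run in a background thread.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
thread = Thread(
    target=model.generate,
    kwargs=dict(input_ids=inputs, streamer=streamer, max_new_tokens=256),
)
thread.start()

for new_text in streamer:
    print(new_text, end="", flush=True)
thread.join()
```

Every chat demo ends up reimplementing roughly this thread-plus-streamer dance, which is the part a streaming pipeline could hide.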
I was initially thinking about integrating this only with text-generation models, but I think we can do the same for image-to-text models. This is a good resource for that: https://huggingface.co/blog/idefics#getting-started-with-idefics
Thanks for sharing an example!
I'm not sure this is really something we want to add to the pipelines. Pipelines are intended to be simple objects which enable users to get predictions in one line, they're not intended to support all transformers' functionality. In this case, I think it makes sense to leave streaming outside as it enables the user to have full control of the threads and yielding logic.
cc @Rocketknight1 @gante for your thoughts
Yeah, I'm on @amyeroberts's side here - pipelines are (imo) a sort of high-level "on-ramp" API for transformers, which makes it easy for users to quickly get outputs from common workflows. We definitely don't want to pack them full of features to handle every use case - that's what the lower-level API is for! If we make pipelines very feature-heavy, they become big and confusing for new users, which defeats their purpose.
Once users are streaming output and working with threads/yielding/async/etc. they're probably advanced enough that they don't need the pipelines anyway.
Personally, I'd love to have streaming support in pipelines - it's the one missing feature. Streaming is currently quite difficult to use, and this would make it so much easier.
FYI: we will be refactoring generate over the next few weeks, including adding better support for yield. It may work with pipelines, but that would be a side effect: as @Rocketknight1 wrote, we don't want to pack too many features in there, as it would defeat the point. The pipeline API is not designed to work with async stuff :)
It's ok, I understand. I will also take a look at the generate issue; maybe I can help out a little.
generate refactor tracker: https://github.com/huggingface/transformers/issues/30810
Feature request
add option to stream output from pipeline
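Hypothetically, usage could look something like this (the stream argument below does not exist; it is only meant to illustrate the request):

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="google/gemma-1.1-2b-it")

# stream=True is hypothetical: the pipeline would yield text chunks
# instead of returning the full generation at once.
for new_text in pipe("Tell me a joke.", stream=True, max_new_tokens=128):
    print(new_text, end="", flush=True)
```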
Motivation
Using tokenizer.apply_chat_template, then other steps, then model.generate is pretty repetitive, and I think it's time to integrate this with pipelines. It's also time to add streaming support to pipelines.
Your contribution
I can provide this resource as a reference: a PR I made with the requested feature, https://huggingface.co/google/gemma-1.1-2b-it/discussions/14. Another tip: don't use yield and return in the same function; separate them instead (it's a Python quirk, illustrated below). Sadly, I'm a bit busy lately to open a PR, but if I can find some time I'll try to help out.
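To make the yield/return gotcha concrete, here is a minimal sketch:

```python
# A single yield anywhere turns the WHOLE function into a generator,
# so a return value is never handed back to the caller directly.
def generate_text(prompt, stream=False):
    if stream:
        yield "to"
        yield "ken"
        return
    return "full output"  # in a generator, this only sets StopIteration.value

out = generate_text("hi", stream=False)
print(out)  # <generator object generate_text at 0x...>, not "full output"

# The fix: keep the streaming and non-streaming paths in separate functions.
def stream_text(prompt):
    yield "to"
    yield "ken"

def full_text(prompt):
    return "full output"
```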