huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
134.69k stars · 26.94k forks

Don't listify batched pipeline output from input list #28047

Closed BramVanroy closed 8 months ago

BramVanroy commented 11 months ago

Feature request

Currently, the output type you get from a pipeline seems to depend on the input type. While that makes intuitive sense for distinct primitive types, a difference also seems to be implemented for generators vs. lists vs. Datasets. I'd argue that this leads to unexpected behavior.

Motivation

We can use batching in any pipeline, which according to the documentation enables "streaming". I interpreted this as: the pipeline will return a generator that will yield output one by one. However, looking at the source code, this does not seem the case.

First of all, the output type depends on the type of input passed to the pipeline. Interestingly, when the input is a list (rather than a Dataset or a generator), the output is listified:

https://github.com/huggingface/transformers/blob/c48787f347bd604f656c2cfff730e029c8f8c1fe/src/transformers/pipelines/base.py#L1116-L1122

I am not sure why that is the case. The input type can be disconnected from the output type, so why aren't all iterables handled in the same manner? Is it to keep continuity between input and output types? If so, that is okay, but it feels counter-intuitive to me: if I have a list of samples (like a dataset, just in list form), why should that be treated differently from a Dataset or a generator as input?
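The dispatch described above boils down to roughly the following pattern. This is a toy stand-in, not the actual transformers code: fake_pipe and the string-uppercasing are placeholders for a real pipeline and its inference step.

```python
def fake_pipe(inputs):
    """Toy illustration of type-dependent pipeline output (hypothetical)."""
    def process(batch):
        for item in batch:
            yield item.upper()  # placeholder for real model inference

    if isinstance(inputs, list):
        # list in -> list out: the lazy iterator is eagerly "listified",
        # so the caller waits for all items to be processed
        return list(process(inputs))
    # generator / Dataset in -> iterator out: results stream one by one
    return process(inputs)

print(type(fake_pipe(["a", "b"])))       # <class 'list'>
print(type(fake_pipe(x for x in "ab")))  # <class 'generator'>
```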

Small repro:

from transformers import pipeline

model_name = "microsoft/DialoGPT-small"
pipe = pipeline("conversational", model=model_name, device_map="auto")

list_of_messages = [[{"role": "system", "content": "You're a good assistant!"},
                     {"role": "user", "content": "What is the meaning of 42?"}],
                    [{"role": "user", "content": "What is the meaning of life?"}]]

print(type(pipe(list_of_messages)))
# <class 'list'>

generator_of_msgs = (msg for msg in list_of_messages)
print(type(pipe(generator_of_msgs)))
# <class 'transformers.pipelines.pt_utils.PipelineIterator'>

Your contribution

I do not know what the best option is. It took me quite some digging to understand what was happening with the output types, so I feel this could be standardized. Personally, I'd expect the PipelineIterator NOT to be listified: I see no reason to wait for all processing to complete, except for continuity with the input type, and I don't know whether that is important. For backwards compatibility, an argument could be added to Pipeline.__call__, no_listify_for_list or something like that.
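The backward-compatible opt-out proposed above could look something like this sketch. Note that no_listify_for_list is the commenter's suggested, hypothetical keyword name, and fake_pipe is again a toy stand-in for Pipeline.__call__, not the transformers implementation.

```python
def fake_pipe(inputs, no_listify_for_list=False):
    """Hypothetical sketch of the proposed opt-out flag."""
    results = (item.upper() for item in inputs)  # stand-in for inference
    if isinstance(inputs, list) and not no_listify_for_list:
        return list(results)  # current behavior: wait for all outputs
    return results  # streamed output, even for list input

print(type(fake_pipe(["a", "b"])))                            # <class 'list'>
print(type(fake_pipe(["a", "b"], no_listify_for_list=True)))  # <class 'generator'>
```

The default keeps today's list-in/list-out behavior, so existing callers are unaffected; passing the flag restores lazy, one-by-one output.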

BramVanroy commented 11 months ago

Somewhat related to this, the conversational pipeline's typing and docstrings do not seem correct (which brought me to the issue above):

https://github.com/huggingface/transformers/blob/c48787f347bd604f656c2cfff730e029c8f8c1fe/src/transformers/pipelines/conversational.py#L262-L280

The signature allows a list of dicts (a single conversation) but not a list of lists of dicts (a batch of conversations), although List[Conversation] is allowed. According to the docstrings, a List[dict] is not allowed at all, only Conversation(s). Finally, for compatibility with the pipeline call, other input types (such as a generator or KeyDataset) should also be allowed, but they are not specified.

amyeroberts commented 11 months ago

cc @Narsil

amyeroberts commented 9 months ago

cc @Rocketknight1 for the conversational pipeline docstring part

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

BramVanroy commented 8 months ago

@amyeroberts bump

amyeroberts commented 8 months ago

Gentle ping @Rocketknight1

Rocketknight1 commented 8 months ago

Hi @BramVanroy @amyeroberts - I think the overall issue here is valid, but the ConversationalPipeline is now deprecated and will be removed in a few versions. That functionality is now part of TextGenerationPipeline, which (I think!) has a correct docstring. As such, we probably won't bother updating the docs for ConversationalPipeline before removing it.

BramVanroy commented 8 months ago

@Rocketknight1 Perfect - closing this then!

mwestera commented 2 months ago

If I may, I think part of BramVanroy's comment still applies: the TextGenerationPipeline still insists on lists as input, not permitting a generator or KeyDataset.

The TextGenerationPipeline documentation reflects this restriction, but the higher-level documentation here does suggest that a generator should be possible.
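As a stopgap, a caller who holds a generator but faces a list-only API can materialize the input up front, at the cost of exactly the streaming behavior this thread is about. This is a generic illustration, not transformers-specific code:

```python
# Generic sketch: a list-only API forces a lazy source to be
# materialized before the call, losing streaming.
lazy_prompts = (f"prompt {i}" for i in range(3))  # e.g. read lazily from disk

prompts = list(lazy_prompts)  # everything now lives in memory at once
print(prompts)  # ['prompt 0', 'prompt 1', 'prompt 2']
```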