Closed BramVanroy closed 8 months ago
Somewhat related to this, the conversational pipeline's typing and docstrings do not seem correct (which brought me to the issue above):
The signature allows a list of dicts (a single conversation) but not a list of lists of dicts (a batch of conversations), although List[Conversation] is allowed. According to the docstrings, a List[dict] is not allowed either; only Conversation(s) are. Finally, for compatibility with the pipeline call, other input types (such as a generator or KeyDataset) should also be allowed, but they are not specified.
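For illustration, a broadened signature covering the cases above might look like the sketch below. This is not the actual transformers code; the ChatType alias and the normalize_conversations helper are made-up names to show the shape of the inputs being discussed:

```python
from typing import Dict, List, Union

# Hypothetical alias: one conversation is a list of message dicts.
ChatType = List[Dict[str, str]]

def normalize_conversations(inputs: Union[ChatType, List[ChatType]]) -> List[ChatType]:
    """Return a batch of conversations regardless of the input shape."""
    if inputs and isinstance(inputs[0], dict):
        # A single conversation (list of dicts) becomes a batch of one.
        return [inputs]
    # Already a batch (list of lists of dicts).
    return list(inputs)
```

A generator or KeyDataset input would need an extra branch, but the point is that both the single-conversation and batched shapes can be accepted and normalized up front.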
cc @Narsil
cc @Rocketknight1 for the conversational pipeline docstring part
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@amyeroberts bump
Gentle ping @Rocketknight1
Hi @BramVanroy @amyeroberts - I think the overall issue here is valid, but the ConversationalPipeline is now deprecated and will be removed in a few versions. That functionality is now part of TextGenerationPipeline, which (I think!) has a correct docstring. As such, we probably won't bother updating the docs for ConversationalPipeline before removing it.
@Rocketknight1 Perfect - closing this then!
If I may, I think part of BramVanroy's comment still applies: the TextGenerationPipeline still insists on lists as input and does not permit a generator or KeyDataset. The TextGenerationPipeline documentation reflects this restriction, but the higher-level documentation here does suggest that a generator should be possible.
Feature request
Currently, the output that you get from a pipeline depends on the input type. While that intuitively makes sense for distinct primitive types, a difference is also implemented for generators vs. lists vs. Datasets. I'd argue that this leads to unexpected behavior.
Motivation
We can use batching in any pipeline, which according to the documentation enables "streaming". I interpreted this as: the pipeline will return a generator that yields outputs one by one. However, looking at the source code, this does not seem to be the case.
First of all, the output format depends on the type of input passed to the pipeline. Interestingly, when the input is a list (rather than a Dataset or a generator), the output is listified:
https://github.com/huggingface/transformers/blob/c48787f347bd604f656c2cfff730e029c8f8c1fe/src/transformers/pipelines/base.py#L1116-L1122
I am not sure why that is the case. The input type can be disconnected from the output type, so why aren't all iterables handled in the same manner? Is it to keep continuity between input and output types? If so, that is acceptable, but it feels counter-intuitive to me: if I have a list of samples (essentially a dataset, just in list form), why should that be treated differently from a Dataset or a generator?
Small repro:
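(The original snippet was not preserved here. The behavior can be mimicked in plain Python with a simplified sketch of the linked logic; run_pipeline is a made-up stand-in for Pipeline.__call__, not transformers code:)

```python
from typing import Any, Callable, Iterator, List, Union

def run_pipeline(inputs: Any, process: Callable) -> Union[List, Iterator]:
    """Mimic how the pipeline chooses its output container.

    List inputs are eagerly consumed and listified; generators (and, in the
    real code, Datasets) get a lazy iterator instead.
    """
    iterator = (process(item) for item in inputs)
    if isinstance(inputs, list):
        # List input: wait for all processing to finish, return a list.
        return list(iterator)
    # Generator/Dataset input: return the lazy iterator as-is.
    return iterator

# The same data produces different output types depending on the container:
as_list = run_pipeline([1, 2, 3], lambda x: x * 2)              # a list
as_gen = run_pipeline((x for x in [1, 2, 3]), lambda x: x * 2)  # a generator
```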
Your contribution
I do not know what the best option is. It took me quite some digging before I understood what was happening with the output types, so I feel that this could be standardized. Personally I'd expect the PipelineIterator NOT to be listified. I do not see any reason to wait for all processing to complete, except for continuity with the input type, but I don't know whether that is important. For backwards compatibility an argument could be added to Pipeline.__call__, e.g. no_listify_for_list or something like that.
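A sketch of how such an opt-out could behave (no_listify_for_list is the hypothetical name from above; none of this is actual transformers API):

```python
from typing import Any, Callable, Iterator, List, Union

def pipeline_call(
    inputs: Any,
    process: Callable,
    no_listify_for_list: bool = False,
) -> Union[List, Iterator]:
    """Hypothetical opt-out: with no_listify_for_list=True, list inputs
    also return a lazy iterator instead of being collected into a list."""
    iterator = (process(x) for x in inputs)
    if isinstance(inputs, list) and not no_listify_for_list:
        return list(iterator)  # current behavior: eager list
    return iterator  # streamed output, yielded one by one
```

Defaulting the flag to False keeps existing callers unaffected while letting anyone who wants streaming-style output opt in.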