Closed bhacquin closed 1 year ago
Everything is actually working as intended. The model is indeed seeing 32 items at a time; however, in this case a batch of 32 items is NOT 32 texts.
In order to work on this data, one text pair is constructed for each text + candidate label, so in your case each text generates 2 items to be processed by the model.
This pipeline is actually quite smart: it starts by outputting all the items one by one in a generator fashion, and these are automatically batched (regardless of whether they come from the same text) into a batch of 32. So here that is 16 texts x 2 candidate labels, but it would work the same with any number of candidate labels.
It then runs the model on this batch.
The output is then iteratively debatched and processed one text + candidate_labels at a time, yielding exactly the same output as if it had not been batched (but it was indeed batched, which yields performance speedups, for instance on an appropriate GPU).
Does that answer your question?
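The expand/batch/regroup flow described above can be sketched in plain Python. This is a toy illustration of the bookkeeping, not the actual transformers internals:

```python
# Toy sketch (NOT the real pipeline code) of how a zero-shot pipeline
# expands texts into (text, label) pairs, batches them, and regroups.
texts = [f"text {i}" for i in range(16)]          # 16 input texts
candidate_labels = ["advertisement", "politics"]  # 2 candidate labels

# 1) Expansion: one (text, label) pair per text per candidate label.
items = [(t, l) for t in texts for l in candidate_labels]
assert len(items) == 32  # 16 texts x 2 labels fills a batch of 32

# 2) Batching: the model sees all 32 pairs in one forward pass,
#    regardless of which original text each pair came from.
batch = items[:32]

# 3) Debatching: results are regrouped per original text, so the
#    caller still receives one result per text, as if unbatched.
per_text = [batch[i:i + len(candidate_labels)]
            for i in range(0, len(batch), len(candidate_labels))]
assert len(per_text) == 16
```

So `batch_size=32` changes how many pairs go through the model at once, not how many results come back per iteration.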
More info here:
https://huggingface.co/docs/transformers/v4.22.2/en/main_classes/pipelines#pipeline-chunk-batching
https://github.com/huggingface/transformers/pull/14225
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.21.3

Who can help?
@Narsil
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
```python
from torch.utils.data import Dataset
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

class TextDataset(Dataset):
    ...  # body truncated in the original report; items are dicts with a 'text' key

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli",
                      device=0, framework='pt')
candidate_labels = ['advertisement', 'politics']
dataset = TextDataset(news_list)  # news_list defined elsewhere

for i in classifier(KeyDataset(dataset, 'text'),
                    candidate_labels=candidate_labels, batch_size=32):
    print(i)
    break
```
Out : {'sequence': 'Love is Where It All Begins: adidas X Thebe Magugu Launch .... Herzogenaurach, Aug 15 2022 – Today, adidas launches its latest Tennis collection, created in partnership with contemporary South African...', 'labels': ['advertisement', 'global warming'], 'scores': [0.9311832189559937, 0.0002945002052001655]}
Expected behavior
As I am using batch_size 32, I do expect my output to be a sequence of dicts of length 32. However, it only returns the first element each and every time.
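The observed behavior follows from the pipeline's contract: it is a generator that yields one result dict per input text, while batch_size only controls the internal forward passes. A minimal model-free sketch (the fake_pipeline function and its dummy scores are hypothetical, for illustration only):

```python
# Minimal sketch of why the loop above prints one dict at a time:
# the pipeline yields one result per input text, regardless of
# batch_size. No model is involved here; scores are dummies.
def fake_pipeline(texts, candidate_labels, batch_size=32):
    # batch_size would only affect internal forward passes; the
    # external contract is still one dict per text.
    for text in texts:
        yield {"sequence": text,
               "labels": list(candidate_labels),
               "scores": [1.0 / len(candidate_labels)] * len(candidate_labels)}

texts = [f"article {i}" for i in range(100)]
results = list(fake_pipeline(texts, ["advertisement", "politics"]))
assert len(results) == 100  # one output per text, not one per batch
```

So iterating the real classifier and collecting every yielded dict gives len(dataset) results in total, not batches of 32.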