huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Zero-Shot Classification - Pipeline - Batch Size #19063

Closed bhacquin closed 1 year ago

bhacquin commented 2 years ago

System Info

Who can help?

@Narsil

Information

Tasks

Reproduction

```python
from torch.utils.data import Dataset
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset


class TextDataset(Dataset):
    def __init__(self, list_of_text):
        self.news = list_of_text

    def __len__(self):
        return len(self.news)

    def __getitem__(self, idx):
        return {'text': self.news[idx]}


classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0, framework='pt')

candidate_labels = ['advertisement', 'politics']
dataset = TextDataset(news_list)

for i in classifier(KeyDataset(dataset, 'text'), candidate_labels=candidate_labels, batch_size=32):
    print(i)
    break
```

Out : {'sequence': 'Love is Where It All Begins: adidas X Thebe Magugu Launch .... Herzogenaurach, Aug 15 2022 – Today, adidas launches its latest Tennis collection, created in partnership with contemporary South African...', 'labels': ['advertisement', 'global warming'], 'scores': [0.9311832189559937, 0.0002945002052001655]}

Expected behavior

Since I am using batch_size=32, I expect the output to be a sequence of 32 dicts. However, it only returns one element at a time.

Narsil commented 2 years ago


Actually, everything is working as intended. The model is indeed seeing 32 items at a time; however, 32 items in this case is NOT 32 texts.

To process this data, one text pair is constructed for each (text, candidate_label) combination, so in your case each text generates 2 items for the model to process.

This pipeline is actually quite smart: it starts by yielding all the items one by one in a generator fashion, and they are automatically batched (regardless of whether they come from the same text) into a batch of 32 (so here 16 texts × 2 candidate labels, but it works the same with any number of candidate labels).
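The expansion described above can be sketched outside the pipeline. This is an illustration of the counting, not the pipeline's actual internals; the hypothesis template string is the documented default for this task, and the texts are placeholders:

```python
# Each text is paired with every candidate label, so batch_size counts
# (text, hypothesis) pairs, not texts.
texts = [f"text {i}" for i in range(16)]
candidate_labels = ["advertisement", "politics"]

pairs = [(text, f"This example is {label}.")  # default hypothesis template
         for text in texts
         for label in candidate_labels]

print(len(pairs))  # 16 texts x 2 labels = 32 items seen by the model
```

With batch_size=32, one forward pass therefore covers 16 of your texts, not 32.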

It then runs the model on that batch.

The output is then iteratively debatched and processed one text + candidate_labels at a time, yielding exactly the same output as the unbatched case (but it was indeed batched under the hood, which gives performance speedups on an appropriate GPU, for instance).
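The debatching step can be sketched as regrouping the flat batch of per-pair scores back into one group per text. Again, this is an illustration with made-up scores, not the pipeline's internals:

```python
# Model scores come back as one flat batch of (text, label) pairs;
# debatching regroups them per text so each yielded dict looks the same
# as in the unbatched case.
num_labels = 2
flat_scores = [0.9, 0.1, 0.3, 0.7]  # hypothetical scores for 2 texts

per_text = [flat_scores[i:i + num_labels]
            for i in range(0, len(flat_scores), num_labels)]
print(per_text)  # [[0.9, 0.1], [0.3, 0.7]] -> one result dict per text
```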

Does that answer your question?

More info there: https://huggingface.co/docs/transformers/v4.22.2/en/main_classes/pipelines#pipeline-chunk-batching https://github.com/huggingface/transformers/pull/14225

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.