huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Summarization length not controlled by max_length, min_length #10912

Open xiaohy9 opened 3 years ago

xiaohy9 commented 3 years ago

I am using the pretrained ctrlsum-cnndm model from transformers. I noticed that the summarization length is not exactly controlled by the max_length and min_length arguments of model.generate(), and I am not sure why. It looks like empty spaces might be included, but I am not sure. Please help. Thanks.

text1="The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("hyunwoongko/ctrlsum-cnndm")
model = AutoModelForSeq2SeqLM.from_pretrained("hyunwoongko/ctrlsum-cnndm")

inputs = tokenizer.encode(text1, return_tensors="pt", max_length=1024, truncation=True)
outputs = model.generate(inputs, max_length=100, min_length=50, num_beams=5, early_stopping=True)
print(tokenizer.decode(outputs[0]))

Results with max_length=100, min_length=50 (actual output: 36 words):
</s> The Eiffel Tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building. It is the tallest structure in Paris and the second tallest free-standing structure in France after the Millau Viaduct.</s>

With max_length=200, min_length=100 (actual output: 83 words):
</s> The Eiffel Tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. It was the tallest man-made structure in the world for 41 years until the Chrysler Building in New York City was finished in 1930. It is the second tallest free-standing structure in France after the Millau Viaduct, which measures 125 metres (410 ft) on each side. The tower is now taller than the Chrysler building by 5.2 metres (17 ft)</s>

NielsRogge commented 3 years ago

max_length and min_length are counted in tokens, not words. Since some words consist of multiple tokens, fewer words are generated than you might expect.
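A quick way to see the difference, reusing the snippet above (which words get split depends on the tokenizer's vocabulary, so the exact output may vary):

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(len(summary.split()))                  # number of words in the summary
print(len(tokenizer.tokenize(summary)))      # number of tokens, which is what max_length/min_length actually limit
print(tokenizer.tokenize("Millau Viaduct"))  # a rare proper noun is usually split into several sub-word tokens

As a rough rule of thumb, English text often comes out around 1.3 tokens per word with BPE-style tokenizers, so a budget of 100 tokens typically yields noticeably fewer than 100 words.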

xiaohy9 commented 3 years ago

@NielsRogge Thanks for the answer, that makes sense. When do words consist of multiple tokens? Can you give me some examples?

Also, would it be better for the arguments (max_length, min_length) to refer to the number of words instead of tokens, so as to give better control over the output, which is natural language meant for humans?

chris-aeviator commented 3 years ago

Running into a similar issue when using generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B'). I get better control when using min_length=..., max_length=..., but I have no real control when, for example, querying for "Below is the code for a react app with a blue button that says 'click me'":

[{'generated_text': "Below is the code for a react app with a blue button that says 'click me' that is to be used by react-router. \nimport React, { Component } from 'react';\n\nimport { Link } from 'react"}]

My result is cut off, and I would be very happy to be able to set a desired length for the result in words.
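One note (a sketch, not a confirmed fix): the text-generation pipeline's max_length is also counted in tokens and includes the prompt, so a small budget cuts the continuation off early. Passing a larger max_length (or max_new_tokens in more recent versions, which counts only the newly generated tokens) leaves more room:

from transformers import pipeline

generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B')
prompt = "Below is the code for a react app with a blue button that says 'click me'"
# max_length counts prompt + continuation tokens; raise it to avoid a truncated result
out = generator(prompt, max_length=300, do_sample=False)
print(out[0]['generated_text'])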

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

chris-aeviator commented 3 years ago

Stale bots are such an anti-quality thing :-/

Zselter07 commented 3 years ago

Running into a similar issue when using generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B'). I get better control when using min_length=..., max_length=..., but I have no real control when, for example, querying for "Below is the code for a react app with a blue button that says 'click me'":

[{'generated_text': "Below is the code for a react app with a blue button that says 'click me' that is to be used by react-router. \nimport React, { Component } from 'react';\n\nimport { Link } from 'react"}]

My result is cut off, and I would be very happy to be able to set a desired length for the result in words.

Same issue for me. Has anyone found a solution for this?

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

chris-aeviator commented 3 years ago

Stale bots are such an anti-quality measure, and this still has not been fixed.

LysandreJik commented 3 years ago

cc @patil-suraj @patrickvonplaten

patrickvonplaten commented 3 years ago

@chris-aeviator - do you want to generate exactly max_length tokens? In that case you have to disable the eos_token_id, i.e. you should be able to just do model.generate(..., eos_token_id=None)
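Applied to the earlier ctrlsum snippet, that would look something like the following (a sketch: with the end-of-sequence token disabled, generation simply runs until it hits max_length tokens, so the summary may end mid-sentence):

# early_stopping is dropped here because it only matters when beams can finish at an EOS token
outputs = model.generate(inputs, max_length=100, min_length=50, num_beams=5, eos_token_id=None)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))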

mayeulk commented 3 weeks ago

When do words consist of multiple tokens? Can you give me some examples?

I am unsure about English. In French, some words are grammatical contractions of several words. For instance, "La fierté du pays" = "The pride of the country", where "du" is the contraction of "de le" (which literally means "of the"). So that is one word ("du") standing for two ("de le"). I guess you have a similar thing in English with "I wanna go" = "I want to go" ("wanna" => 2 tokens).
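Note that what max_length actually counts is the model tokenizer's sub-word tokens, which may or may not line up with this kind of grammatical analysis. You can inspect the split for any phrase directly; a small sketch reusing the ctrlsum tokenizer from this thread (the output depends entirely on that tokenizer's vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hyunwoongko/ctrlsum-cnndm")
print(tokenizer.tokenize("La fierté du pays"))  # French words may be split into several sub-word pieces
print(tokenizer.tokenize("I wanna go"))         # whether "wanna" stays one piece or splits depends on the vocabulary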