salesforce / ctrl-sum

Resources for the "CTRLsum: Towards Generic Controllable Text Summarization" paper
https://arxiv.org/abs/2012.04281
BSD 3-Clause "New" or "Revised" License
146 stars 24 forks

Summary truncated even when min_length is set explicitly #6

Closed jiacheng-xu closed 3 years ago

jiacheng-xu commented 3 years ago

❓ Questions and Help

What is your question?

I tried to use the CTRLsum model to generate summaries. However, the generated summaries are always truncated, even though I explicitly set min_length in the generate function.

Code

from transformers import AutoModelForSeq2SeqLM, PreTrainedTokenizerFast

model = AutoModelForSeq2SeqLM.from_pretrained("hyunwoongko/ctrlsum-cnndm")
tokenizer = PreTrainedTokenizerFast.from_pretrained("hyunwoongko/ctrlsum-cnndm")
inp_doc = r"Relief efforts in Nepal are intensifying after more than 2,300 people were killed in the worst earthquake there in more than 80 years. Rescue missions and aid material have started arriving in the country."

data = tokenizer(inp_doc, return_tensors="pt")
input_ids, attention_mask = data["input_ids"], data["attention_mask"]
decoded = model.generate(input_ids, attention_mask=attention_mask, num_beams=5, min_length=100)
print(decoded)

output: tensor([[ 2, 901, 87, 132, 6, 2965, 82, 58, 848, 11, 5, 2373, 8969, 11, 55, 87, 1812, 107, 4, 5]])

What have you tried?

I took a look at the config.json on the [model card](https://huggingface.co/hyunwoongko/ctrlsum-cnndm/tree/main). There was no hard-coded limit as far as I can tell. I also looked at the [generate function](https://huggingface.co/transformers/_modules/transformers/generation_utils.html#GenerationMixin.generate). early_stopping is False by default.

import transformers
transformers.__version__
'4.3.3'

jxhe commented 3 years ago

Hi, I'll look into this issue. As an immediate workaround, I recommend using the script (generate_bart.py) in the repo, as described in the README, to produce summaries from a collection of keyword-guided documents. It is the original script we used in the paper, so it may be better for experimental comparison.

jiacheng-xu commented 3 years ago

Thanks! Sounds like a plan. I will give it a shot.

jiacheng-xu commented 3 years ago

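I took a closer look at the generate function. You need to set both max_length and min_length. The documentation shows 'max_length=None' in the signature, but when it is left unset it actually falls back to 20. There is no assertion that min_length <= max_length, so with min_length=100 the output was still cut off at 20 tokens, which matches the 20 token ids printed above.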
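
For anyone hitting the same problem, here is a minimal sketch of the workaround, not an official fix. The helper names `resolve_lengths` and `summarize`, and the value max_length=200, are made up for illustration; the fallback of 20 is the value reported above for transformers 4.3.3. The helper only adds the missing min_length <= max_length check before `generate` is called.

```python
from typing import Optional, Tuple

# Fallback reported in this thread when max_length is not passed to generate().
DEFAULT_MAX_LENGTH = 20


def resolve_lengths(
    min_length: int,
    max_length: Optional[int] = None,
    default_max_length: int = DEFAULT_MAX_LENGTH,
) -> Tuple[int, int]:
    """Hypothetical helper: return (min_length, effective max_length).

    Raises ValueError when min_length exceeds the effective max_length,
    which is the check the thread notes is missing in generate().
    """
    effective_max = default_max_length if max_length is None else max_length
    if min_length > effective_max:
        raise ValueError(
            f"min_length={min_length} exceeds max_length={effective_max}; "
            "pass a larger max_length explicitly"
        )
    return min_length, effective_max


def summarize(model, tokenizer, text, min_length=100, max_length=200, num_beams=5):
    """Hypothetical wrapper: generate a summary with both length bounds set."""
    min_length, max_length = resolve_lengths(min_length, max_length)
    data = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        data["input_ids"],
        attention_mask=data["attention_mask"],
        num_beams=num_beams,
        min_length=min_length,
        max_length=max_length,  # set explicitly so the fallback of 20 is not used
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the model and tokenizer loaded as in the original post, `summarize(model, tokenizer, inp_doc)` passes both bounds, while `resolve_lengths(100)` raises instead of silently truncating.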