huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Pipeline(summarization) code example and documentation needs updating #23054

Closed TomBerton closed 1 year ago

TomBerton commented 1 year ago

System Info

Using Google Colab on Mac OS Ventura 13.2.1 Chrome Version 112.0.5615.137 (Official Build) (x86_64)

Using the install command `!pip install transformers`, which installs the following:

[Screenshot: output of `!pip install transformers` showing the installed package versions, 2023-04-28]

Who can help?

@Narsil

Information

Tasks

Reproduction

In the documentation for the summarization pipeline here, the example needs updating. The current example is shown below:

    # use bart in pytorch
    summarizer = pipeline("summarization")
    summarizer("An apple a day, keeps the doctor away", min_length=5, max_length=20)

Produces the following output in Google Colab:

    Using a pipeline without specifying a model name and revision in production is not recommended.
    Your max_length is set to 20, but you input_length is only 11. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)
    [{'summary_text': ' An apple a day, keeps the doctor away from your doctor away, says Dr.'}]
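
For reference, a minimal sketch of what "specifying a model name and revision" could look like; the model id and revision below are assumptions (the checkpoint the summarization pipeline typically defaults to), not a recommendation:

    from transformers import pipeline

    # Sketch only: pin the checkpoint and revision instead of relying on the default.
    summarizer = pipeline(
        "summarization",
        model="sshleifer/distilbart-cnn-12-6",  # assumed default summarization checkpoint
        revision="a4f8f3e",                     # assumed revision; pin to a known commit hash
    )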

The documentation doesn't state what `min_length=` and `max_length=` actually do, and the output doesn't tell you either.

  1. Is the max_length the maximum token length of the output or input?
  2. Based on the output from running the code, does the input length affect the output?
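
To make the questions concrete, here is a minimal sketch with a longer input, assuming (as the warning seems to suggest) that `min_length`/`max_length` count generated summary tokens rather than input tokens:

    from transformers import pipeline

    summarizer = pipeline("summarization")

    long_text = (
        "The tower is 324 metres tall, about the same height as an 81-storey building, "
        "and the tallest structure in Paris. Its base is square, measuring 125 metres on each side."
    )

    # If max_length bounds the generated summary, this call should return at most 30 tokens.
    print(summarizer(long_text, min_length=5, max_length=30))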

Running this code:

    # use t5 in tf
    summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="tf")
    summarizer("An apple a day, keeps the doctor away", min_length=5, max_length=20)

Produces the following output in Google Colab:

    Your max_length is set to 20, but you input_length is only 13. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=6)
    /usr/local/lib/python3.10/dist-packages/transformers/generation/tf_utils.py:745: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
      warnings.warn(
    [{'summary_text': 'an apple a day, keeps the doctor away from the doctor .'}]
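
For context, the generation configuration that the warning points to can also be set up outside the pipeline; a minimal sketch (PyTorch, reusing `t5-base` from the example above, parameter choices assumed):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

    tokenizer = AutoTokenizer.from_pretrained("t5-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

    # Control generation through a GenerationConfig instead of editing the model config.
    gen_config = GenerationConfig(min_length=5, max_length=20)

    inputs = tokenizer("summarize: An apple a day, keeps the doctor away", return_tensors="pt")
    summary_ids = model.generate(**inputs, generation_config=gen_config)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))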

Expected behavior

  1. Show the expected output by using longer text as the input.
  2. Provide a clear explanation of what min_length= and max_length= actually do.
  3. Avoid warnings when running example code from the documentation, or specify a stable version to use.

Narsil commented 1 year ago
  1. I beg to differ. Examples are meant to be simple to read; having a real long-form text just hinders readability imo.

  2. min_length and max_length are specified here: https://huggingface.co/docs/transformers/v4.28.1/en/main_classes/text_generation#transformers.GenerationMixin.greedy_search.max_length

  3. @sgugger What do you think here? I agree examples shouldn't raise warnings, however I feel odd about burning the name of a specific model into this example, since users are likely not to understand where to get that model id from.

    # Fetch summarization models at https://huggingface.co/models?pipeline_tag=summarization&sort=downloads
    summarizer = pipeline(model="philschmid/bart-large-cnn-samsum")

Something like that. That probably affects ALL examples within pipelines.
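
As a usage note, a minimal sketch of how the suggested pattern would read end to end (the call and its arguments are carried over from the original example, not from the comment above):

    from transformers import pipeline

    # Fetch summarization models at https://huggingface.co/models?pipeline_tag=summarization&sort=downloads
    summarizer = pipeline(model="philschmid/bart-large-cnn-samsum")
    print(summarizer("An apple a day, keeps the doctor away", min_length=5, max_length=20))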

sgugger commented 1 year ago

cc @gante. The warning somehow needs to be addressed so that users of the pipeline function do not see it.

gante commented 1 year ago

Hi @TomBerton 👋

The warnings you described were updated in #23128, which should make the pipeline experience more pleasant and self-documenting 🤗

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.