huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
129.76k stars · 25.78k forks

Extractive summarization pipeline #12460

Open Lukecn1 opened 3 years ago

Lukecn1 commented 3 years ago

πŸš€ Feature request

An extractive summarization pipeline similar to the one for abstractive summarization.

A central place for researchers to upload new models for others to use, without having to run the code from various git repos.

Currently, extractive summarization is the only safe choice for producing textual summaries in practice. Therefore, it seems relevant for Hugging Face to include a pipeline for this task.

This has previously been brought up here: https://github.com/huggingface/transformers/issues/4332, but that issue remains closed, which is unfortunate, as I think it would be a great feature.

Motivation

The current abstractive summarization pipeline is certainly very useful, and a great feature for everyone working on NLG tasks.

However, given the significant problems with factual consistency in abstractive summaries (see, for example: https://arxiv.org/abs/2104.13346, https://arxiv.org/abs/2104.14839), abstractive summaries are still very risky to use in practice, as even state-of-the-art models are riddled with factual errors.

Any thoughts on this? :)

sijunhe commented 2 years ago

I'd be down to work on this!

LysandreJik commented 2 years ago

If we have models on the Hub that are trained to perform this, then it would be fun to have support for it.

WDYT @Narsil @patil-suraj ?

Narsil commented 1 year ago

Seems like a good idea to me.

Since it's performing the same task from a user's perspective, and a given model can only produce one type of summary, I think we should aim to keep a single pipeline + task for this and decide which variant to use based on the AutoModelForXXX class.

In the end, users then don't need to understand the difference between the two; only the model's performance will be the judge, and they don't have to understand the lower-level differences. They can also switch from one to the other with extremely low effort.

We already have an example for doing this in the AutomaticSpeechRecognitionPipeline.
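To make the suggestion concrete, here is a minimal, purely illustrative sketch of the dispatch idea: a single pipeline that inspects the model's class name once at construction time and routes to an extractive or abstractive path. The class names, the `AutoModelForExtractiveSummarization` suffix, and both stub methods are assumptions, not existing transformers APIs.

```python
class SummarizationPipeline:
    """Hypothetical unified pipeline: one user-facing task, two backends."""

    def __init__(self, model):
        self.model = model
        # Decide the summary type once, from the model's class, so the
        # user never has to pick between two separate pipelines.
        if type(model).__name__.endswith("ForExtractiveSummarization"):
            self.type = "extractive"
        else:
            self.type = "abstractive"  # default: seq2seq generation

    def __call__(self, text):
        if self.type == "extractive":
            return self._select_sentences(text)
        return self._generate(text)

    def _select_sentences(self, text):
        # Extractive stub: return source sentences verbatim (here, the lead).
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        return sentences[0] + "."

    def _generate(self, text):
        # Abstractive stub: the real pipeline would call model.generate().
        return self.model.summarize(text)
```

With this shape, swapping an abstractive checkpoint for a hypothetical extractive one changes nothing in user code, which is exactly the low-effort switch described above.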

sijunhe commented 1 year ago

@LysandreJik I looked at the Hub and at the existing model definitions, and I couldn't find much related to extractive summarization. Before we can have this pipeline, don't we need something like AutoModelForExtractiveSummarization?

@Narsil I understand that we can rely on AutoModelForXXX to select the right model type for a given pretrained model. But can we have different preprocessors and postprocessors in the pipeline based on some sort of identifier in the pretrained model? Extractive and abstractive summarization have completely different preprocessors and postprocessors.

Narsil commented 1 year ago

It's the same for ASRPipeline: the `if` can be located in preprocessing, in postprocessing, and in `_forward` without any issues.

There are actually ways we could imagine splitting things into subclasses, but that makes hacking slightly harder (when users send their own custom models), because it's a new layer they have to figure out (which subclass is the right one for me, and how does it work?).

That's why it's not done atm. But if the code becomes a very silly list of `if`s in every method, then we could definitely revisit that choice.
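The structure being described can be sketched as follows: one pipeline class whose `preprocess`/`_forward`/`postprocess` stages each branch on the summary type, with no subclasses. Everything here is illustrative (the sentence splitting, the length-based scoring standing in for real model scores, and the `summary_type` argument are all assumptions, not transformers code).

```python
class UnifiedSummarizationPipeline:
    """Sketch of one pipeline with per-stage branching, ASRPipeline-style."""

    def __init__(self, summary_type="abstractive"):
        assert summary_type in ("abstractive", "extractive")
        self.type = summary_type

    def preprocess(self, text):
        if self.type == "extractive":
            # Extractive models score sentence units, so keep boundaries.
            return {"sentences": [s.strip() for s in text.split(".") if s.strip()]}
        # Abstractive models consume the raw text (tokenization omitted).
        return {"text": text}

    def _forward(self, inputs):
        if self.type == "extractive":
            # Stand-in for per-sentence relevance scores from a real model.
            scores = [len(s) for s in inputs["sentences"]]
            return {"sentences": inputs["sentences"], "scores": scores}
        # Stand-in for model.generate() on the encoded text.
        return {"generated": inputs["text"].split(".")[0]}

    def postprocess(self, outputs):
        if self.type == "extractive":
            # Copy the top-scoring sentence verbatim from the source.
            best = max(zip(outputs["scores"], outputs["sentences"]))[1]
            return {"summary_text": best + "."}
        return {"summary_text": outputs["generated"].strip() + "."}

    def __call__(self, text):
        return self.postprocess(self._forward(self.preprocess(text)))
```

If every method ends up dominated by such branches, that would be the signal to revisit subclassing, as noted above.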

The best way forward imo is to have a starting implementation with actual code so we can discuss further (maybe some preprocessing can actually be shared, or at least argument names aligned, and so on).

utility-aagrawal commented 10 months ago

I wish this feature was already available!

Narsil commented 10 months ago

Do you have models in transformers already performing the task?

Usually, wrapping them up in a pipeline is relatively low effort (compared to creating a new pipeline from scratch).