huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Consider adding "middle" option for tokenizer truncation_side argument #17947

Closed AndreaSottana closed 2 years ago

AndreaSottana commented 2 years ago

Feature request

Thanks to this PR https://github.com/huggingface/transformers/pull/14947, text can now be truncated from the left instead of only from the right. However, for some NLP tasks, such as summarization of long documents, it might also be advantageous to truncate the middle part of the document instead. For example, if our sequence length is 512 tokens and a document exceeds this length, we might want to keep the first 256 and the last 256 tokens of the document and truncate everything in between. This issue requests the implementation of that option.
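To make the requested behaviour concrete, here is a minimal sketch of head-plus-tail truncation on a plain list of token ids. The function name `truncate_middle` is hypothetical and is not part of transformers; giving the head the extra token when the limit is odd is also just an assumption made for illustration.

```python
from typing import List


def truncate_middle(token_ids: List[int], max_length: int) -> List[int]:
    """Keep the first and last tokens of a sequence and drop the middle.

    Hypothetical helper for illustration only; not a transformers API.
    """
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    if len(token_ids) <= max_length:
        # Nothing to truncate.
        return list(token_ids)
    # Assumption: when max_length is odd, the head gets the extra token.
    head_len = (max_length + 1) // 2
    tail_len = max_length - head_len
    head = token_ids[:head_len]
    # Slice from an explicit start index so that tail_len == 0 yields [].
    tail = token_ids[len(token_ids) - tail_len:]
    return list(head) + list(tail)


# A 1000-token document cut to 512 keeps tokens 0..255 and 744..999.
document = list(range(1000))
truncated = truncate_middle(document, 512)
print(len(truncated), truncated[255], truncated[256])
```

This sketch ignores special tokens and sentence pairs, which a real tokenizer option would also have to handle.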

Motivation

This feature might be helpful because, particularly when dealing with long documents (for example, for Longformer summarization tasks) and depending on the document's domain, the start of the document might set out relevant information and the end might contain a useful recap of the main points discussed. Both can therefore be very relevant and valuable to keep, whereas the text in the middle may not be as important. Adding an option truncation_side="middle", which for a 512-token limit would retain the first 256 and the last 256 tokens, might therefore be very helpful for certain use cases.

Your contribution

I have limited bandwidth right now, but might consider contributing if this can be done as a quick fix and someone from HuggingFace can provide an overview.

LysandreJik commented 2 years ago

WDYT @SaulLu @Narsil ?

SaulLu commented 2 years ago

Hi @AndreaSottana,

Thank you very much for sharing a feature proposal! :hugs:

I understand your use case, but for the moment I will not push for the addition of this feature. My feeling is that it can currently be implemented on top of transformers, and that it touches a problem where users may want many different variants depending on their specific use case.

Of course, if there is a lot of demand for this feature, I will gladly revisit my opinion! (So if you are passing by, please feel free to share what you think :smiley:)

In terms of implementation, I don't think it is a very simple addition, because it would affect all tokenizers, whether slow or fast (and some are really particular, like those of LayoutLM-like models). It would also require a new feature in the Rust tokenizers library.

I'm also very curious to know what you think @Narsil !

AndreaSottana commented 2 years ago

OK, that's fine, thanks a lot for getting back to me @SaulLu. Let's see if there is more appetite; if not, we can leave it here for now. I can always implement the truncation myself for my specific model and tokenizer. I just thought it might be a helpful feature to have, but as you said, we'd need to see how much demand there is. Feel free to close the issue if appropriate.
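For readers who want to do this on the user side, on top of transformers as suggested above, the following is a rough sketch for a single (non-pair) input. The function name `encode_with_middle_truncation` is hypothetical. It assumes the tokenizer exposes `encode`, `num_special_tokens_to_add`, and `build_inputs_with_special_tokens`, and it has not been checked against unusual tokenizers such as those of LayoutLM-like models.

```python
from typing import List


def encode_with_middle_truncation(tokenizer, text: str, max_length: int = 512) -> List[int]:
    """Encode `text`, dropping middle tokens if the result would exceed `max_length`.

    Hypothetical user-side helper; not part of transformers.
    """
    # Tokenize without special tokens and without built-in truncation.
    ids = tokenizer.encode(text, add_special_tokens=False)
    # Reserve room for the special tokens the tokenizer adds to a single sequence.
    budget = max_length - tokenizer.num_special_tokens_to_add(pair=False)
    if budget < 0:
        raise ValueError("max_length is too small to fit the special tokens")
    if len(ids) > budget:
        # Assumption: the head gets the extra token when the budget is odd.
        head_len = (budget + 1) // 2
        tail_len = budget - head_len
        ids = ids[:head_len] + ids[len(ids) - tail_len:]
    # Add the special tokens back around the truncated sequence.
    return tokenizer.build_inputs_with_special_tokens(ids)
```

This returns only the input ids. An attention mask, padding, or sentence-pair handling would need to be added separately, which is part of why many variants are possible.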

Narsil commented 2 years ago

100% agree with @SaulLu .

There might be a use case, but it doesn't seem like a blatantly missing feature (and we try to focus on those). Future readers, make yourselves heard so that we can revisit our opinion :)

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Quadav commented 1 year ago

It is needed indeed! :) To add to the motivation, take a look at the article "How to Fine-Tune BERT for Text Classification?": https://arxiv.org/pdf/1905.05583.pdf. They show that using head+tail achieved the best results. I think the case where the most important content is at the beginning and/or the end is relevant to a lot of fields, including sentiment detection, hate-speech detection, and more.