huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Supporting truncation from both ends of the sequence in BertTokenizerFast #10082

shangw-nvidia commented 3 years ago

🚀 Feature request

For BertTokenizerFast (which inherits from PreTrainedTokenizerFast), it seems that __call__ only supports truncating from the end of the sequences when truncation is set to longest_first, only_first, or only_second. For example, assuming max_length is 6 and truncation is longest_first:

(I have a pen, I have an apple) --> truncation --> (I have a, I have an)
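For concreteness, here is a minimal sketch of that behavior (assuming the bert-base-uncased checkpoint, with special tokens disabled so the lengths line up with the example above):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoded = tokenizer(
    "I have a pen",
    "I have an apple",
    truncation="longest_first",
    max_length=6,
    add_special_tokens=False,  # keep lengths comparable to the example above
)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Prints ['i', 'have', 'a', 'i', 'have', 'an']: tokens are only ever
# dropped from the end of each sequence.
```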

However, if we take a closer look at Google's original data-preprocessing script for BERT, truncation can happen at both ends of the sequence randomly:

(I have a pen, I have an apple) --> truncation --> (I have a, have an apple) or (have a pen, I have an) or (I have a, I have an) or (have a pen, have an apple)
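The relevant logic from Google's create_pretraining_data.py looks roughly like this (paraphrased in Python; truncate_seq_pair follows the original script):

```python
import random

def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng=random):
    """Truncate a token-list pair in place until it fits max_num_tokens."""
    while len(tokens_a) + len(tokens_b) > max_num_tokens:
        # Always shorten the currently longer sequence, so the pair stays balanced.
        trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        assert len(trunc_tokens) >= 1
        # Flip a coin to decide which end to drop from; this is the
        # random front/back truncation described above.
        if rng.random() < 0.5:
            del trunc_tokens[0]  # drop from the front
        else:
            trunc_tokens.pop()   # drop from the back
```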

For BertTokenizer, perhaps I could reassign its truncate_sequences member function (https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_utils_base.py#L2887) with a new function that implements Google's truncation scheme; however, for BertTokenizerFast, truncation is handled entirely in Rust, which I cannot override from Python.
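For illustration, such a reassignment could look like the sketch below (assuming truncate_sequences keeps its (ids, pair_ids, overflowing_tokens) return contract; stride and the non-longest_first strategies are ignored for brevity):

```python
import random
import types

from transformers import BertTokenizer

def random_end_truncate_sequences(self, ids, pair_ids=None, num_tokens_to_remove=0,
                                  truncation_strategy="longest_first", stride=0):
    # Mimic Google's scheme: repeatedly shorten the longer sequence,
    # dropping a token from a randomly chosen end.
    overflowing_tokens = []
    for _ in range(num_tokens_to_remove):
        trunc = ids if pair_ids is None or len(ids) > len(pair_ids) else pair_ids
        if random.random() < 0.5:
            overflowing_tokens.append(trunc.pop(0))  # drop from the front
        else:
            overflowing_tokens.append(trunc.pop())   # drop from the back
    return ids, pair_ids, overflowing_tokens

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Shadow the inherited method on this instance only.
tokenizer.truncate_sequences = types.MethodType(random_end_truncate_sequences, tokenizer)
```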

An alternative is to call tokenize first, then truncate the sequence using Google's scheme, then call __call__ with is_split_into_words=True. However, this approach has a significant performance impact compared to calling __call__ on a batch of sequences directly (the average total tokenization latency doubled in our experiments).

PS: It turns out is_split_into_words doesn't work this way (when __call__ sees a subword like ##abc, it tokenizes it further into # # abc even if is_split_into_words==True). Thus, the actual (but slow) alternative is to 1) call tokenize, 2) apply the truncation scheme while making sure a subword starting with ## doesn't end up at the boundary, 3) call convert_tokens_to_string, and 4) call __call__. Effectively, this alternative tokenizes the same sequence twice, as sketched below.
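Concretely, the slow workaround would look something like this (a sketch; random_truncate is a hypothetical helper, and sequence pairs are omitted for brevity):

```python
import random

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def random_truncate(tokens, max_len, rng=random):
    # Step 2: drop wordpieces from a random end, making sure the result
    # never starts with a dangling "##" continuation piece.
    tokens = list(tokens)
    while len(tokens) > max_len:
        if rng.random() < 0.5:
            tokens.pop(0)
            while tokens and tokens[0].startswith("##"):
                tokens.pop(0)
        else:
            tokens.pop()
    return tokens

# 1) tokenize, 2) truncate, 3) back to a string, 4) encode for real.
tokens = tokenizer.tokenize("I have an unbelievable apple")
tokens = random_truncate(tokens, max_len=4)
text = tokenizer.convert_tokens_to_string(tokens)
encoded = tokenizer(text)  # the same text is effectively tokenized twice
```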

I'm wondering if it's possible to add official support for random truncation from both ends of the sequence?

Motivation

To match Google's truncation scheme exactly and minimize artificial impact on pretraining convergence.

Your contribution

Unfortunately, I'm not very familiar with Rust (I can read it, but I've never learned or written it), so I can't help much.

LysandreJik commented 3 years ago

Hi, thanks for opening an issue! We have the padding_side tokenizer attribute, but unfortunately it has no counterpart for truncation. @n1t0, what do you think?
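For context, padding_side only controls which side padding tokens are added to, e.g.:

```python
from transformers import BertTokenizerFast

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
tok.padding_side = "left"  # [PAD] tokens now go on the left...
enc = tok("I have a pen", padding="max_length", max_length=8)
# ...but there is no analogous attribute to control the truncation side.
```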

shangw-nvidia commented 3 years ago

@LysandreJik Thanks a lot for your response! @n1t0 I'm wondering what your thoughts are on this feature?