shangw-nvidia opened 3 years ago
Hi, thanks for opening an issue! We have the `padding_side` tokenizer attribute, but it doesn't work for truncation, unfortunately. @n1t0, what do you think?
@LysandreJik Thanks a lot for your response! @n1t0 I'm wondering what your thoughts are on this feature?
🚀 Feature request
For `BertTokenizerFast` (inherited from `PreTrainedTokenizerFast`), it seems like `__call__` only supports truncating from the end of the sequences when `truncation` is set to `longest_first`, `only_first`, or `only_second`. For example, assuming `max_length` is 6 and `truncation` is `longest_first`:

(`I have a pen`, `I have an apple`) --> truncation --> (`I have a`, `I have an`)

However, if we take a closer look at Google's original data-preprocessing script for BERT, truncation can happen at both ends of the sequence randomly:

(`I have a pen`, `I have an apple`) --> truncation --> (`I have a`, `have an apple`) or (`have a pen`, `I have an`) or (`I have a`, `I have an`) or (`have a pen`, `have an apple`)

For `BertTokenizer`, perhaps I could reassign its `truncate_sequences` member function (https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_utils_base.py#L2887) with a new function that implements Google's truncation scheme; however, for `BertTokenizerFast`, truncation is handled completely in Rust, about which I can't do anything.

An alternative is to call `tokenize` first, truncate the sequences using Google's scheme, and then call `__call__` with `is_split_into_words` set to `True`. However, this approach has a significant performance impact compared to calling `__call__` on a batch of sequences directly (the average total tokenization latency doubled in our experiments).

I'm wondering if it's possible to add official support for random truncation from both ends of the sequence?
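For reference, the scheme in question is the `truncate_seq_pair` helper from Google's `create_pretraining_data.py`; a plain-Python sketch of that behaviour (the function name and signature follow the original script, the example sentences are the ones above) looks roughly like this:

```python
import random


def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng):
    """Truncate a pair of token lists in place until their combined length
    fits within max_num_tokens, removing one token at a time from the longer
    sequence, randomly from either the front or the back."""
    while len(tokens_a) + len(tokens_b) > max_num_tokens:
        # Always trim the longer of the two sequences.
        trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        assert len(trunc_tokens) >= 1
        # Randomly drop a token from the front or the back.
        if rng.random() < 0.5:
            del trunc_tokens[0]
        else:
            trunc_tokens.pop()


a = "I have a pen".split()
b = "I have an apple".split()
truncate_seq_pair(a, b, 6, random.Random(0))
print(a, b)  # with this seed, both sequences are truncated from the back
```

Because the front/back choice is random per removed token, any of the four truncated pairs listed above can occur, unlike the fast tokenizer's deterministic end-of-sequence truncation.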
Motivation
To match Google's truncation scheme exactly and minimize artificial impact on pretraining convergence.
Your contribution
Unfortunately, I'm not very familiar with Rust (I can read it, but I've never learned or written Rust before), so I can't help much.