rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Support packing to a max input sequence length with cudf-subword tokenizer #6089

Open VibhuJawa opened 4 years ago

VibhuJawa commented 4 years ago

Is your feature request related to a problem? Please describe.

Currently, if the tokenized string is shorter than max_length, the output is padded with 0s. So if max(tokenized string lengths) < max_length, this leads to performance penalties, since the compute time for Transformer models is often proportional to the sequence length of the input.

HuggingFace's tokenizer defaults to padding to the max input sequence length in the batch if max_length and pad_to_max_length are not provided. We should try to follow that; it is especially beneficial for the streaming cases that feature https://github.com/rapidsai/cudf/issues/5868 will help with.

See below example:

Padding to max sequence length (Proposed Default Behaviour)

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased', do_lower_case=True)
data = ['a', 'a b', 'a b c d']
output = tokenizer.batch_encode_plus(data, padding=True, add_special_tokens=False, return_tensors='pt')
output['input_ids']

tensor([[1037,    0,    0,    0],
        [1037, 1038,    0,    0],
        [1037, 1038, 1039, 1040]])

Padding to max_length (Current Default Behavior)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased', do_lower_case=True)
output = tokenizer.batch_encode_plus(
    data, truncation=True, max_length=64, pad_to_max_length=True,
    add_special_tokens=False, return_tensors='pt'
)
output['input_ids']
tensor([[1037,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0],
        [1037, 1038,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0],
        [1037, 1038, 1039, 1040,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0]])

Related Implications:

a. We might have to switch from returning one-dimensional cupy arrays to two-dimensional arrays for token-ids and attention masks. We already do this reshape in most workflow cases, so it should not carry performance penalties (see the sketch below).
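
A minimal sketch of why this should be cheap, assuming the tokenizer keeps returning a flat one-dimensional cupy array of length num_rows * max_length as it does today: exposing a two-dimensional output is essentially a zero-copy reshape.

import cupy as cp

# Flat token-id output for 3 strings padded to max_length = 4
num_rows, max_length = 3, 4
flat_tokens = cp.asarray([1037, 0, 0, 0,
                          1037, 1038, 0, 0,
                          1037, 1038, 1039, 1040], dtype=cp.uint32)

# Reshaping a contiguous array returns a view, not a copy
token_ids_2d = flat_tokens.reshape(num_rows, max_length)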

Describe alternatives you've considered

Currently, a user can do the tokenization twice.

  1. First pass to get the maximum sequence length; do this without a to_dlpack call.
  2. Pass that sequence length to the tokenizer again and then convert to tensors using dlpack.

I do the above for gpu-bdb q27 HF (roughly as sketched below). Since most of the time is spent doing to_dlpack, this workaround should not have big performance implications.
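
A rough sketch of that two-pass workaround, assuming the Series.str.subword_tokenize API (hash-file path plus max_length / stride / do_lower / do_truncate keywords, returning flat cupy arrays); exact keyword names may differ across cudf versions, and vocab_hash.txt is a placeholder path.

import cupy as cp
import cudf
from torch.utils.dlpack import from_dlpack

ser = cudf.Series(['a', 'a b', 'a b c d'])
HASH_FILE = 'vocab_hash.txt'   # placeholder vocab hash file
UPPER_BOUND = 64               # generous first-pass max_length

# Pass 1: tokenize with a generous max_length, no dlpack call,
# just to find the longest tokenized sequence in this batch.
tokens, masks, _ = ser.str.subword_tokenize(
    HASH_FILE, max_length=UPPER_BOUND, stride=UPPER_BOUND,
    do_lower=True, do_truncate=True)
masks = masks.reshape(len(ser), UPPER_BOUND)
seq_len = int(masks.sum(axis=1).max())   # longest non-padded sequence

# Pass 2: tokenize again with the tight max_length and hand the
# token ids to PyTorch through dlpack.
tokens, _, _ = ser.str.subword_tokenize(
    HASH_FILE, max_length=seq_len, stride=seq_len,
    do_lower=True, do_truncate=True)
input_ids = from_dlpack(
    tokens.reshape(len(ser), seq_len).astype(cp.int64).toDlpack())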

CC: @raykallen , @randerzander , @davidwendt

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.