rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Support packing to a max input sequence length with cudf-subword tokenizer #6089

Open VibhuJawa opened 4 years ago

VibhuJawa commented 4 years ago

Is your feature request related to a problem? Please describe.

Currently, if the tokenized string is shorter than max_length, the output is padded with 0s. So if max(tokenized string lengths) < max_length, this leads to performance penalties, since the compute time for Transformer models is often proportional to the sequence length of the input.

HuggingFace's tokenizer defaults to padding to the max input sequence length in the batch if max_length and pad_to_max_length are not provided. We should try to follow that; it is especially beneficial for the streaming cases that feature https://github.com/rapidsai/cudf/issues/5868 will help with.

See below example:

Padding to max sequence length (Proposed Default Behaviour)

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased', do_lower_case=True)
data = ['a', 'a b', 'a b c d']
output = tokenizer.batch_encode_plus(data, padding=True, add_special_tokens=False, return_tensors='pt')
output['input_ids']

tensor([[1037,    0,    0,    0],
        [1037, 1038,    0,    0],
        [1037, 1038, 1039, 1040]])

Padding to max_length (Current Default Behavior)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased', do_lower_case=True)
output = tokenizer.batch_encode_plus(
    data, truncation=True, max_length=64, pad_to_max_length=True,
    add_special_tokens=False, return_tensors='pt'
)
output['input_ids']
tensor([[1037,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0],
        [1037, 1038,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0],
        [1037, 1038, 1039, 1040,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0]])

Related Implications:

a. We might have to switch from returning one-dimensional cupy arrays to two-dimensional arrays for token-ids and attention masks. We already do this reshape in most workflow cases, so it should not carry performance penalties (see the sketch below).
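
A minimal sketch of why this should be cheap, assuming the tokenizer keeps returning a flat one-dimensional cupy array of length num_rows * max_length as it does today: exposing a two-dimensional output is essentially a zero-copy reshape.

import cupy as cp

# Flat token-id output for 3 strings padded to max_length = 4
num_rows, max_length = 3, 4
flat_tokens = cp.asarray([1037, 0, 0, 0,
                          1037, 1038, 0, 0,
                          1037, 1038, 1039, 1040], dtype=cp.uint32)

# Reshaping a contiguous array returns a view, not a copy
token_ids_2d = flat_tokens.reshape(num_rows, max_length)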

Describe alternatives you've considered

Currently, a user can do the tokenization twice.

  1. First pass to get the maximum sequence length; do this without a to_dlpack call.
  2. Pass that sequence length to the tokenizer again and then convert to tensors using dlpack.

I do the above for gpu-bdb q27 HF (roughly as sketched below). Since most of the time is spent doing to_dlpack, this workaround should not have big performance implications.
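
A rough sketch of that two-pass workaround, assuming the Series.str.subword_tokenize API (hash-file path plus max_length / stride / do_lower / do_truncate keywords, returning flat cupy arrays); exact keyword names may differ across cudf versions, and vocab_hash.txt is a placeholder path.

import cupy as cp
import cudf
from torch.utils.dlpack import from_dlpack

ser = cudf.Series(['a', 'a b', 'a b c d'])
HASH_FILE = 'vocab_hash.txt'   # placeholder vocab hash file
UPPER_BOUND = 64               # generous first-pass max_length

# Pass 1: tokenize with a generous max_length, no dlpack call,
# just to find the longest tokenized sequence in this batch.
tokens, masks, _ = ser.str.subword_tokenize(
    HASH_FILE, max_length=UPPER_BOUND, stride=UPPER_BOUND,
    do_lower=True, do_truncate=True)
masks = masks.reshape(len(ser), UPPER_BOUND)
seq_len = int(masks.sum(axis=1).max())   # longest non-padded sequence

# Pass 2: tokenize again with the tight max_length and hand the
# token ids to PyTorch through dlpack.
tokens, _, _ = ser.str.subword_tokenize(
    HASH_FILE, max_length=seq_len, stride=seq_len,
    do_lower=True, do_truncate=True)
input_ids = from_dlpack(
    tokens.reshape(len(ser), seq_len).astype(cp.int64).toDlpack())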

CC: @raykallen , @randerzander , @davidwendt

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.