Hello!
I would not suggest converting your sentences on the fly; you should do it beforehand. The issue you get comes from sentence = sentence.numpy().decode('utf-8'): your sentences should not be loaded into a tf.data.Dataset before being processed.
I recommend reading your file normally, converting your examples with the tokenizer, and only then creating a tf.data.Dataset from the tokenizer's output. The best solution would be to create a TFRecord file and then stream that file into your pipeline.
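For illustration, here is a minimal sketch of the in-memory variant of this approach (the model name and corpus.txt file are assumptions, not from the original thread):

import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # hypothetical model choice

# Read the raw text eagerly, outside of any tf.data pipeline
with open("corpus.txt", encoding="utf-8") as f:  # hypothetical corpus file
    lines = [line.strip() for line in f if line.strip()]

# Tokenize everything up front, in plain Python
encodings = tokenizer(lines, truncation=True, padding="max_length", max_length=512, return_tensors="np")

# Only now wrap the already-numeric arrays in a tf.data.Dataset
dataset = tf.data.Dataset.from_tensor_slices(dict(encodings))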
Hey @jplu, I understand this completely. In fact, I did end up creating TFRecords for better training speed, but I created this issue just to ask whether something was wrong with the tokenizer in the transformers library. As I said before, if I use the tokenizer from the tokenizers library, it works perfectly fine and I can load the data on the fly.
Also, as a side question, does TF masked language modeling require a custom script to mask tokens randomly, as DataCollatorForLanguageModeling does for torch?
You have to create your own function to randomly mask the tokens. There is no such function implemented on the TF side for now.
@jplu Okay thanks. Will you be accepting PRs which implement these functions? Or is someone already working on this?
Here is a function I'm using to do this; you can adapt it to your needs:
import numpy as np

# `tokenizer` is assumed to be an already-instantiated transformers tokenizer
def encode(examples, block_size=512):
    # `examples` is a batch of textual content, the output of a dataset from the datasets lib
    # `block_size` represents the max position size of a model
    input_ids = []
    texts = []
    labels = []
    for example in examples["text"]:
        tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(example))
        # Chunk the text, leaving room for the two special tokens added by prepare_for_model
        for i in range(0, len(tokenized_text), block_size - 2):
            tmp_ids = np.asarray(
                tokenizer.prepare_for_model(
                    tokenized_text[i : i + block_size - 2],
                    padding="max_length",
                    return_attention_mask=False,
                    return_token_type_ids=False,
                )["input_ids"]
            )
            text = " ".join(tokenizer.convert_ids_to_tokens(tmp_ids, skip_special_tokens=True))
            tmp_labels = np.copy(tmp_ids)
            # Select tokens for masking with probability 0.15, never masking special tokens...
            probability_matrix = np.full(tmp_labels.shape, 0.15)
            special_tokens_mask = tokenizer.get_special_tokens_mask(tmp_labels, already_has_special_tokens=True)
            probability_matrix = np.ma.array(probability_matrix, mask=special_tokens_mask, fill_value=0.0).filled()
            # ...and never masking padding tokens
            if tokenizer._pad_token is not None:
                padding_mask = np.equal(tmp_labels, tokenizer.pad_token_id)
                probability_matrix = np.ma.array(probability_matrix, mask=padding_mask, fill_value=0.0).filled()
            masked_indices = np.random.default_rng().binomial(1, probability_matrix) != 0
            # Compute the loss only on masked tokens; -100 is ignored by the loss function
            tmp_labels[~masked_indices] = -100
            # 80% of the time, replace the selected token with the mask token
            indices_replaced = (np.random.default_rng().binomial(1, np.full(tmp_labels.shape, 0.8)) != 0) & masked_indices
            tmp_ids[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
            # 10% of the time (half of the remaining 20%), replace it with a random token
            indices_random = (np.random.default_rng().binomial(1, np.full(tmp_labels.shape, 0.5)) != 0) & masked_indices & ~indices_replaced
            random_words = np.random.randint(len(tokenizer), size=tmp_labels.shape)
            tmp_ids[indices_random] = random_words[indices_random]
            assert tmp_ids.size == tmp_labels.size == block_size, "size input_ids: %r -- size labels: %r" % (tmp_ids.size, tmp_labels.size)
            input_ids.append(tmp_ids.tolist())
            labels.append(tmp_labels.tolist())
            texts.append(text)
    return {"text": texts, "input_ids": input_ids, "labels": labels}
That's nice! Thanks for sharing this. Closing this issue, since an alternative approach to the original question exists.
Environment info
transformers version: master (4.4.0dev0)
Who can help
@jplu
Information
Model I am using (Bert, XLNet ...): None
To reproduce
Steps to reproduce the behavior:
This might be somewhat of a duplicate of #9629, but in a different use case.
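As a rough reconstruction of the pattern described above (not the original snippet), the failing on-the-fly setup looks something like this; the model name and corpus file are assumptions:

import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # hypothetical model

def tokenize(sentence):
    # `sentence` arrives as a scalar tf.string tensor inside the pipeline,
    # hence the sentence.numpy().decode('utf-8') call mentioned above
    sentence = sentence.numpy().decode("utf-8")
    return tokenizer(sentence, truncation=True, padding="max_length", max_length=512)["input_ids"]

dataset = tf.data.TextLineDataset("corpus.txt")  # hypothetical corpus
dataset = dataset.map(lambda s: tf.py_function(tokenize, [s], tf.int32))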
Error
One important thing I should mention here: if I change my code to load the same data using the tokenizers library, the code executes without any issues. I have also tried using the slow implementation, and the error still persists. Any help regarding this would be great!
Expected behavior
Tokenization should happen on the fly without errors as it does with the Tokenizer from the tokenizers library.
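For contrast, a sketch of the on-the-fly variant that reportedly works, using the tokenizers library directly (the vocabulary file is an assumption):

from tokenizers import BertWordPieceTokenizer

fast_tokenizer = BertWordPieceTokenizer("vocab.txt")  # hypothetical vocab file

def tokenize(sentence):
    sentence = sentence.numpy().decode("utf-8")
    return fast_tokenizer.encode(sentence).ids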