Hello!
I would not suggest converting your sentences on the fly; you should do it beforehand. The issue you get comes from sentence = sentence.numpy().decode('utf-8'): your sentences should not be loaded into a tf.data.Dataset before being processed.
I recommend reading your file normally, converting your examples with the tokenizer, and only then creating a tf.data.Dataset from the tokenizer's output. The best solution would be to create a TFRecord file and then stream that file into your pipeline.
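For illustration, here is a minimal sketch of the in-memory variant of this approach (the model name and corpus.txt file are assumptions, not from the original thread):

import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # hypothetical model choice

# Read the raw text eagerly, outside of any tf.data pipeline
with open("corpus.txt", encoding="utf-8") as f:  # hypothetical corpus file
    lines = [line.strip() for line in f if line.strip()]

# Tokenize everything up front, in plain Python
encodings = tokenizer(lines, truncation=True, padding="max_length", max_length=512, return_tensors="np")

# Only now wrap the already-numeric arrays in a tf.data.Dataset
dataset = tf.data.Dataset.from_tensor_slices(dict(encodings))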
Hey @jplu, I understand this completely. In fact, I did end up creating TFRecords for better training speed, but I created this issue just to ask whether something was wrong with the tokenizer in the transformers library. As I said before, if I use the tokenizer from the tokenizers library, it works perfectly fine and I can load the data on the fly.
Also, as a side question, does TF masked language modeling require a custom script to mask tokens randomly, as DataCollatorForLanguageModeling does for torch?
You have to create your own function to randomly mask the tokens. There is no such function implemented on the TF side for now.
@jplu Okay thanks. Will you be accepting PRs which implement these functions? Or is someone already working on this?
Here is a function I'm using to do this; you can adapt it to your needs:
import numpy as np

# `tokenizer` is assumed to be an already-instantiated transformers tokenizer
def encode(examples, block_size=512):
    # `examples` is a batch of textual content, the output of a dataset from the datasets lib
    # `block_size` represents the max position size of a model
    input_ids = []
    texts = []
    labels = []
    for example in examples["text"]:
        tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(example))
        # Chunk the text, leaving room for the two special tokens added by prepare_for_model
        for i in range(0, len(tokenized_text), block_size - 2):
            tmp_ids = np.asarray(
                tokenizer.prepare_for_model(
                    tokenized_text[i : i + block_size - 2],
                    padding="max_length",
                    return_attention_mask=False,
                    return_token_type_ids=False,
                )["input_ids"]
            )
            text = " ".join(tokenizer.convert_ids_to_tokens(tmp_ids, skip_special_tokens=True))
            tmp_labels = np.copy(tmp_ids)
            # Select tokens for masking with probability 0.15, never masking special tokens...
            probability_matrix = np.full(tmp_labels.shape, 0.15)
            special_tokens_mask = tokenizer.get_special_tokens_mask(tmp_labels, already_has_special_tokens=True)
            probability_matrix = np.ma.array(probability_matrix, mask=special_tokens_mask, fill_value=0.0).filled()
            # ...and never masking padding tokens
            if tokenizer._pad_token is not None:
                padding_mask = np.equal(tmp_labels, tokenizer.pad_token_id)
                probability_matrix = np.ma.array(probability_matrix, mask=padding_mask, fill_value=0.0).filled()
            masked_indices = np.random.default_rng().binomial(1, probability_matrix) != 0
            # Compute the loss only on masked tokens; -100 is ignored by the loss function
            tmp_labels[~masked_indices] = -100
            # 80% of the time, replace the selected token with the mask token
            indices_replaced = (np.random.default_rng().binomial(1, np.full(tmp_labels.shape, 0.8)) != 0) & masked_indices
            tmp_ids[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
            # 10% of the time (half of the remaining 20%), replace it with a random token
            indices_random = (np.random.default_rng().binomial(1, np.full(tmp_labels.shape, 0.5)) != 0) & masked_indices & ~indices_replaced
            random_words = np.random.randint(len(tokenizer), size=tmp_labels.shape)
            tmp_ids[indices_random] = random_words[indices_random]
            assert tmp_ids.size == tmp_labels.size == block_size, "size input_ids: %r -- size labels: %r" % (tmp_ids.size, tmp_labels.size)
            input_ids.append(tmp_ids.tolist())
            labels.append(tmp_labels.tolist())
            texts.append(text)
    return {"text": texts, "input_ids": input_ids, "labels": labels}
That's nice! Thanks for sharing this. Closing this issue, since an alternative approach to the original question exists.
Environment info
transformers version: master (4.4.0dev0)
Who can help
@jplu
Information
Model I am using (Bert, XLNet ...): None
To reproduce
Steps to reproduce the behavior:
This might be somewhat of a duplicate of #9629, but in a different use case.
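As a rough reconstruction of the pattern described above (not the original snippet), the failing on-the-fly setup looks something like this; the model name and corpus file are assumptions:

import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # hypothetical model

def tokenize(sentence):
    # `sentence` arrives as a scalar tf.string tensor inside the pipeline,
    # hence the sentence.numpy().decode('utf-8') call mentioned above
    sentence = sentence.numpy().decode("utf-8")
    return tokenizer(sentence, truncation=True, padding="max_length", max_length=512)["input_ids"]

dataset = tf.data.TextLineDataset("corpus.txt")  # hypothetical corpus
dataset = dataset.map(lambda s: tf.py_function(tokenize, [s], tf.int32))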
Error
One important thing I should mention here: if I change my code to load the same data using the tokenizers library, the code executes without any issues. I have also tried using the slow implementation, and the error still persists. Any help regarding this would be great!
Expected behavior
Tokenization should happen on the fly without errors as it does with the Tokenizer from the tokenizers library.
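For contrast, a sketch of the on-the-fly variant that reportedly works, using the tokenizers library directly (the vocabulary file is an assumption):

from tokenizers import BertWordPieceTokenizer

fast_tokenizer = BertWordPieceTokenizer("vocab.txt")  # hypothetical vocab file

def tokenize(sentence):
    sentence = sentence.numpy().decode("utf-8")
    return fast_tokenizer.encode(sentence).ids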