Closed: darwinharianto closed this issue 2 years ago.
Did you fix this?
I just count the tokens and append [PAD] if the count is odd, only for XLNet. I used the tokenize_function below:
def tokenize_function(examples):
    token_res = tokenizer(examples["text"], truncation=True, max_length=MAX_LENGTH)
    for i, item in enumerate(token_res["input_ids"]):
        # XLNet sequences end with <sep> <cls>; if the length is odd, insert a
        # <pad> token (id 5) just before those trailing special tokens
        if len(item) % 2 != 0:
            token_res["input_ids"][i].insert(-2, 5)
            token_res["attention_mask"][i].insert(-2, 0)
            token_res["token_type_ids"][i].insert(-2, 1)
    return token_res
from pathlib import Path
from datasets import load_dataset

raw_datasets = load_dataset('text', data_files={
    'train': [str(x) for x in Path("dataset/").glob("**/*_train.txt")],
    'test': [str(x) for x in Path("dataset/").glob("**/*_test.txt")]
})
# process dataset
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
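As a sanity check, here is a minimal sketch of how the workaround could be verified, assuming the error comes from DataCollatorForPermutationLanguageModeling (the collator whose message matches the ValueError) and that tokenizer is the XLNet tokenizer loaded earlier:

from transformers import DataCollatorForPermutationLanguageModeling

# every tokenized example should now have an even length
assert all(len(ids) % 2 == 0 for ids in tokenized_datasets["train"]["input_ids"])

# the permutation-LM collator is what rejects odd-length sequences, so a small
# batch going through it without raising confirms the workaround
data_collator = DataCollatorForPermutationLanguageModeling(tokenizer=tokenizer)
sample = tokenized_datasets["train"][:4]
batch = data_collator([{"input_ids": ids} for ids in sample["input_ids"]])
print(batch["input_ids"].shape)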
Can you please share the code you ran and the output with the error?
@darwinharianto Going to close this since I never got a reply back. Feel free to reopen it if you have more input on my previous question.
Hey, thanks for sharing such a comprehensive guide to pretraining with Hugging Face.
I have a few questions about the training procedure in your notebook: https://github.com/gmihaila/ml_things/blob/master/notebooks/pytorch/pretrain_transformers_pytorch.ipynb
When I tried to train XLNet, it threw an error:
ValueError: This collator requires that sequence lengths be even to create a leakage-free perm_mask. Please see relevant comments in source code for details.
What do you do to prevent this?
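For context, this error appears to come from the permutation language modeling data collator in transformers, which rejects odd-length sequences. A minimal sketch of how it can be reproduced, assuming tokenizer is the XLNet tokenizer from the notebook and the token ids are only illustrative:

from transformers import DataCollatorForPermutationLanguageModeling

collator = DataCollatorForPermutationLanguageModeling(tokenizer=tokenizer)

# a single example with an odd number of tokens (5 here) trips the even-length check
try:
    collator([{"input_ids": [35, 36, 37, 4, 3]}])
except ValueError as e:
    print(e)  # "This collator requires that sequence lengths be even ..."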
I tried to handle it in the tokenizer, but I don't know if that is OK.
Something like this