gmihaila / ml_things

This is where I put things I find useful that speed up my work with Machine Learning. Ever looked in your old projects to reuse those cool functions you created before? Well, this repo is designed to be a Python library of reusable functions I created in my previous projects. I also share some notebook tutorials and Python code snippets.
https://gmihaila.github.io
Apache License 2.0

Question on XLNet training #13

Closed: darwinharianto closed this issue 2 years ago

darwinharianto commented 3 years ago

Hey, thanks for sharing a comprehensive pre-training notebook for Hugging Face.

I have a few questions about the training procedure in your notebook: https://github.com/gmihaila/ml_things/blob/master/notebooks/pytorch/pretrain_transformers_pytorch.ipynb

When I tried to train XLNet, it threw this error:

ValueError: This collator requires that sequence lengths be even to create a leakage-free perm_mask. Please see relevant comments in source code for details.

What do you do to prevent this?
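
For context, this error appears to come from transformers' DataCollatorForPermutationLanguageModeling, which the notebook uses for XLNet; a minimal setup that hits the same constraint (plm_probability and max_span_length below are just the library defaults) looks roughly like this:

from transformers import DataCollatorForPermutationLanguageModeling, XLNetTokenizerFast

tokenizer = XLNetTokenizerFast.from_pretrained("xlnet-base-cased")

# XLNet is pretrained with permutation language modeling; this collator builds the
# perm_mask and raises the ValueError above whenever a padded batch has an odd length.
data_collator = DataCollatorForPermutationLanguageModeling(
    tokenizer=tokenizer,
    plm_probability=1/6,   # library default
    max_span_length=5,     # library default
)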

I tried to fix it in the tokenizer, but I don't know if it is OK. Something like this:

# input_ids value 5 is XLNet's <pad> token id; since the XLNet tokenizer ends every
# sequence with <sep> <cls>, the pad is inserted just before <sep> (index -2).
# The inserted position gets attention_mask 0 and token_type_ids 1; all other
# positions keep their previous values.
def tokenize_function(examples):
    token_res = tokenizer(examples["text"], truncation=True, max_length=MAX_LENGTH)
    for i, item in enumerate(token_res["input_ids"]):
        if len(item) % 2 != 0:
            token_res["input_ids"][i].insert(-2,5)
            token_res["attention_mask"][i].insert(-2,0)
            token_res["token_type_ids"][i].insert(-2,1)

    return token_res
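
The hard-coded 5 above is the XLNet tokenizer's <pad> id; it can also be read off the tokenizer instead of hard-coded (a small check, assuming the pretrained xlnet-base-cased tokenizer):

from transformers import XLNetTokenizerFast

tokenizer = XLNetTokenizerFast.from_pretrained("xlnet-base-cased")
# For this pretrained tokenizer the pad token is "<pad>" with id 5, so
# tokenizer.pad_token_id could replace the literal 5 in the insert above.
print(tokenizer.pad_token, tokenizer.pad_token_id)  # <pad> 5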
gmihaila commented 2 years ago

Did you fix this?

darwinharianto commented 2 years ago

I just count the tokens and append a <pad> if the length is odd, only for XLNet. I used the tokenize_function above:


# pad odd-length sequences to an even length, as required by the XLNet permutation-LM collator
def tokenize_function(examples):
    token_res = tokenizer(examples["text"], truncation=True, max_length=MAX_LENGTH)
    for i, item in enumerate(token_res["input_ids"]):
        if len(item) % 2 != 0:
            token_res["input_ids"][i].insert(-2,5)
            token_res["attention_mask"][i].insert(-2,0)
            token_res["token_type_ids"][i].insert(-2,1)

    return token_res

from pathlib import Path
from datasets import load_dataset

raw_datasets = load_dataset('text', data_files={
    'train': [str(x) for x in Path("dataset/").glob("**/*_train.txt")],
    'test': [str(x) for x in Path("dataset/").glob("**/*_test.txt")]
    })

# process dataset
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
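
A quick sanity check on the result (an untested sketch, assuming the tokenized_datasets object from above) is to confirm every sequence now has an even length, which is what the collator requires:

# Every tokenized sequence should now have an even length; an odd one would
# trigger the same ValueError again at collation time.
assert all(len(ids) % 2 == 0 for ids in tokenized_datasets["train"]["input_ids"])
assert all(len(ids) % 2 == 0 for ids in tokenized_datasets["test"]["input_ids"])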
gmihaila commented 2 years ago

Can you please share the code you ran and the output with the error?

gmihaila commented 2 years ago

@darwinharianto Going to close this since I never got a reply back. Feel free to reopen it if you have more input on my previous question.