gmihaila / ml_things

This is where I put things I find useful that speed up my work with Machine Learning. Ever looked in your old projects to reuse those cool functions you created before? Well, this repo is designed to be a Python library of reusable functions I created in my previous projects. I also share some notebook tutorials and Python code snippets.
https://gmihaila.github.io
Apache License 2.0

Question on XLNet training #13

Closed: darwinharianto closed this issue 2 years ago

darwinharianto commented 3 years ago

Hey, thanks for sharing a comprehensive pre-training notebook for Hugging Face.

I have a few questions about the training procedure in your notebook: https://github.com/gmihaila/ml_things/blob/master/notebooks/pytorch/pretrain_transformers_pytorch.ipynb

When I tried to train XLNet, it threw this error:

ValueError: This collator requires that sequence lengths be even to create a leakage-free perm_mask. Please see relevant comments in source code for details.

What do you do to prevent this?
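
For context, this error appears to come from transformers' DataCollatorForPermutationLanguageModeling, which the notebook uses for XLNet; a minimal setup that hits the same constraint (plm_probability and max_span_length below are just the library defaults) looks roughly like this:

from transformers import DataCollatorForPermutationLanguageModeling, XLNetTokenizerFast

tokenizer = XLNetTokenizerFast.from_pretrained("xlnet-base-cased")

# XLNet is pretrained with permutation language modeling; this collator builds the
# perm_mask and raises the ValueError above whenever a padded batch has an odd length.
data_collator = DataCollatorForPermutationLanguageModeling(
    tokenizer=tokenizer,
    plm_probability=1/6,   # library default
    max_span_length=5,     # library default
)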

I tried to fix it in the tokenizer, but I don't know if it is OK. Something like this:

# input_ids value 5 is XLNet's <pad> token id; since the XLNet tokenizer ends every
# sequence with <sep> <cls>, the pad is inserted just before <sep> (index -2).
# The inserted position gets attention_mask 0 and token_type_ids 1; all other
# positions keep their previous values.
def tokenize_function(examples):
    token_res = tokenizer(examples["text"], truncation=True, max_length=MAX_LENGTH)
    for i, item in enumerate(token_res["input_ids"]):
        if len(item) % 2 != 0:
            token_res["input_ids"][i].insert(-2,5)
            token_res["attention_mask"][i].insert(-2,0)
            token_res["token_type_ids"][i].insert(-2,1)

    return token_res
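
The hard-coded 5 above is the XLNet tokenizer's <pad> id; it can also be read off the tokenizer instead of hard-coded (a small check, assuming the pretrained xlnet-base-cased tokenizer):

from transformers import XLNetTokenizerFast

tokenizer = XLNetTokenizerFast.from_pretrained("xlnet-base-cased")
# For this pretrained tokenizer the pad token is "<pad>" with id 5, so
# tokenizer.pad_token_id could replace the literal 5 in the insert above.
print(tokenizer.pad_token, tokenizer.pad_token_id)  # <pad> 5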
gmihaila commented 2 years ago

Did you fix this?

darwinharianto commented 2 years ago

I just count the tokens and append a <pad> if the length is odd, only for XLNet. I used the tokenize_function above:


# pad odd-length sequences to an even length, as required by the XLNet permutation-LM collator
def tokenize_function(examples):
    token_res = tokenizer(examples["text"], truncation=True, max_length=MAX_LENGTH)
    for i, item in enumerate(token_res["input_ids"]):
        if len(item) % 2 != 0:
            token_res["input_ids"][i].insert(-2,5)
            token_res["attention_mask"][i].insert(-2,0)
            token_res["token_type_ids"][i].insert(-2,1)

    return token_res

from pathlib import Path
from datasets import load_dataset

raw_datasets = load_dataset('text', data_files={
    'train': [str(x) for x in Path("dataset/").glob("**/*_train.txt")],
    'test': [str(x) for x in Path("dataset/").glob("**/*_test.txt")]
    })

# process dataset
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
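
A quick sanity check on the result (an untested sketch, assuming the tokenized_datasets object from above) is to confirm every sequence now has an even length, which is what the collator requires:

# Every tokenized sequence should now have an even length; an odd one would
# trigger the same ValueError again at collation time.
assert all(len(ids) % 2 == 0 for ids in tokenized_datasets["train"]["input_ids"])
assert all(len(ids) % 2 == 0 for ids in tokenized_datasets["test"]["input_ids"])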
gmihaila commented 2 years ago

Can you please share the code you ran and the output with the error?

gmihaila commented 2 years ago

@darwinharianto Going to close this since I never got a reply back. Feel free to reopen it if you have more input on my previous question.