utterworks / fast-bert

Super easy library for BERT based NLP models
Apache License 2.0

learner.save_model gives KeyError while saving tokenizer/vocab file #200

Closed. mohammedayub44 closed this issue 4 years ago

mohammedayub44 commented 4 years ago

I'm trying to run the multilabel classification model, and while saving the model it gives me an error on the vocab file. learner.save_model() gives the error below: (screenshot)

Is this because I have not specified some path, or because I'm not using a local pretrained model path as in the sample notebook?

My learner config is as below: (screenshot)

My DataBunch config is as below: (screenshot)

Any help appreciated. Thanks!

aaronbriel commented 4 years ago

You need to pass save_model a Path to the directory you wish to save it to.
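
For reference, a minimal sketch of what that call could look like (assuming learner is the trained BertLearner from the notebook, and the directory name here is just a placeholder to adjust to your setup):

    from pathlib import Path

    # Hypothetical output directory; any writable location works.
    model_path = Path("model_out_manual")
    model_path.mkdir(parents=True, exist_ok=True)

    # save_model takes an optional path; if omitted it falls back to the
    # learner's output_dir / 'model_out' (see the snippet quoted later in this thread).
    learner.save_model(model_path)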

mohammedayub44 commented 4 years ago

@aaronbriel I did try passing the path to the folder and I get the same error. Is it compulsory to pass the path? If I don't pass any path to this function (as shown in the sample notebook), I see that it picks up the "OUTPUTDIR" location (maybe from the learner object) and automatically creates a "model_out" folder. It writes the following four files: pytorch_model.bin, special_tokens_map.json, config.json, tokenizer_config.json.

But it fails to create vocab.txt as mentioned in the documentation. Do you see the same issue on your end?

Thanks !

aaronbriel commented 4 years ago

I am not seeing this issue. What is the actual error?

mohammedayub44 commented 4 years ago

It just shows KeyError, which is strange. I verified my tokenizer name, model name, etc. Everything works fine except saving the model.

mohammedayub44 commented 4 years ago

@aaronbriel Here is the Dropbox link, which contains the Python notebook file along with the data and labels folders (training for only one epoch). Could you let me know if this runs and saves the model successfully on your machine, or if you hit any errors?

Appreciate the help!

vivekam101 commented 4 years ago

I also faced the same issue trying out the example at https://github.com/kaushaltrivedi/fast-bert/blob/master/test/multi_class.ipynb. A TypeError is raised while saving the tokenizer:


def save_model(self, path=None):

    if not path:
        path = self.output_dir/'model_out'

    path.mkdir(exist_ok=True)

    torch.cuda.empty_cache()
    # Save a trained model
    model_to_save = self.model.module if hasattr(self.model, 'module') else self.model  # Only save the model it-self
    model_to_save.save_pretrained(path)

    # save the tokenizer
    self.data.tokenizer.save_pretrained(path)

Note: The issue occurs with the BERT and RoBERTa models; it works fine for XLNet. Please let us know. Thanks a ton!
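
One way to narrow this down (a debugging sketch, not an official fix) is to call the tokenizer's save_pretrained directly, bypassing fast-bert, to confirm whether the TypeError comes from the transformers/tokenizers save path rather than from learner_util.py. Here databunch is assumed to be the BertDataBunch created in the notebook, so databunch.tokenizer is the same object save_model writes out above:

    from pathlib import Path

    # Hypothetical scratch directory used only for this check.
    debug_dir = Path("tokenizer_debug_out")
    debug_dir.mkdir(exist_ok=True)

    try:
        # Same call that fast-bert's save_model makes internally.
        databunch.tokenizer.save_pretrained(str(debug_dir))
        print("Tokenizer saved fine; the problem is elsewhere.")
    except TypeError as err:
        print("TypeError raised inside tokenizer.save_pretrained:", err)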

arktrail commented 4 years ago

Same for me, I also have this issue.

vivekam101 commented 4 years ago

Hi Aaron, please check the trace below.

(screenshot of the stack trace)

aaronbriel commented 4 years ago

Sorry about the delay. I was indeed able to replicate the error with my current implementation using the latest fast-bert:

    File "/home/bert/.venv/lib/python3.6/site-packages/fast_bert/learner_util.py", line 128, in save_model
        self.data.tokenizer.save_pretrained(path)
    File "/home/bert/.venv/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 605, in save_pretrained
        vocab_files = self.save_vocabulary(save_directory)
    File "/home/bert/.venv/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 1994, in save_vocabulary
        files = self._tokenizer.save(save_directory)
    File "/home/bert/.venv/lib/python3.6/site-packages/tokenizers/implementations/base_tokenizer.py", line 222, in save
        return self._tokenizer.model.save(directory, name=name)
    TypeError

arktrail commented 4 years ago

So is there any way to work around this issue? Or could we use an older version?

aaronbriel commented 4 years ago

I wasn't seeing this before the last build, so you could try with that. I was going to look into it more tomorrow.

arktrail commented 4 years ago

Cool, thank you very much!

mohammedayub44 commented 4 years ago

@aaronbriel sorry for the late reply. Thanks for taking a look into it. Looking forward to your solution.

aaronbriel commented 4 years ago

FYI, fixed in https://github.com/kaushaltrivedi/fast-bert/pull/205

mohammedayub44 commented 4 years ago

Thanks @aaronbriel, will check it out. Closing this for now.

sidPN commented 4 years ago

I'm still facing this same issue. Even with the latest fix.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-80-3ed6ed15c468> in <module>
----> 1 learner.save_model()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fast_bert/learner_util.py in save_model(self, path)
    126             self.model.module if hasattr(self.model, "module") else self.model
    127         )  # Only save the model it-self
--> 128         model_to_save.save_pretrained(path)
    129 
    130         # save the tokenizer

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils.py in save_pretrained(self, save_directory)
    603                 f.write(out_str)
    604 
--> 605         vocab_files = self.save_vocabulary(save_directory)
    606 
    607         return vocab_files + (special_tokens_map_file, added_tokens_file)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils.py in save_vocabulary(self, save_directory)
   1992     def save_vocabulary(self, save_directory):
   1993         if os.path.isdir(save_directory):
-> 1994             files = self._tokenizer.save(save_directory)
   1995         else:
   1996             folder, file = os.path.split(os.path.abspath(save_directory))

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/tokenizers/implementations/base_tokenizer.py in save(self, directory, name)
    220                 The name of the tokenizer, to be used in the saved files
    221         """
--> 222         return self._tokenizer.model.save(directory, name=name)

TypeError: 

aaronbriel commented 4 years ago

@kaushaltrivedi did you publish? The code is in place, but fast-bert has not yet been updated on PyPI.

@sidPN you can use my fork until the fast-bert library is updated (git://github.com/aaronbriel/fast-bert@master#egg=fast_bert), or just download it locally.