ml5js / ml5-data-and-models

Data sets and pre-trained models for ml5.js
https://ml5js.org/docs/data-overview
MIT License

trailing slash with --data_dir option causes confusing error #26

Closed: carlcorder closed this issue 6 years ago

carlcorder commented 6 years ago

Hello,

I was attempting to train my own model on a small corpus of text and just after the final epoch I received the following error:

...
4999/5000 (epoch 999), train_loss = 1.355, time/batch = 0.419
model saved to checkpoints/
Getting the model's vocabulary
Traceback (most recent call last):
  File "train.py", line 171, in <module>
    main()
  File "train.py", line 69, in main
    train(args)
  File "train.py", line 166, in train
    model_vocab = getModelVocab(model_name)
  File "train.py", line 73, in getModelVocab
    with open(os.path.join('checkpoints', model_name, 'chars_vocab.pkl'), 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/chars_vocab.pkl'

My checkpoints/ folder contains:

-4999.data-00000-of-00001  -4999.index  -4999.meta  checkpoint  input.txt

I read through the documentation but didn't see a reference to chars_vocab.pkl, so I'm not sure if this is a file I need to generate prior to training?

Thanks for your help!

sebastiankienzl commented 6 years ago

Hey! I had the same problem when following the instructions in the docs. For me it seems to have been caused by the trailing slash in the --data_dir=./data/my_own_data/ parameter.

The train.py script determines the name of the subfolder to create in checkpoints/ like so:

model_name = args.data_dir.split("/")[-1]

This returns an empty string when the path ends with a trailing slash, so no subfolder is created in checkpoints/. Running the command with --data_dir=./data/my_own_data (without the slash) might help you too.
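
The behaviour can be reproduced in isolation. This is only a minimal sketch: the split expression is the one quoted from train.py above, while the helper name model_name_from is hypothetical, and posixpath is used instead of os.path so the result matches the POSIX-style path in the traceback on any platform.

```python
import posixpath


def model_name_from(data_dir):
    # Same expression as the one quoted from train.py above.
    # (The function wrapper itself is hypothetical, for illustration only.)
    return data_dir.split("/")[-1]


without_slash = model_name_from("./data/my_own_data")
with_slash = model_name_from("./data/my_own_data/")

print(repr(without_slash))  # 'my_own_data'
print(repr(with_slash))     # ''

# With an empty model name, the vocabulary path points directly into
# checkpoints/ instead of a per-model subfolder.
vocab_path = posixpath.join("checkpoints", with_slash, "chars_vocab.pkl")
print(vocab_path)  # checkpoints/chars_vocab.pkl
```

That last path is the one reported in the FileNotFoundError in the original post.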

noemiino commented 6 years ago

Thank you @sebastiankienzl. I had the same issue as @carlcorder and it solved the error.

carlcorder commented 6 years ago

@sebastiankienzl That was exactly my issue. Thanks!

shiffman commented 6 years ago

I'm re-opening this issue as I think I ran into this while training some models too! While I'm glad to know the fix, I'm wondering whether we should (a) include a note in the documentation regarding this error, or (b) revise the Python code to allow for a trailing slash as well. @cvalenzuela what do you think?
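
For option (b), one possible approach is to normalize the path before taking its last component. The sketch below is an assumption about how this could be done, not the change that was actually made in train.py or any other ml5 repository; the function name model_name_from_data_dir is hypothetical.

```python
import os


def model_name_from_data_dir(data_dir):
    """Return the last path component, ignoring any trailing slashes."""
    # os.path.normpath strips trailing separators (and collapses "./" and
    # repeated slashes), so os.path.basename sees a real final component.
    return os.path.basename(os.path.normpath(data_dir))


name_plain = model_name_from_data_dir("./data/my_own_data")
name_trailing = model_name_from_data_dir("./data/my_own_data/")

print(name_plain)     # my_own_data
print(name_trailing)  # my_own_data
```

With this kind of normalization, both forms of --data_dir would map to the same subfolder name under checkpoints/.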

cvalenzuela commented 6 years ago

I'm in the process of writing a new tutorial and fixing these issues. I've created a separate repo, since I think it makes more sense to have an individual repository for each model's training.

https://github.com/ml5js/training-lstm

This issue should be solved in that repo

shiffman commented 6 years ago

Oh I like this idea! Perhaps then we should consider changing the ml5-data-and-training to just ml5-datasets? Or would it stay as data-and-training and link out to other repos?

cvalenzuela commented 6 years ago

Yep, I think ml5-datasets makes more sense, with an individual repository for each training tutorial.

cvalenzuela commented 6 years ago

A small update on this. Here's a post on how to train a model with GPU support and port the model to ml5: https://blog.paperspace.com/training-an-lstm-and-using-the-model-in-ml5-js/ It is based on https://github.com/ml5js/training-lstm