Closed: zendevil closed this issue 5 years ago.
The parameter `--dataset` you're using is incorrect. It should be `--dataset models/345M/corpusname`, assuming you ran `encode.sh your_text_file.txt 345M corpusname`; all `.npz` files should then be in `models/345M/corpusname`. Also check whether the path `models/345M/checkpoint/run1` exists and is a directory, because it might be a file or a broken symlink.
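A quick way to verify this (a minimal sketch; the path is the one from the error message, adjust as needed):

```python
import os

# Illustrative check: a regular file reports isdir() == False, and a
# broken symlink reports exists() == False but islink() == True.
path = 'models/345M/checkpoint/run1'
print('exists:', os.path.exists(path))
print('is dir:', os.path.isdir(path))
print('is symlink:', os.path.islink(path))
```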
I ran `./encode.sh ../src/dataset.txt 345M search`, and now my `models/345M/search` directory contains the following files:
```
00000001.npz 00000003.npz 00000005.npz 00000007.npz
00000002.npz 00000004.npz 00000006.npz 00000008.npz
```
But I'm still getting the same error when running:

```
PYTHONPATH=src python3 train.py --dataset models/345M/search
```
Did you check the path? It doesn't seem to exist.
```
psharma@psharma:~/gpt-2_fork/models/345M/search$ ls -lah
total 40K
drwxr-xr-x 2 psharma psharma 4.0K Jul 8 09:00 .
drwxr-xr-x 4 psharma psharma 4.0K Jul 8 09:00 ..
-rw-r--r-- 1 psharma psharma 1.3K Jul 8 09:00 00000001.npz
-rw-r--r-- 1 psharma psharma 1.3K Jul 8 09:00 00000002.npz
-rw-r--r-- 1 psharma psharma 1.3K Jul 8 09:00 00000003.npz
-rw-r--r-- 1 psharma psharma 1.3K Jul 8 09:00 00000004.npz
-rw-r--r-- 1 psharma psharma 1.3K Jul 8 09:00 00000005.npz
-rw-r--r-- 1 psharma psharma 1.3K Jul 8 09:00 00000006.npz
-rw-r--r-- 1 psharma psharma 1.5K Jul 8 09:00 00000007.npz
-rw-r--r-- 1 psharma psharma 3.2K Jul 8 09:00 00000008.npz
```
The path in the error message is `models/345M/checkpoint/run1`, not `models/345M/search`.
```
psharma@psharma:~/gpt-2_fork/models/345M$ cd checkpoint
-bash: cd: checkpoint: Not a directory
```
That's your issue. It must be a file or something. Remove or rename it and try again.
Is one of your scripts supposed to create the `checkpoint` directory? In my case, renaming the checkpoint file and running `encode.py` hasn't created the `checkpoint` directory.
It will be created automatically. Your problem is that a file with that name already exists and, obviously, can't be traversed as a directory. I guess you've copied the original 117M model, which contains such a file. You shouldn't do that, because you're training a completely new model from scratch. All you need is the `sp.*` files, `hparams.json`, and your encoded `.npz` files in a directory.
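For reference, a minimal sketch of what the training script presumably does on startup (an assumption; the actual code in `train.py` may differ):

```python
import os

# Hypothetical sketch: ensure the run directory exists before saving
# checkpoints. If "checkpoint" already exists as a regular file,
# os.makedirs raises an error instead, which matches the problem above.
checkpoint_dir = os.path.join('models', '345M', 'checkpoint', 'run1')
os.makedirs(checkpoint_dir, exist_ok=True)
```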
> I guess you've copied the original 117M model that contains such a file.
I deleted all contents of my `models/345M` directory and put the following in it:

```
hparams.json sp.model sp.vocab
```

where `hparams.json`, `sp.model`, and `sp.vocab` are generated by your `createspmodel.sh` script.
Then I ran `./encode.sh ../src/dataset.txt 345M search`, and `models/345M` now looks like:

```
hparams.json search sp.model sp.vocab
```

There's no `checkpoint` directory.
There shouldn't be one, just as I said; it will be created automatically. Is `train.py` working now?
Thank you! When you said it would be created automatically, I initially thought you meant automatically by running `encode.sh`. Yes, the training seems to have started...
@rkfg By the way, do you know how one can freeze the `.ckpt` files into a single `.pb` file, so that the model can be loaded from that `.pb` file rather than from the `.ckpt` files?
No idea, don't even know what that `.pb` format is.
`.pb` is Google's Protobuf format: https://stackoverflow.com/questions/44516609/tensorflow-what-is-the-relationship-between-ckpt-file-and-ckpt-meta-and-ckp. For deploying these models on services like AWS SageMaker, one must convert them to the `.pb` format. However, the tutorials I followed for converting between the formats didn't seem to work. If you find a way to freeze the model to `.pb` format and then load it back for inference, you'll be my hero.
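In case it helps, a minimal TF 1.x freezing sketch, assuming the checkpoint lives in `models/345M/checkpoint/run1` and that `output` is the name of the graph's output node (both are assumptions; adjust them to the actual GPT-2 graph):

```python
import tensorflow as tf

# Hedged sketch (TF 1.x): load the latest checkpoint, bake the variables
# into constants, and serialize the resulting GraphDef as a single .pb.
ckpt = tf.train.latest_checkpoint('models/345M/checkpoint/run1')
saver = tf.train.import_meta_graph(ckpt + '.meta')
with tf.Session() as sess:
    saver.restore(sess, ckpt)
    # 'output' is a placeholder for the real output node name(s).
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), ['output'])
with tf.gfile.GFile('model.pb', 'wb') as f:
    f.write(frozen.SerializeToString())
```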
Thanks for the info. I don't use cloud services for my deep learning hobby projects; they're usually too expensive compared to just using my own hardware plus the electricity bill. But good luck with your experiments!
Anyway, after training the model, I moved the `models` directory into `src` and ran `python3 interactive_conditional_samples.py`, but I get the following error:
```
2019-07-08 10:13:55.318670: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-08 10:13:55.324985: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-07-08 10:13:55.326088: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55d5fcdee0a0 executing computations on platform Host. Devices:
2019-07-08 10:13:55.326139: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
Traceback (most recent call last):
  File "interactive_conditional_samples.py", line 102, in <module>
    fire.Fire(interact_model)
  File "/home/psharma/.local/lib/python3.5/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/psharma/.local/lib/python3.5/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/psharma/.local/lib/python3.5/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "interactive_conditional_samples.py", line 70, in interact_model
    saver.restore(sess, ckpt)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1264, in restore
    raise ValueError("Can't load save_path when it is None.")
ValueError: Can't load save_path when it is None.
```
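For context, `saver.restore` raises this when the checkpoint path it receives is `None`, which is what `tf.train.latest_checkpoint` returns when no usable checkpoint is found at the expected location (a sketch, assuming the script resolves the checkpoint this way):

```python
import tensorflow as tf

# If the directory has no valid "checkpoint" index file (or the model
# files are incomplete), latest_checkpoint returns None, and the later
# saver.restore(sess, None) raises "Can't load save_path when it is None."
ckpt = tf.train.latest_checkpoint('models/345M/checkpoint/run1')
print(ckpt)  # -> None when no usable checkpoint is found
```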
Why would you move the directory? Everything is already where it should be. Just run the script using `python3 src/interactive_conditional_samples.py`.
I reverted the `models` folder back to its original location and ran `python3 src/interactive_conditional_samples.py`, but I still get the same error.

--- edit ---

My `345M/checkpoint/run1` directory contains:

```
events.out.tfevents.1562578405.psharma model-57.data-00000-of-00001 model-57.index
```
--- edit --- The error is gone now. I think that while quitting the model I had pressed Ctrl+C too many times, so the `.meta` file wasn't created.
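For context, in TF 1.x a single save emits the `.index`, `.data-*`, and `.meta` files together, so interrupting the process mid-save can leave the set incomplete, as in the directory listing above (a self-contained sketch, not the actual `train.py` code):

```python
import os
import tensorflow as tf

# Sketch: Saver.save writes model-57.index, model-57.data-00000-of-00001
# and model-57.meta as one set. Killing the process mid-save (e.g. with
# repeated Ctrl+C) can leave .meta missing, breaking the later restore.
os.makedirs('models/345M/checkpoint/run1', exist_ok=True)
v = tf.Variable(0, name='v')
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, 'models/345M/checkpoint/run1/model', global_step=57)
```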
Update: this script exports a TF checkpoint to `.pb`; I think you can adapt it to your needs for GPT-2.
I am assuming that `token_count` means the number of space-separated words.
When running `train.py` using the following command: