Closed: zendevil closed this issue 5 years ago.
The parameter `--dataset` you're using is incorrect. It should be `--dataset models/345M/corpusname`, assuming you ran `encode.sh your_text_file.txt 345M corpusname`; all `.npz` files should then be in `models/345M/corpusname`. Also check whether the path `models/345M/checkpoint/run1` exists and is a directory, because it might be a file or a broken symlink.
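A quick way to verify this (a minimal sketch; the path is the one from the error message, adjust as needed):

```python
import os

# Illustrative check: a regular file reports isdir() == False, and a
# broken symlink reports exists() == False but islink() == True.
path = 'models/345M/checkpoint/run1'
print('exists:', os.path.exists(path))
print('is dir:', os.path.isdir(path))
print('is symlink:', os.path.islink(path))
```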
I ran `./encode.sh ../src/dataset.txt 345M search`, and now my `models/345M/search` directory contains the following files:
```
00000001.npz 00000003.npz 00000005.npz 00000007.npz
00000002.npz 00000004.npz 00000006.npz 00000008.npz
```
But I'm still getting the same error when running:

```
PYTHONPATH=src python3 train.py --dataset models/345M/search
```
Did you check the path? It doesn't seem to exist.
```
psharma@psharma:~/gpt-2_fork/models/345M/search$ ls -lah
total 40K
drwxr-xr-x 2 psharma psharma 4.0K Jul 8 09:00 .
drwxr-xr-x 4 psharma psharma 4.0K Jul 8 09:00 ..
-rw-r--r-- 1 psharma psharma 1.3K Jul 8 09:00 00000001.npz
-rw-r--r-- 1 psharma psharma 1.3K Jul 8 09:00 00000002.npz
-rw-r--r-- 1 psharma psharma 1.3K Jul 8 09:00 00000003.npz
-rw-r--r-- 1 psharma psharma 1.3K Jul 8 09:00 00000004.npz
-rw-r--r-- 1 psharma psharma 1.3K Jul 8 09:00 00000005.npz
-rw-r--r-- 1 psharma psharma 1.3K Jul 8 09:00 00000006.npz
-rw-r--r-- 1 psharma psharma 1.5K Jul 8 09:00 00000007.npz
-rw-r--r-- 1 psharma psharma 3.2K Jul 8 09:00 00000008.npz
```
The path in the error message is `models/345M/checkpoint/run1`, not `models/345M/search`.
```
psharma@psharma:~/gpt-2_fork/models/345M$ cd checkpoint
-bash: cd: checkpoint: Not a directory
```
That's your issue. It must be a file or something. Remove or rename it and try again.
Is one of your scripts supposed to create the `checkpoint` directory? In my case, renaming the checkpoint file and running `encode.py` hasn't created the `checkpoint` directory.
It will be created automatically. Your problem is that a file with that name already exists and, obviously, can't be traversed as a directory. I guess you've copied the original 117M model, which contains such a file. You shouldn't do that, because you're training a completely new model from scratch. All you need is the `sp.*` files, `hparams.json`, and your encoded `.npz` files in a directory.
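For reference, a minimal sketch of what the training script presumably does on startup (an assumption; the actual code in `train.py` may differ):

```python
import os

# Hypothetical sketch: ensure the run directory exists before saving
# checkpoints. If "checkpoint" already exists as a regular file,
# os.makedirs raises an error instead, which matches the problem above.
checkpoint_dir = os.path.join('models', '345M', 'checkpoint', 'run1')
os.makedirs(checkpoint_dir, exist_ok=True)
```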
> I guess you've copied the original 117M model that contains such a file.
I deleted all contents of my `models/345M` directory and put the following in it:

```
hparams.json sp.model sp.vocab
```

where `hparams.json`, `sp.model`, and `sp.vocab` are generated by your `createspmodel.sh` script.
Then I ran `./encode.sh ../src/dataset.txt 345M search`, and `models/345M` now looks like:

```
hparams.json search sp.model sp.vocab
```

There's no `checkpoint` directory.
There shouldn't be one, just as I said; it will be created automatically. Is `train.py` working now?
Thank you! When you said it would be created automatically, I initially thought you meant automatically by running `encode.sh`. Yes, the training seems to have started...
@rkfg By the way, do you know how one can freeze the `.ckpt` files into a single `.pb` file, so that the model can be loaded from that `.pb` file rather than from the `.ckpt` files?
No idea, don't even know what that `.pb` format is.
`.pb` is Google's Protobuf format: https://stackoverflow.com/questions/44516609/tensorflow-what-is-the-relationship-between-ckpt-file-and-ckpt-meta-and-ckp. For deploying these models on services like AWS SageMaker, one must convert them to the `.pb` format. However, the tutorials I followed for converting between the formats didn't seem to work. If you find a way to freeze the model to `.pb` format and then load it back for inference, you'll be my hero.
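In case it helps, a minimal TF 1.x freezing sketch, assuming the checkpoint lives in `models/345M/checkpoint/run1` and that `output` is the name of the graph's output node (both are assumptions; adjust them to the actual GPT-2 graph):

```python
import tensorflow as tf

# Hedged sketch (TF 1.x): load the latest checkpoint, bake the variables
# into constants, and serialize the resulting GraphDef as a single .pb.
ckpt = tf.train.latest_checkpoint('models/345M/checkpoint/run1')
saver = tf.train.import_meta_graph(ckpt + '.meta')
with tf.Session() as sess:
    saver.restore(sess, ckpt)
    # 'output' is a placeholder for the real output node name(s).
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), ['output'])
with tf.gfile.GFile('model.pb', 'wb') as f:
    f.write(frozen.SerializeToString())
```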
Thanks for the info. I don't use cloud services for my deep learning hobby projects; they're usually too expensive compared to just using my own hardware plus the electricity bill. But good luck with your experiments!
Anyway, after training the model, I moved the `models` directory into `src` and ran `python3 interactive_conditional_samples.py`, but I get the following error:
```
2019-07-08 10:13:55.318670: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-08 10:13:55.324985: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-07-08 10:13:55.326088: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55d5fcdee0a0 executing computations on platform Host. Devices:
2019-07-08 10:13:55.326139: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
Traceback (most recent call last):
  File "interactive_conditional_samples.py", line 102, in <module>
    fire.Fire(interact_model)
  File "/home/psharma/.local/lib/python3.5/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/psharma/.local/lib/python3.5/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/psharma/.local/lib/python3.5/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "interactive_conditional_samples.py", line 70, in interact_model
    saver.restore(sess, ckpt)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1264, in restore
    raise ValueError("Can't load save_path when it is None.")
ValueError: Can't load save_path when it is None.
```
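For context, `saver.restore` raises this when the checkpoint path it receives is `None`, which is what `tf.train.latest_checkpoint` returns when no usable checkpoint is found at the expected location (a sketch, assuming the script resolves the checkpoint this way):

```python
import tensorflow as tf

# If the directory has no valid "checkpoint" index file (or the model
# files are incomplete), latest_checkpoint returns None, and the later
# saver.restore(sess, None) raises "Can't load save_path when it is None."
ckpt = tf.train.latest_checkpoint('models/345M/checkpoint/run1')
print(ckpt)  # -> None when no usable checkpoint is found
```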
Why would you move the directory? Everything is already where it should be. Just run the script using `python3 src/interactive_conditional_samples.py`.
I reverted the `models` folder back to its original location and ran `python3 src/interactive_conditional_samples.py`, but I still get the same error.

--- edit ---

My `345M/checkpoint/run1` directory contains:

```
events.out.tfevents.1562578405.psharma model-57.data-00000-of-00001 model-57.index
```
--- edit --- The error is gone now. I think that while quitting the model I had pressed Ctrl+C too many times, so the `.meta` file wasn't created.
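For context, in TF 1.x a single save emits the `.index`, `.data-*`, and `.meta` files together, so interrupting the process mid-save can leave the set incomplete, as in the directory listing above (a self-contained sketch, not the actual `train.py` code):

```python
import os
import tensorflow as tf

# Sketch: Saver.save writes model-57.index, model-57.data-00000-of-00001
# and model-57.meta as one set. Killing the process mid-save (e.g. with
# repeated Ctrl+C) can leave .meta missing, breaking the later restore.
os.makedirs('models/345M/checkpoint/run1', exist_ok=True)
v = tf.Variable(0, name='v')
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, 'models/345M/checkpoint/run1/model', global_step=57)
```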
Update: this script exports a TF checkpoint to `.pb`; I think you can adapt it to your needs for GPT-2.
I am assuming that `token_count` means the number of space-separated words.
When running `train.py` using the following command: