Checkpoint not generating

Gurkiratsinghk commented 3 years ago

I ran GPT-2's train.py on a .txt training file containing 3 stories, using the 117M-parameter model. Training runs, but when it stops it only creates a checkpoint folder with a run1 folder inside it, and none of these files are generated:

checkpoint
model-xxx.data-00000-of-00001
model-xxx.index
model-xxx.meta
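A quick way to confirm which of these files exist is to list them by pattern. The sketch below is a minimal check, not part of the repository; the helper name `list_checkpoint_files` and the `checkpoint/run1` path are assumptions based on the folder names mentioned in this post.

```python
import fnmatch
import os

# Filename patterns for the four kinds of checkpoint files listed above.
PATTERNS = {
    "checkpoint": "checkpoint",
    "data": "model-*.data-*",
    "index": "model-*.index",
    "meta": "model-*.meta",
}

def list_checkpoint_files(run_dir):
    """Return a dict mapping each kind of checkpoint file to the matching names.

    Hypothetical helper: a missing directory is reported as having no files.
    """
    names = os.listdir(run_dir) if os.path.isdir(run_dir) else []
    return {
        kind: sorted(fnmatch.filter(names, pattern))
        for kind, pattern in PATTERNS.items()
    }

# Assumed location, taken from the folder names mentioned in this thread.
for kind, found in list_checkpoint_files(os.path.join("checkpoint", "run1")).items():
    print(kind, found if found else "MISSING")
```

If every entry prints MISSING, no save had happened before training stopped.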

Use standard file APIs to check for files with this prefix.
Loading dataset...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 19.27it/s]

dataset has 12863 tokens

Training...

[1 | 22.35] loss=3.69 avg=3.69
[2 | 40.40] loss=3.48 avg=3.58
[3 | 72.00] loss=3.34 avg=3.50
[4 | 91.34] loss=3.45 avg=3.49
[5 | 111.14] loss=3.32 avg=3.45
[6 | 130.68] loss=3.63 avg=3.48
[7 | 146.00] loss=3.35 avg=3.46
[8 | 164.12] loss=3.33 avg=3.45
[9 | 187.81] loss=3.44 avg=3.45
[10 | 212.46] loss=3.41 avg=3.44
[11 | 238.91] loss=3.35 avg=3.43
[12 | 265.70] loss=3.07 avg=3.40
[13 | 286.85] loss=3.36 avg=3.40
[14 | 309.50] loss=3.32 avg=3.39
[15 | 327.70] loss=3.26 avg=3.38
[16 | 344.01] loss=3.22 avg=3.37
[17 | 358.19] loss=3.41 avg=3.37
[18 | 371.93] loss=2.95 avg=3.35
[19 | 386.32] loss=3.19 avg=3.34
[20 | 400.90] loss=3.51 avg=3.35
[21 | 415.34] loss=3.06 avg=3.33
[22 | 430.17] loss=3.47 avg=3.34
[23 | 444.54] loss=3.06 avg=3.33

forrtl: error (200): program aborting due to control-C event

Image              PC                Routine  Line     Source
libifcoremd.dll    00007FFD7D033B58  Unknown  Unknown  Unknown
KERNELBASE.dll     00007FFDC9D6B443  Unknown  Unknown  Unknown
KERNEL32.DLL       00007FFDCC487034  Unknown  Unknown  Unknown
ntdll.dll          00007FFDCC5BD241  Unknown  Unknown  Unknown

What should I do?

Gurkiratsinghk commented 3 years ago

I have downloaded and deleted the file 7 times

jaimu97 commented 3 years ago

Assuming you're using the train.py from nshepperd's fork, try running it with --save_every N, where N is the number of steps between automatic saves (default 1000).

For example: python train.py --dataset data.npz --save_every 10
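To see why the interval matters here, the sketch below lists the steps at which a periodic save would fire. It is only an illustration and assumes saves happen exactly on multiples of the interval; it is not the fork's actual training loop, and `steps_saved` is a hypothetical helper.

```python
def steps_saved(total_steps, save_every):
    """Return the step numbers at which a periodic save would fire.

    Assumption: a save happens only when the step count is a multiple
    of save_every, and nothing is saved when the run is aborted.
    """
    return [step for step in range(1, total_steps + 1) if step % save_every == 0]

# The run in the original post was aborted after step 23.
print(steps_saved(23, 1000))  # [] : the default interval is never reached
print(steps_saved(23, 10))    # [10, 20]
```

Under that assumption, a run aborted at step 23 leaves nothing on disk with the default interval, while an interval of 10 would already have written two checkpoints.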

Gurkiratsinghk commented 3 years ago

Traceback (most recent call last):
  File "interactive_conditional_samples.py", line 89, in <module>
    fire.Fire(interact_model)
  File "H:\Anaconda\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "H:\Anaconda\lib\site-packages\fire\core.py", line 471, in _Fire
    target=component.__name__)
  File "H:\Anaconda\lib\site-packages\fire\core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "interactive_conditional_samples.py", line 45, in interact_model
    enc = encoder.get_encoder(model_name)
  File "U:\gpt-2\gpt-2\encoder.py", line 110, in get_encoder
    encoder = json.load(f)
  File "H:\Anaconda\lib\json\__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "H:\Anaconda\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "H:\Anaconda\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "H:\Anaconda\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

A new error has popped up in its place.
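The traceback ends in json.load inside encoder.get_encoder, and "Expecting value: line 1 column 1 (char 0)" is what Python's json module reports when the input is empty or does not start with valid JSON. A minimal sketch for checking the file being loaded follows; the helper name `check_json_file` is hypothetical, and the path in the usage comment is an assumption, since the thread does not say where the encoder file ended up.

```python
import json
import os

def check_json_file(path):
    """Return a short diagnosis for the JSON file at path.

    Hypothetical helper: reports 'missing', 'empty', 'invalid: ...' or 'ok'.
    """
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    try:
        with open(path, "r", encoding="utf-8") as f:
            json.load(f)
    except (json.JSONDecodeError, UnicodeDecodeError) as exc:
        return f"invalid: {exc}"
    return "ok"

# Usage (assumed path; point it at the encoder JSON file your script loads):
# print(check_json_file(os.path.join("models", "117M", "encoder.json")))
```

A result of "empty" or "invalid" would suggest the file was damaged or replaced when the files were moved, rather than a problem with the training command itself.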

jaimu97 commented 3 years ago

Did you change the "data.npz" to point to where your dataset is? Or better yet, try running the same train.py command as in the original post and just add --save_every 10 to the end of that.

Gurkiratsinghk commented 3 years ago

Actually, I collected all the files in a single folder. When I run the command you suggested, it gives the JSON-related error I posted above.

JXCrazy commented 2 years ago

Following up on this, I have a question: do the checkpoints consist of one or more .ckpt files?

openai / gpt-2

Checkpoint not generating #285