Open Stamenov opened 5 years ago
Hi, it's possible to resume training from a checkpoint (which is effectively the same functionality as fine-tuning), but it's not possible to fine-tune the original GPT-2 model, because the tokenizer is different.
I am currently looking at these models to finetune, which were trained with this repo, or at least a fork of it. So I guess simply resuming with the tf version would suffice. Thanks.
Oh nice, thanks for sharing the link. Then yes, fine-tuning should work.
Now that I think about it, I am not sure whether the models are TF or PyTorch. Is there a way to find out, given the model files? Thanks again!
These are pytorch models, which is good, because TF code is not really supported, while pytorch code is better developed and supported.
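One rough way to answer the "TF or PyTorch?" question from the file names alone is sketched below. The helper name and the suffix lists are assumptions based on common naming conventions (`*.pt` for PyTorch, `*.ckpt.*` plus a `checkpoint` file for TensorFlow), not something this repository defines, so treat the result as a hint rather than proof.

```python
from pathlib import Path

# Hypothetical helper; the suffix sets below are assumptions based on common
# conventions, not on anything this repository guarantees.
TORCH_SUFFIXES = {".pt", ".pth"}
TF_SUFFIXES = {".index", ".meta", ".ckpt", ".pb"}


def guess_framework(filenames):
    """Return 'pytorch', 'tensorflow', 'mixed' or 'unknown' for a list of file names."""
    found = set()
    for name in filenames:
        path = Path(name)
        suffixes = set(path.suffixes)
        if suffixes & TORCH_SUFFIXES:
            found.add("pytorch")
        # TF checkpoints usually also ship a plain file called "checkpoint"
        # and shards such as "model.ckpt.data-00000-of-00001".
        if (suffixes & TF_SUFFIXES) or path.name == "checkpoint" or ".data-" in path.name:
            found.add("tensorflow")
    if len(found) == 2:
        return "mixed"
    return found.pop() if found else "unknown"


if __name__ == "__main__":
    print(guess_framework(["model.pt", "optim.pt", "params.json", "sp-model.model"]))
```

Passing the output of `os.listdir()` for the model directory is enough; the function never opens the files.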
Cool, will try on the weekend, thanks for the blazing fast responses 🥇
hey, cool - thanks for trying out my GPT-2 models! would be happy to hear your feedback on these.
the larger GPT-2 model is still training, so if you want I can provide an updated model this week which should have slightly lower loss than the one released so far.
Hey @gooofy, this would be very cool, please do! Thanks.
Hi, it's me again. I am not sure this is the right thread to follow up, so feel free to move it/let me know. I am trying to start an adaptation from the 355M German model, but I seem to get a mismatch in the layer sizes. I guess I need the hyperparameters from the initial training. This is the hyperparameter configuration I get at the beginning of the training:
"batch_size": 2,
"epochs": 10,
"g_accum_gradients": 1,
"hparams": {
    "gradient_checkpointing": false,
    "n_ctx": 1024,
    "n_embed": 768,
    "n_head": 12,
    "n_hidden": 768,
    "n_layer": 12,
    "n_vocab": 50000
},
"lr": 0.00025
}
Loading dataset from /bpe
Train dataset has 419,935 tokens
Validation dataset has 3,608 tokens
And this is a small part of the size mismatch errors:
size mismatch for blocks.11.attn.c_attn.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([2304, 768]).
size mismatch for blocks.11.attn.c_attn.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([2304]).
size mismatch for blocks.11.attn.c_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for blocks.11.attn.c_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for ln_f.g: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for ln_f.b: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
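The shapes in the error message above already reveal the checkpoint's embedding size: `c_attn` packs query, key and value, so 3072 = 3 × 1024 on the checkpoint side and 2304 = 3 × 768 in the freshly built model. A minimal sketch of that inference follows; `infer_hparams` is a hypothetical helper that works on a plain name-to-shape mapping, and the layer count it reports only reflects the parameter names it is given.

```python
import re


def infer_hparams(shapes):
    """Infer n_embed and n_layer from a mapping of parameter name -> shape."""
    n_embed = None
    layer_ids = set()
    for name, shape in shapes.items():
        match = re.match(r"blocks\.(\d+)\.", name)
        if match:
            layer_ids.add(int(match.group(1)))
        # c_attn packs query, key and value, so its first dim is 3 * n_embed.
        if name.endswith("attn.c_attn.weight"):
            if shape[0] != 3 * shape[1]:
                raise ValueError(f"unexpected c_attn layout: {shape}")
            n_embed = shape[1]
    # Highest block index + 1; only meaningful if all blocks are present.
    n_layer = max(layer_ids) + 1 if layer_ids else None
    return {"n_embed": n_embed, "n_layer": n_layer}


# Checkpoint-side shapes copied from the error excerpt above.
# The excerpt only shows block 11, so n_layer is a lower bound here.
checkpoint_shapes = {
    "blocks.11.attn.c_attn.weight": (3072, 1024),
    "blocks.11.attn.c_proj.weight": (1024, 1024),
    "ln_f.g": (1024,),
}
print(infer_hparams(checkpoint_shapes))
```

Fed with the shapes of a complete state dict instead of this excerpt, the same function would report the real number of blocks.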
Right, on each invocation you'll need to set all hyperparameters, and the error is indeed due to a hyperparameter mismatch. I think the correct hyperparameters should be in the params.json file which comes with the model; unfortunately, we currently can't load them automatically.
Is there a CLI for the hyperparams? I can't seem to find one.
Yes, it's defined implicitly via the fire library, so all arguments of main are settable via command-line arguments. Also, params.json should contain the full argument string, which can serve as an example.
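Since the hyperparameters are not loaded automatically, one workaround is to rebuild the flags from a saved params.json. The sketch below is only an illustration: `flags_from_params` is a hypothetical helper, the JSON text holds illustrative values, and the flag whitelist is limited to flags that appear on command lines in this thread, because it is not confirmed here that every key in the file is accepted by the CLI.

```python
import json

# Flags seen on command lines in this thread; other keys are skipped because
# it is an open question here whether the CLI accepts them.
KNOWN_FLAGS = ("n_embed", "n_head", "n_layer", "batch_size")


def flags_from_params(params):
    """Turn a parsed params.json dict into a list of --key=value flags."""
    merged = {**params, **params.get("hparams", {})}
    return [f"--{key}={merged[key]}" for key in KNOWN_FLAGS if key in merged]


# Illustrative file content with the same layout as a params.json.
params_text = """
{"batch_size": 3, "epochs": 10,
 "hparams": {"n_embed": 1024, "n_head": 16, "n_layer": 24, "n_vocab": 50000},
 "lr": 0.00025}
"""
print(" ".join(flags_from_params(json.loads(params_text))))
# prints: --n_embed=1024 --n_head=16 --n_layer=24 --batch_size=3
```

The model-shape flags have to match the checkpoint; `batch_size` is included only because it appears in the file and may well be chosen differently for fine-tuning.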
I guess resuming is also implicit whenever there are *.pt files in the model directory; furthermore, params.json is overwritten on each invocation with the current parameters.
Indeed, resuming is implicit here: https://github.com/lopuhin/transformer-lm/blob/fa3f529ff300a30cd984ea72a4e23b525b6e3f52/lm/main.py#L139-L140 and right, params.json file will be overwritten, which is not great.
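Because params.json gets overwritten on every invocation, it may be worth copying it aside before resuming. A minimal sketch, assuming the file sits directly in the run directory; `backup_params` is a hypothetical helper and not part of the repository.

```python
import shutil
import time
from pathlib import Path


def backup_params(run_dir):
    """Copy params.json to a timestamped sibling file; return the new path or None."""
    source = Path(run_dir) / "params.json"
    if not source.exists():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = source.with_name(f"params.{stamp}.json")
    shutil.copy2(source, target)  # the original file is left in place
    return target
```

Calling `backup_params("de345-root")` before launching the training command would preserve the original hyperparameters and argument string for later reference.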
Hey, I am running into some further problems trying to resume from the big German model, even after I set the params. I would appreciate any help. Also, again, I am running the latest version from @gooofy 's fork:
Train dataset has 419,935 tokens
Validation dataset has 3,608 tokens
Resuming from seen_tokens 2,835,581,952
epochs: 6787it [00:00, 41376077.40it/s]
| 0/417792 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/martin/miniconda/envs/topics/bin/gpt-2", line 11, in <module>
    load_entry_point('lm', 'console_scripts', 'gpt-2')()
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 322, in fire_main
    fire.Fire(only_allow_defined_args(main))
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/fire_utils.py", line 30, in _return_wrapped
    return function_to_decorate(*args, **kwargs)
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 259, in main
    train()
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 213, in train
    validate()
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 219, in validate
    valid_loss=get_valid_loss())
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 233, in get_valid_loss
    logits = model(ctx)['logits']
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/model.py", line 53, in forward
    h, present = torch.utils.checkpoint.checkpoint(block, h, past[:, i] if past is not None else None)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 128, in checkpoint
    return CheckpointFunction.apply(function, *args)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 34, in forward
    check_backward_validity(args)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 20, in check_backward_validity
    if not any(inp.requires_grad for inp in inputs):
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 20, in <genexpr>
    if not any(inp.requires_grad for inp in inputs):
AttributeError: 'NoneType' object has no attribute 'requires_grad'
epochs: 6787it [01:16, 88.50it/s]
7%|██████▍
Hmm I see, this looks related to gradient checkpointing (which I didn't get a chance to try yet), I wonder if it will work if you disable it? Could be something else as well, hard to tell, sorry.
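The traceback shows the failing expression, `any(inp.requires_grad for inp in inputs)`, being applied to inputs that include `None` (the `past` argument). The torch-free sketch below reproduces just that expression with a stand-in class, to illustrate why a `None` input trips it when the first input does not require gradients, which is plausibly the case during validation. `FakeTensor` and `guarded_check` are hypothetical names for illustration, not torch APIs.

```python
class FakeTensor:
    """Stand-in for a tensor; it only carries the requires_grad attribute."""

    def __init__(self, requires_grad):
        self.requires_grad = requires_grad


def unguarded_check(inputs):
    # Same expression as the line shown in the traceback.
    return any(inp.requires_grad for inp in inputs)


def guarded_check(inputs):
    # Hypothetical variant that skips None entries.
    return any(inp.requires_grad for inp in inputs if inp is not None)


# Mirrors checkpoint(block, h, None) with h not requiring gradients.
inputs = (FakeTensor(False), None)
try:
    unguarded_check(inputs)
except AttributeError as exc:
    print("unguarded:", exc)
print("guarded:", guarded_check(inputs))
```

If the first input did require gradients, `any()` would short-circuit before reaching `None`, which would be consistent with the error only showing up on the validation path.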
here is the command line I am using for training this model - does this help?
gpt-2 de345-root data/encoded-de sp-model.model --n_embed=1024 --n_head=16 --n_layer=24 --batch_size=3 --gradient_checkpointing --save_every=5000
params.json:
{
    "argv": "/home/bofh/projects/ai/torch/bin/gpt-2 de345-root data/encoded-de sp-model.model --n_embed=1024 --n_head=16 --n_layer=24 --batch_size=3 --gradient_checkpointing --save_every=5000",
    "batch_size": 3,
    "epochs": 10,
    "g_accum_gradients": 1,
    "hparams": {
        "gradient_checkpointing": true,
        "n_ctx": 1024,
        "n_embed": 1024,
        "n_head": 16,
        "n_hidden": 1024,
        "n_layer": 24,
        "n_vocab": 50000
    },
    "lr": 0.00025
}
I have now disabled the gradient checkpointing, and I get stuck at the same place, but no error this time:
--n_embed=1024 --n_head=16 --n_layer=24 --batch_size=3 --gradient_checkpointing=0 --save_every=5000",
"batch_size": 3,
"epochs": 10,
"g_accum_gradients": 1,
"hparams": {
    "gradient_checkpointing": 0,
    "n_ctx": 1024,
    "n_embed": 1024,
    "n_head": 16,
    "n_hidden": 1024,
    "n_layer": 24,
    "n_vocab": 50000
},
"lr": 0.00025
}
Loading dataset from /bpe
Train dataset has 419,935 tokens
Validation dataset has 3,608 tokens
Resuming from seen_tokens 2,835,581,952
epochs: 6787it [01:17, 87.53it/s]
7%|███████▍ | 27648/417792 [01:17<18:14, 356.55it/s]
just a wild guess: maybe you're using a different torch version?
lm 0.1.0 /home/bofh/projects/ai/torch/transformer-lm
pytorch-pretrained-bert 0.6.2
torch 1.2.0a0+6f6a680
Aahh, I am indeed.
pytorch-pretrained-bert 0.6.1
torch 1.0.1.post2
Just wondering, which CUDA version do you use? 10.0?
yes, 10.0
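When comparing environments like this, it can help to print the relevant package versions in one go. The sketch below uses only the standard library (`importlib.metadata`, which needs Python 3.8 or newer, so newer than the interpreters mentioned in this thread); `installed_versions` is a hypothetical helper and the package list is just the set of names that show up in this discussion.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["torch", "sentencepiece", "fire", "lm"]).items():
        print(f"{name:15} {version}")
```

With torch installed, `torch.__version__` and `torch.version.cuda` additionally report the build and the CUDA version it was compiled against.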
new release has finished uploading, available here: https://zamia.org/brain/
trained for 4.5 epochs on 27GB text corpus
Hi, for some reason, even after I installed pytorch 1.2.0, cuda 10, conda with python 3.7.4 and nvidia drivers "NVIDIA-SMI 410.104", the training just quits after 7% with no error message, similarly to my previous post.
attrs 19.1.0
certifi 2019.6.16
cycler 0.10.0
filelock 3.0.12
fire 0.1.3
json-lines 0.5.0
json-log-plots 0.0.1
kiwisolver 1.1.0
lm 0.1.0 /home/martin/dev/gtp2/gpt-2-german/transformer-lm
matplotlib 3.0.3
numpy 1.16.2
pip 19.2.2
pyparsing 2.4.2
python-dateutil 2.8.0
sentencepiece 0.1.8
setuptools 41.0.1
six 1.12.0
torch 1.2.0
tqdm 4.31.1
wheel 0.33.4
@Stamenov I wonder if this could be a bug in the resume code, I didn't test it that much. Does the progress bar jump to 7% immediately, or does it get there after some time? There is no error message printed, right? Can you check the exit code? Also I wonder if training from scratch will work for you (to narrow down the issue)?
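For the "check the exit code" step, one option is to launch the training command from a small wrapper that prints the return code once the process ends. A minimal sketch using only the standard library; `run_and_report` is a hypothetical helper, and the command in the demo is a stand-in to be replaced with the real gpt-2 invocation.

```python
import subprocess
import sys


def run_and_report(command):
    """Run a command, let its output pass through, and return its exit code."""
    completed = subprocess.run(command)
    print(f"exit code: {completed.returncode}")
    return completed.returncode


if __name__ == "__main__":
    # Stand-in command that exits with code 3; replace it with the actual
    # training command, e.g. ["gpt-2", "<run dir>", "<dataset>", "<sp model>", ...].
    run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"])
```

An exit code of 0 with no error output would suggest the program believes it finished normally, while a negative return code would point to termination by a signal.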
I think I resumed training for this model several times over the weeks and never noticed any issue. There is however still this so far unexplained loss spike that happened pretty early in the training, not sure if this could be related.
@lopuhin It does take some time to get there. It also just jumps to 7% from 0, after using my GPU for some time (reported using nvidia-smi) and after briefly showing a "0/3 validation" progress bar, just below the overall progress bar.
Training from scratch works, but with the default params only. With the ones from the german model, as supplied by @gooofy, I get CUDA out of memory error. Maybe this is related?
Are there any additional logs or information stored while the training is going on that I could check?
EDIT: Reducing the batch size to "1" shows the same behaviour, the progress bar now shows 96% and quits.
what gpu model are you using? my settings are aimed at 11/12GB models (1080ti / titan x)
I tried it with K80 and Tesla V100, same results.
Okay, I think I was able to debug why this code does not work for finetuning (at least my case).
Basically I think this condition:
https://github.com/lopuhin/transformer-lm/blob/master/lm/main.py#L185
does not account for the fact that the resuming might happen with a new training set, which can be much smaller than the original one. The loop exits immediately, because it finds that it has already seen many more tokens than the current dataset contains, so it concludes that training must be finished.
In the case of the German model, it has already seen 4202621952 tokens, while my new finetuning dataset has only 419935.
How could this be fixed?
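The numbers in the earlier logs fit this explanation. Assuming the loop condition has the form `seen_tokens < epochs * epoch_size` (inferred from this discussion, not copied from the linked line), the sketch below plugs in the values from the log: the resumed counter of 2,835,581,952 tokens, the progress-bar total of 417,792 tokens per epoch, and `epochs = 10`. `training_plan` is a hypothetical helper.

```python
def training_plan(seen_tokens, epoch_size, epochs):
    """Evaluate the assumed loop condition `seen_tokens < epochs * epoch_size`."""
    completed_epochs, offset = divmod(seen_tokens, epoch_size)
    return {
        "budget": epochs * epoch_size,
        "completed_epochs": completed_epochs,
        "offset_in_epoch": offset,
        "will_train": seen_tokens < epochs * epoch_size,
    }


# Values taken from the log output earlier in the thread.
plan = training_plan(seen_tokens=2_835_581_952, epoch_size=417_792, epochs=10)
print(plan)
# prints: {'budget': 4177920, 'completed_epochs': 6787, 'offset_in_epoch': 27648, 'will_train': False}
```

The 6787 completed "epochs" and the offset of 27648 out of 417792 (about 7%) match the `epochs: 6787it` counter and the progress bar in the log, and the token budget of roughly 4.2 million is far below the resumed counter, so no training step would run.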
uh, wow, nice find! congrats you got to the bottom of this :)
haven't had a chance to look deeper into this one, but maybe we could add a --finetune command line option which would simply make load_model() set seen_tokens back to zero?
Hi, this sounds great - did you implement the --finetune flag?
Quite some time has passed, but I just wonder if you were able to find a solution? Also, I'm not sure if line 185 in the code is still the same one you were referring to back then.
So i've tried finetuning the german model, by just setting the seen tokens back to 0, as was suggested.
def load_model():
    nonlocal seen_tokens
    if torch.cuda.is_available():
        state = torch.load(model_path)
    else:
        state = torch.load(model_path, map_location=torch.device('cpu'))
    if 'seen_tokens' in state:
        seen_tokens = state['seen_tokens']
    else:  # legacy format
        seen_tokens = state['step'] * step_tokens
    if finetune:
        seen_tokens = 0
    state_dict = fixed_state_dict(state['state_dict'])
    model.load_state_dict(state_dict)
    if torch.cuda.is_available():
        optimizer.load_state_dict(torch.load(optimizer_path))
    else:
        optimizer.load_state_dict(torch.load(optimizer_path, map_location=torch.device('cpu')))
    print(f'Resuming from seen_tokens {seen_tokens:,}')
But the fine-tuned model performs much worse, and not at all like it did before.
I'm getting output like:
der dem das den es ist der das
So my question would be: is there anything else, that i have to take into account, when finetuning such a model?
Or might it just be, that my finetuning dataset just isn't good? (the size of the encoded training set is around 1.5MB)
@SaschaStenger Were you able to fix the problem? I am also using the model by @gooofy and i am unable to finetune it. I will apply your patch, if you were able to succeed.
Sorry, so far i haven't been able to. But i'm still very interested in a solution and will look into it again and post any solution that i might find. Although if anyone else has any suggestions on how to enable finetuning on this, i'd be more than happy to try them out.
@SaschaStenger I am trying few things. Will surely let you know if all goes well.
Thank you @hafsabukhary. I wanted to ask, if any of your approaches might have been fruitful.
@SaschaStenger I used the old main.py from https://github.com/gooofy/transformer-lm/tree/master/lm and updated the following code in train:
prev_tokens = 0
if finetune:
    print('fine tuning enabled')
    prev_tokens = seen_tokens
while seen_tokens < prev_tokens + (epochs * epoch_size):
this way training continues. you have to use default parameters of German model. e.g., vocab_size,
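To compare the two ideas from this thread (resetting `seen_tokens` to zero versus offsetting the loop bound by the resumed counter), here is a small simulation that only counts how many steps each variant would run. It is a sketch under assumptions: the loop form is inferred from the snippet above, `count_steps` is a hypothetical helper, and `step_tokens = 3072` assumes batch size 3 times a context of 1024 tokens.

```python
def count_steps(seen_tokens, epochs, epoch_size, step_tokens, finetune=False, reset=False):
    """Count the steps a loop of the assumed form would run before stopping."""
    if reset:
        seen_tokens = 0  # the --finetune idea: forget the old counter
    # The offset idea: extend the bound by the tokens seen before resuming.
    prev_tokens = seen_tokens if finetune else 0
    steps = 0
    while seen_tokens < prev_tokens + epochs * epoch_size:
        seen_tokens += step_tokens
        steps += 1
    return steps


# Counter value and dataset size quoted earlier in the thread.
common = dict(seen_tokens=4_202_621_952, epochs=10, epoch_size=417_792, step_tokens=3072)
print("plain resume:", count_steps(**common))
print("offset bound:", count_steps(**common, finetune=True))
print("reset to zero:", count_steps(**common, reset=True))
# prints 0, 1360 and 1360 steps respectively
```

Both variants run the same number of steps here; they differ in which counter value ends up in the next saved checkpoint, since the offset variant keeps the cumulative count while the reset variant starts over.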
hi, i have used the old main.py and tried to update the code as @hafsabukhary and @SaschaStenger proposed. but i still have the same problem as @SaschaStenger describes:
But the fine-tuned model performs much worse, and not at all like it did before. I'm getting output like:
der dem das den es ist der das
i am trying to finetune on a specific dataset. shouldn't the finetuned model have the same grammatical quality as the german model 346 from the beginning, and pick up the new vocabulary better with every iteration?
has anybody found a solution for this problem? does finetuning work for you?
thank you.
I'm having similar issues. I did add some more general text to my finetuning dataset, but it still takes quite a few iterations until it produces anything intelligible. And even then it is nowhere near the original performance. Any help in this matter would be greatly appreciated.
Made another finetuning test of around 970 epochs. Now it sometimes seems to overfit, generating sentences that are the same as in the corpus that i use (3,1 MB .txt); at other times it just sticks random snippets together without any sense.
Hi,
just wondering, since you are basing the tf train.py on nshepperd's finetuning script, I was wondering if this code also supports finetuning, or whether models trained here from scratch are finetunable with nshepperd's train.py?
Best regards.