lopuhin / transformer-lm

Transformer language model (GPT-2) with sentencepiece tokenizer

Finetuning #19

Open Stamenov opened 5 years ago

Stamenov commented 5 years ago

Hi,

just wondering: since the TF train.py is based on nshepperd's fine-tuning script, does this code also support fine-tuning? Or are models trained here from scratch fine-tunable with nshepperd's train.py?

Best regards.

lopuhin commented 5 years ago

Hi, it's possible to resume training from a checkpoint (so it's the same functionality as fine-tuning), but it's not possible to fine-tune the original GPT-2 model, because the tokenizer is different.
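For reference, a minimal sketch of the tokenizer mismatch, assuming the sp-model.model sentencepiece file that ships with models trained by this repo (mentioned later in the thread): the same text maps to a different vocabulary and different token IDs than the original GPT-2 BPE, so the original checkpoint's embedding and output layers would not line up.

```python
# Minimal sketch, assuming a sentencepiece model file such as sp-model.model;
# the original GPT-2 uses its own BPE vocabulary, so these IDs are not compatible.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("sp-model.model")
print(sp.GetPieceSize())                      # vocabulary size (50000 for these models)
print(sp.EncodeAsIds("Das ist ein Test."))    # sentencepiece token ids
```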

Stamenov commented 5 years ago

I am currently looking at fine-tuning these models, which were trained with this repo, or at least a fork of it. So I guess simply resuming with the TF version would suffice. Thanks.

lopuhin commented 5 years ago

Oh nice, thanks for sharing the link. Then yes, fine-tuning should work.

Stamenov commented 5 years ago

Now that I think about it, I am not sure if the models are TF or torch. Is there a way to find out, given the model files? (screenshot of the model directory attached) Thanks again!

lopuhin commented 5 years ago

These are pytorch models, which is good, because TF code is not really supported, while pytorch code is better developed and supported.
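If it helps to double-check, a checkpoint from this repo is a plain dict saved with torch.save, so it can be inspected directly, whereas a TF checkpoint would come as .ckpt/.index/.meta files. A minimal sketch (the model.pt filename is an assumption):

```python
# Minimal sketch, assuming the model file is named model.pt (filename is a guess);
# this repo's checkpoints load as a dict with keys like 'state_dict' and 'seen_tokens'.
import torch

state = torch.load("model.pt", map_location="cpu")
print(type(state))
print(list(state.keys()))
```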

Stamenov commented 5 years ago

Cool, will try on the weekend, thanks for the blazing fast responses 🥇

gooofy commented 5 years ago

hey, cool - thanks for trying out my GPT-2 models! would be happy to hear your feedback on these.

the larger GPT-2 model is still training, so if you want I can provide an updated model this week which should have slightly lower loss than the one released so far.

Stamenov commented 5 years ago

Hey @gooofy, this would be very cool, please do! Thanks.

Stamenov commented 5 years ago

Hi, it's me again. I am not sure this is the right thread to follow up in, so feel free to move it / let me know. I am trying to start an adaptation from the 355M German model, but I seem to get a mismatch in the layer sizes. I guess I need the hyperparameters from the initial training. This is the hyperparameter configuration I get at the beginning of training:

"batch_size": 2,
"epochs": 10,
"g_accum_gradients": 1,
"hparams": {
    "gradient_checkpointing": false,
    "n_ctx": 1024,
    "n_embed": 768,
    "n_head": 12,
    "n_hidden": 768,
    "n_layer": 12,
    "n_vocab": 50000
},
"lr": 0.00025
}

Loading dataset from /bpe
Train dataset has 419,935 tokens
Validation dataset has 3,608 tokens

And this is a small part of the size mismatch errors:

size mismatch for blocks.11.attn.c_attn.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([2304, 768]).
size mismatch for blocks.11.attn.c_attn.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([2304]).
size mismatch for blocks.11.attn.c_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for blocks.11.attn.c_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for ln_f.g: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for ln_f.b: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).

lopuhin commented 5 years ago

Right, on each invocation you'll need to set all hyperparameters, and the error is indeed due to a hyperparameter mismatch. I think the correct hyperparameters should be in the params.json file which comes with the model - unfortunately we currently can't load them automatically.
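As a stopgap, a small sketch for turning the hparams stored in params.json back into matching command-line flags (the de345-root directory name is taken from the command further down in this thread and may differ for your download):

```python
# Minimal sketch, assuming params.json sits in the downloaded model directory.
import json
from pathlib import Path

params = json.loads(Path("de345-root/params.json").read_text())
flags = " ".join(f"--{name}={value}" for name, value in params["hparams"].items())
print(flags)  # e.g. --gradient_checkpointing=True --n_ctx=1024 --n_embed=1024 ...
```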

Stamenov commented 5 years ago

Is there a CLI option for the hyperparams? I can't seem to find one.

lopuhin commented 5 years ago

Yes, it's defined implicitly via the fire library, so all main() arguments are settable via command-line arguments. Also, params.json should contain the full argument string, which can serve as an example.

Stamenov commented 5 years ago

I guess resuming is also implicit, whenever the *.pt files are present in the model directory; furthermore, params.json is overwritten on each invocation with the current arguments.

lopuhin commented 5 years ago

Indeed, resuming is implicit here: https://github.com/lopuhin/transformer-lm/blob/fa3f529ff300a30cd984ea72a4e23b525b6e3f52/lm/main.py#L139-L140 and right, params.json file will be overwritten, which is not great.
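One small workaround, sketched under the assumption of the directory layout used in this thread: copy the shipped params.json aside before resuming, so the original hyperparameters are not lost when the file gets overwritten.

```python
# Minimal sketch: back up the original params.json before the next invocation
# overwrites it (the run directory name is an assumption).
import shutil
from pathlib import Path

run_dir = Path("de345-root")
src = run_dir / "params.json"
if src.exists():
    shutil.copy(src, run_dir / "params.orig.json")
```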

Stamenov commented 5 years ago

Hey, I am running into some further problems trying to resume from the big German model, even after I set the params. I would appreciate any help. Also, again, I am running the latest version of @gooofy's fork:

Train dataset has 419,935 tokens
Validation dataset has 3,608 tokens
Resuming from seen_tokens 2,835,581,952
epochs: 6787it [00:00, 41376077.40it/s]
Traceback (most recent call last):                | 0/417792 [00:00<?, ?it/s]
  File "/home/martin/miniconda/envs/topics/bin/gpt-2", line 11, in <module>
    load_entry_point('lm', 'console_scripts', 'gpt-2')()
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 322, in fire_main
    fire.Fire(only_allow_defined_args(main))
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/fire_utils.py", line 30, in _return_wrapped
    return function_to_decorate(*args, **kwargs)
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 259, in main
    train()
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 213, in train
    validate()
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 219, in validate
    valid_loss=get_valid_loss())
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/main.py", line 233, in get_valid_loss
    logits = model(ctx)['logits']
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/martin/dev/gtp2/gpt-2-german/transformer-lm/lm/model.py", line 53, in forward
    h, present = torch.utils.checkpoint.checkpoint(block, h, past[:, i] if past is not None else None)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 128, in checkpoint
    return CheckpointFunction.apply(function, *args)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 34, in forward
    check_backward_validity(args)
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 20, in check_backward_validity
    if not any(inp.requires_grad for inp in inputs):
  File "/home/martin/miniconda/envs/topics/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 20, in <genexpr>
    if not any(inp.requires_grad for inp in inputs):
AttributeError: 'NoneType' object has no attribute 'requires_grad'
epochs: 6787it [01:16, 88.50it/s]
7%|██████▍

lopuhin commented 5 years ago

Hmm I see, this looks related to gradient checkpointing (which I haven't had a chance to try yet) - I wonder if it will work if you disable it? Could be something else as well, hard to tell, sorry.
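For what it's worth, the traceback points at torch.utils.checkpoint receiving None as one of its inputs (the past slice) rather than at the model weights: older torch versions call .requires_grad on every checkpoint input, while newer ones skip non-tensor inputs in that check. A minimal, self-contained repro sketch (not this repo's code):

```python
# Minimal repro sketch (not this repo's code): on older torch (e.g. 1.0.x),
# passing None through torch.utils.checkpoint raises the AttributeError above,
# because check_backward_validity calls .requires_grad on every input.
import torch
import torch.utils.checkpoint as cp

def block(x, past=None):
    # stand-in for a transformer block
    return x * 2

x = torch.randn(2, 3, requires_grad=True)
out = cp.checkpoint(block, x)        # fine: all inputs are tensors
out = cp.checkpoint(block, x, None)  # AttributeError on torch 1.0.x; reportedly fine on the torch 1.2 used above
```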

gooofy commented 5 years ago

here is the command line I am using for training this model - does this help?

gpt-2 de345-root data/encoded-de sp-model.model --n_embed=1024 --n_head=16 --n_layer=24 --batch_size=3 --gradient_checkpointing --save_every=5000

params.json:

{
  "argv": "/home/bofh/projects/ai/torch/bin/gpt-2 de345-root data/encoded-de sp-model.model --n_embed=1024 --n_head=16 --n_layer=24 --batch_size=3 --gradient_checkpointing --save_every=5000",
  "batch_size": 3,
  "epochs": 10,
  "g_accum_gradients": 1,
  "hparams": {
    "gradient_checkpointing": true,
    "n_ctx": 1024,
    "n_embed": 1024,
    "n_head": 16,
    "n_hidden": 1024,
    "n_layer": 24,
    "n_vocab": 50000
  },
  "lr": 0.00025
}

Stamenov commented 5 years ago

I have now disabled the gradient checkpointing, and I get stuck at the same place, but no error this time:

--n_embed=1024 --n_head=16 --n_layer=24 --batch_size=3 --gradient_checkpointing=0 --save_every=5000",
"batch_size": 3,
"epochs": 10,
"g_accum_gradients": 1,
"hparams": {
    "gradient_checkpointing": 0,
    "n_ctx": 1024,
    "n_embed": 1024,
    "n_head": 16,
    "n_hidden": 1024,
    "n_layer": 24,
    "n_vocab": 50000
},
"lr": 0.00025
}
Loading dataset from /bpe
Train dataset has 419,935 tokens
Validation dataset has 3,608 tokens
Resuming from seen_tokens 2,835,581,952
epochs: 6787it [01:17, 87.53it/s]
7%|███████▍ | 27648/417792 [01:17<18:14, 356.55it/s]

gooofy commented 5 years ago

just a wild guess: maybe you're using a different torch version?

lm                      0.1.0             /home/bofh/projects/ai/torch/transformer-lm
pytorch-pretrained-bert 0.6.2
torch                   1.2.0a0+6f6a680

Stamenov commented 5 years ago

Aahh, I am indeed.

pytorch-pretrained-bert 0.6.1
torch 1.0.1.post2

Stamenov commented 5 years ago

Just wondering, which CUDA version do you use? 10.0?

gooofy commented 5 years ago

yes, 10.0

gooofy commented 5 years ago

new release has finished uploading, available here: https://zamia.org/brain/

trained for 4.5 epochs on 27GB text corpus

Stamenov commented 5 years ago

Hi, for some reason, even after I installed pytorch 1.2.0, CUDA 10, conda with Python 3.7.4 and NVIDIA drivers "NVIDIA-SMI 410.104", the training just quits after 7%, with no error message, similarly to my previous post.

attrs 19.1.0
certifi 2019.6.16
cycler 0.10.0
filelock 3.0.12
fire 0.1.3
json-lines 0.5.0
json-log-plots 0.0.1
kiwisolver 1.1.0
lm 0.1.0 /home/martin/dev/gtp2/gpt-2-german/transformer-lm
matplotlib 3.0.3
numpy 1.16.2
pip 19.2.2
pyparsing 2.4.2
python-dateutil 2.8.0
sentencepiece 0.1.8
setuptools 41.0.1
six 1.12.0
torch 1.2.0
tqdm 4.31.1
wheel 0.33.4

lopuhin commented 5 years ago

@Stamenov I wonder if this could be some bug in the resume code, I didn't test it that much. Does the progress bar jump to 7% immediately, or does it get there after some time? There is no error message printed, right? Can you check the exit code? Also, I wonder if training from scratch works for you (to narrow down the issue)?

gooofy commented 5 years ago

I think I resumed training for this model several times over the weeks and never noticed any issue. There is however still this so far unexplained loss spike that happened pretty early in the training, not sure if this could be related. (loss curve attached: loss_de345-root)

Stamenov commented 5 years ago

@lopuhin It does take some time to get there; it also just jumps to 7% from 0, after using my GPU for some time (reported using nvidia-smi) and after briefly showing a "0/3 validation" progress bar just below the overall progress bar.

Training from scratch works, but only with the default params. With the ones from the German model, as supplied by @gooofy, I get a CUDA out of memory error. Maybe this is related?

Are there any additional logs or information stored while the training is going on that I could check?

EDIT: Reducing the batch size to 1 shows the same behaviour; the progress bar now reaches 96% and quits.

gooofy commented 5 years ago

what gpu model are you using? my settings are aimed at 11/12GB models (1080ti / titan x)

Stamenov commented 5 years ago

I tried it with K80 and Tesla V100, same results.

Stamenov commented 5 years ago

Okay, I think I was able to debug why this code does not work for finetuning (at least in my case). Basically, I think this condition: https://github.com/lopuhin/transformer-lm/blob/master/lm/main.py#L185 does not account for the fact that resuming might happen with a new training set, which can be much smaller than the original one. The training loop exits immediately, as it finds that it has already seen many more tokens than the current dataset contains, so it considers training finished. In the case of the German model, it has already seen 4,202,621,952 tokens, while my new finetuning dataset has only 419,935. How could this be fixed?
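To make the failure mode concrete, here is a sketch of the loop condition as described (numbers taken from this thread; the variable names paraphrase lm/main.py rather than quoting it exactly):

```python
# Sketch of the described condition, with the numbers from this thread.
epochs = 10
epoch_size = 419_935             # tokens in the new, small fine-tuning dataset
seen_tokens = 4_202_621_952      # tokens already seen by the resumed German checkpoint

while seen_tokens < epochs * epoch_size:
    # never entered: seen_tokens from the checkpoint already exceeds
    # epochs * epoch_size computed for the new dataset
    ...
print("training loop exits immediately")
```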

gooofy commented 5 years ago

uh, wow, nice find! congrats you got to the bottom of this :)

haven't had a chance to look deeper into this one, but maybe we could add a --finetune command line option which would simply make load_model() set seen_tokens back to zero?

hbajohr commented 4 years ago

uh, wow, nice find! congrats you got to the bottom of this :)

haven't had a chance to look deeper into this one, but maybe we could add a --finetune command line option which would simply make load_model() set seen_tokens back to zero?

Hi, this sounds great - did you implement the --finetune flag?

khalo-sa commented 4 years ago

Okay, I think I was able to debug why this code does not work for finetuning (at least in my case). Basically, I think this condition: https://github.com/lopuhin/transformer-lm/blob/master/lm/main.py#L185 does not account for the fact that resuming might happen with a new training set, which can be much smaller than the original one. The training loop exits immediately, as it finds that it has already seen many more tokens than the current dataset contains, so it considers training finished. In the case of the German model, it has already seen 4,202,621,952 tokens, while my new finetuning dataset has only 419,935. How could this be fixed?

Quite some time has passed, but I just wonder if you were able to find a solution? Also, I'm not sure if line 185 in the code is still the same one you were referring to back then.

SaschaStenger commented 4 years ago

So I've tried finetuning the German model by just setting the seen tokens back to 0, as was suggested.

def load_model():
    nonlocal seen_tokens
    # load the checkpoint on GPU if available, otherwise map it to CPU
    if torch.cuda.is_available():
        state = torch.load(model_path)
    else:
        state = torch.load(model_path, map_location=torch.device('cpu'))
    if 'seen_tokens' in state:
        seen_tokens = state['seen_tokens']
    else:  # legacy format
        seen_tokens = state['step'] * step_tokens
    if finetune:
        # added for fine-tuning: pretend no tokens have been seen yet,
        # so the training loop does not exit immediately
        seen_tokens = 0
    state_dict = fixed_state_dict(state['state_dict'])
    model.load_state_dict(state_dict)
    # restore the optimizer state alongside the model weights
    if torch.cuda.is_available():
        optimizer.load_state_dict(torch.load(optimizer_path))
    else:
        optimizer.load_state_dict(torch.load(optimizer_path, map_location=torch.device('cpu')))
    print(f'Resuming from seen_tokens {seen_tokens:,}')

But the trained model performs much worse, and not at all as it did before. I'm getting output like: der dem das den es ist der das. So my question would be: is there anything else that I have to take into account when finetuning such a model? Or might it just be that my finetuning dataset isn't good? (The size of the encoded training set is around 1.5MB.)

hafsahabib-educator commented 4 years ago

@SaschaStenger Were you able to fix the problem? I am also using the model by @gooofy and I am unable to finetune it. I will apply your patch if you were able to succeed.

SaschaStenger commented 4 years ago

@SaschaStenger Were you able to fix the problem? I am also using the model by @gooofy and I am unable to finetune it. I will apply your patch if you were able to succeed.

Sorry, so far I haven't been able to. But I'm still very interested in a solution and will look into it again and post any solution that I might find. Although if anyone else has any suggestions on how to enable finetuning on this, I'd be more than happy to try them out.

hafsahabib-educator commented 4 years ago

@SaschaStenger I am trying a few things. Will surely let you know if all goes well.

SaschaStenger commented 4 years ago

Thank you @hafsabukhary. I wanted to ask if any of your approaches have been fruitful.

hafsahabib-educator commented 4 years ago

@SaschaStenger I used the old main.py from https://github.com/gooofy/transformer-lm/tree/master/lm and updated the following code in train():

        prev_tokens = 0
        if finetune:
            print('fine tuning enabled')
            # remember how many tokens the checkpoint has already seen,
            # so training runs for a full `epochs * epoch_size` more tokens
            prev_tokens = seen_tokens
        while seen_tokens < prev_tokens + (epochs * epoch_size):

This way training continues. You have to use the default parameters of the German model, e.g. vocab_size.

weibelbit commented 4 years ago

Hi, I have used the old main.py and tried to update the code as @hafsabukhary and @SaschaStenger proposed, but I still have the same problem as @SaschaStenger describes:

But the trained model performs much worse, and not at all as it did before. I'm getting output like: der dem das den es ist der das

I am trying to finetune on a specific dataset. Shouldn't the finetuned model, from the beginning, have the same grammatical quality as the German model 346, and incorporate the new vocabulary better with every iteration?

Has anybody found a solution for this problem? Does finetuning work for you?

thank you.

SaschaStenger commented 4 years ago

I am trying to finetune on a specific dataset. Shouldn't the finetuned model, from the beginning, have the same grammatical quality as the German model 346, and incorporate the new vocabulary better with every iteration?

Has anybody found a solution for this problem? Does finetuning work for you?

I'm having similar issues. I did add some more general text to my finetuning dataset, but it still takes quite a few iterations until it produces anything intelligible. And even then it is nowhere near the original performance. Any help in this matter would be greatly appreciated.

weibelbit commented 4 years ago

Made another finetuning test around 970 epochs. Now it sometimes seems to overfit, generating sentences that are identical to ones in the corpus I use (a 3.1 MB .txt); at other times it just sticks random snippets together without any sense.