huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

run_lm_finetuning.py does not define a do_lower_case argument #177

Closed nikitakit closed 5 years ago

nikitakit commented 5 years ago

The file references args.do_lower_case, but doesn't have the corresponding parser.add_argument call.
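
For reference, the missing definition presumably just mirrors the flag in the other example scripts (the exact help text below is my guess):

```python
import argparse

parser = argparse.ArgumentParser()
# The flag the script reads as args.do_lower_case but never defines;
# mirroring how the other examples (e.g. run_classifier.py) declare it.
parser.add_argument("--do_lower_case",
                    action="store_true",
                    help="Set this flag if you are using an uncased model.")
args = parser.parse_args()
```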

As an aside, has anyone successfully applied LM fine-tuning for a downstream task (using this code, or maybe using the original TensorFlow implementation)? I'm not even sure the code will run in its current state. And after fixing this issue locally, I've had no luck using the output of fine-tuning: I have a model that gets state-of-the-art results with pre-trained BERT, but after fine-tuning it performs no better than omitting BERT/pre-training entirely! I don't know whether to suspect that there are other bugs in the example code, or that the hyperparameters in the README are just a very poor starting point for what I'm doing.

nikitakit commented 5 years ago

On a related note: I see there is learning rate scheduling happening here, but also inside the BertAdam class. Isn't this redundant, and possibly a bug, since the same warmup-linear schedule ends up being applied twice? For reference, I'm not using FP16 training, which has its own separate optimizer that doesn't appear to double-apply the schedule.

The same is true for other examples such as SQuAD (maybe it's the cause of #168, where results were reproduced only when using float16 training?)
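
To make the concern concrete, here is a minimal sketch of the arithmetic as I understand it (warmup_linear comes from pytorch_pretrained_bert.optimization; the numbers are placeholders, not the README values):

```python
from pytorch_pretrained_bert.optimization import warmup_linear

# Placeholder values standing in for args.learning_rate, args.warmup_proportion
# and the total number of optimization steps.
base_lr = 5e-5
warmup_proportion = 0.1
t_total = 1000

for global_step in (50, 200, 800):
    progress = global_step / t_total
    # The training loop overwrites param_group['lr'] with the scheduled value...
    lr_in_loop = base_lr * warmup_linear(progress, warmup_proportion)
    # ...and BertAdam (constructed with warmup=... and t_total=...) multiplies the
    # current group lr by the same warmup_linear factor again inside step(),
    # so the effective learning rate is roughly the schedule applied twice:
    lr_effective = lr_in_loop * warmup_linear(progress, warmup_proportion)
    print(global_step, lr_in_loop, lr_effective)
```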

thomwolf commented 5 years ago

Pinging @tholor here as well: maybe you have some feedback from using the fine-tuning script?

nikitakit commented 5 years ago

I figured out why I was seeing such poor results while attempting to fine-tune: the example saves model.bert instead of model to pytorch_model.bin, so the resulting file can't just be zipped up and loaded with from_pretrained.
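
For anyone hitting the same thing, the save-and-reload pattern that ended up working for me looks roughly like this (a sketch for my setup: the output path is a placeholder, and I'm using BertForPreTraining since that's the head the script trains):

```python
import os
import tarfile
import torch
from pytorch_pretrained_bert import BertForPreTraining

output_dir = "finetuned_lm"  # placeholder
os.makedirs(output_dir, exist_ok=True)

model = BertForPreTraining.from_pretrained("bert-base-uncased")
# ... fine-tuning loop goes here ...

# Save the whole wrapper's weights (not model.bert), so the parameter names in
# the checkpoint match what from_pretrained expects later.
model_to_save = model.module if hasattr(model, "module") else model
torch.save(model_to_save.state_dict(), os.path.join(output_dir, "pytorch_model.bin"))
with open(os.path.join(output_dir, "bert_config.json"), "w") as f:
    f.write(model_to_save.config.to_json_string())

# Bundle the two files into an archive and point from_pretrained at it.
archive_path = os.path.join(output_dir, "finetuned_lm.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(os.path.join(output_dir, "pytorch_model.bin"), arcname="pytorch_model.bin")
    archive.add(os.path.join(output_dir, "bert_config.json"), arcname="bert_config.json")

reloaded = BertForPreTraining.from_pretrained(archive_path)
```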

tholor commented 5 years ago

I have just fixed the do_lower_case bug and adjusted the code for model saving to be in line with the other examples (see #182 ). I hope this solves your issue. Thanks for reporting!

> As an aside, has anyone successfully applied LM fine-tuning for a downstream task (using this code, or maybe using the original TensorFlow implementation)?

We are currently using a fine-tuned model on a rather technical corpus and see improvements in the extracted document embeddings compared to the original pre-trained BERT. However, we haven't done intensive hyperparameter testing or performance comparisons with the original pre-trained model yet; this is all still a work in progress on our side. If you have results that you can share publicly, I would be interested to see what difference you achieve. In general, I would only expect improvements for target corpora whose language style is very different from Wiki/BooksCorpus.
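
For context, the comparison we run is roughly the following (a minimal sketch with pytorch-pretrained-BERT; the model name and the mean-pooling strategy are just what we happen to use, not a recommendation):

```python
import torch
from pytorch_pretrained_bert import BertModel, BertTokenizer

# "bert-base-uncased" stands in for either the original checkpoint or the
# fine-tuned one; swap in whichever you want to compare.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

text = "an example sentence from our technical corpus"
tokens = ["[CLS]"] + tokenizer.tokenize(text) + ["[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    encoded_layers, pooled_output = model(input_ids)

# One simple document embedding: mean-pool the last encoder layer.
doc_embedding = encoded_layers[-1].mean(dim=1)  # shape (1, hidden_size)
```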

> On a related note: I see there is learning rate scheduling happening here, but also inside the BertAdam class.

We have only trained with fp16 so far. @thomwolf, have you experienced issues with LR scheduling in the other examples? I just copied the code from there.

nikitakit commented 5 years ago

Thanks for fixing these!

After addressing the save/load mismatch I'm seeing downstream performance comparable to using pre-trained BERT. I did get a big scare when the default logging configuration was too quiet to tell me that weights were being randomly re-initialized instead of loaded from the file I specified. It's still too early for me to tell whether fine-tuning brings actual benefits, though.
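
In case it bites someone else: those re-initialization messages are emitted at INFO level, so they only show up if Python logging is configured accordingly before loading the model (a minimal sketch; the checkpoint path is a placeholder):

```python
import logging

from pytorch_pretrained_bert import BertModel

# The "Weights of ... not initialized from pretrained model" message is logged
# at INFO level, so raise the verbosity before calling from_pretrained.
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    level=logging.INFO,
)

model = BertModel.from_pretrained("finetuned_lm.tar.gz")  # placeholder path
```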

thomwolf commented 5 years ago

All this looks fine on master now. Please open a new issue (or re-open this one) if there are other issues.

davidefiocco commented 5 years ago

I saw on https://github.com/huggingface/pytorch-pretrained-BERT/issues/126#issuecomment-451910577 that there's potentially some documentation effort underway beyond the README. Thanks a lot for this!

I wonder if it would be possible to add more detail about how to properly prepare a custom corpus to fine-tune the models on (e.g. to avoid catastrophic forgetting). I'm asking because my (few, so far) attempts to fine-tune on other corpora have hurt performance on GLUE tasks compared to the original models (I just discovered this issue; maybe the problems mentioned above affected me too).

Kudos @thomwolf @tholor for all your work on this!