Facing the same error; unable to load after fine-tuning. Any update?
Ping @patrickvonplaten, but also cc @younesbelkada and @ArthurZucker.
On it 👍
@Leli1024 @omerarshad If you don't mind and have some time, maybe you can try with the latest dev build? If you clone the repo, you can do it like `pip install --upgrade -e .[dev]`. (There are some minor fixes since then; I didn't check if they are related.)
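(Spelled out, that suggestion amounts to something like the sketch below; the quotes around `.[dev]` are not in the original command, but they avoid shell-glob errors in shells like zsh:)

```bash
# Sketch of the suggested from-source install of transformers.
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install --upgrade -e ".[dev]"
```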
Not sure if it is related, but it is possible that you used a version of transformers from before this PR was merged: #17225
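(If it helps, a quick way to check which version is installed before and after upgrading; nothing here is specific to this issue:)

```python
# Print the installed transformers version to compare against the
# release notes for the PR above.
import transformers

print(transformers.__version__)
```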
The dev build totally worked, thank you!!! Also, not to be pedantic, but I needed to remove '[dev]' from the command to run it. Just thought I should let anyone else having trouble with it know.
Great!
So building from source worked? Or is the patch released?
Building from source
I'm experiencing this issue when I try to use the Inference API to test a facebook/opt-350m model fine-tuned using transformers 4.19.3, 4.19.4, or 4.20.0, and even when I install directly from git like this:
python -m pip install git+https://github.com/huggingface/transformers
The error I'm seeing is identical to the one above:
Error(s) in loading state_dict for OPTForCausalLM: size mismatch for lm_head.weight: copying a param with shape torch.Size([50272, 512]) from checkpoint, the shape in current model is torch.Size([50272, 1024]).
If I download the model to my machine and run it using a pipeline, it works; it just seems to be an issue with the Inference API.
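(For reference, a minimal sketch of that local workaround; the model path below is a placeholder for wherever the fine-tuned checkpoint was downloaded:)

```python
from transformers import pipeline

# Placeholder path: a local copy of the fine-tuned OPT-350M checkpoint.
generator = pipeline("text-generation", model="/path/to/finetuned-opt-350m")
print(generator("Hello, world", max_new_tokens=20))
```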
Here are the package versions I'm using:
Hey, could you provide an example script to help us reproduce the error?
This seems to reproduce it for me:
```python
import pathlib

from datasets import DatasetDict
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    default_data_collator,
    Trainer,
    TrainingArguments,
)

HUGGINGFACE_API_KEY = "..."

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

    training_args = TrainingArguments(
        output_dir="/tmp/model",
        overwrite_output_dir=True,
        num_train_epochs=1,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        push_to_hub=True,
        hub_strategy="end",
        hub_model_id="17389",
        hub_token=HUGGINGFACE_API_KEY,
    )

    # Write a tiny dummy text dataset to disk.
    path = pathlib.Path("/tmp/data/dataset.txt")
    path.parent.mkdir(exist_ok=True)
    with path.open("w") as fp:
        for _ in range(10):
            fp.write("Hello, world\n")

    # Tokenize each line and reuse the input ids as labels (causal LM).
    def encode(batch):
        encodings = tokenizer(batch["text"], padding="max_length", truncation=True)
        encodings["labels"] = encodings["input_ids"].copy()
        return encodings

    dataset = DatasetDict.from_text(
        {"train": path.as_posix(), "validation": path.as_posix()}
    ).map(
        encode,
        remove_columns="text",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
        data_collator=default_data_collator,
    )
    trainer.train()
    trainer.save_model()
```
Just ran this on my machine and the resulting model is here: https://huggingface.co/dhorgan/17389
Hi @ArthurZucker, have you had any luck with this? I tried running the example code above again today with v4.20.1 after #17785 was merged, but nothing seems to have changed. The new model is here, if you're interested: https://huggingface.co/dhorgan/17389-test-fix
Hey! Yeah, I know where the bug is from! The Inference API is not up to date with the main branch of transformers! @Narsil is the one handling that, but he is on holiday! Gotta wait for a bit 😀
Hi @donaghhorgan,

You are not including the tokenizer in your Trainer, so it is not saved with your model: https://huggingface.co/dhorgan/17389-test-fix/tree/main

You can fix this by simply doing `tokenizer.save_pretrained('....')` and uploading it, or by doing `Trainer(tokenizer=tokenizer)` (I think; I don't use Trainer that often personally, but I have seen that being suggested and working).
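(A minimal sketch of both options, reusing the names from the repro script above; untested here, so treat it as illustrative rather than the definitive fix:)

```python
# Option 1: pass the tokenizer to the Trainer so it is saved and
# pushed alongside the model weights.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=default_data_collator,
    tokenizer=tokenizer,
)

# Option 2: save the tokenizer into the Trainer's output directory
# yourself before uploading to the Hub.
tokenizer.save_pretrained("/tmp/model")
```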
Anyhow, you can check the failure by doing:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dhorgan/17389-test-fix")
```

It should crash (because no tokenizer files are there).
That's great, thanks @Narsil! It's all working for me here now.
🐛 Bug
When the OPT-350M variant is fine-tuned via Hugging Face's Trainer, the resulting model will give the size-mismatch error quoted above when loaded (presumably 350M is the affected variant because its word embedding projection dimension, 512, differs from its hidden size, 1024).
Code to load model
Training Code
Dataset module