huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[Kosmos-2] #28599

Closed: basteran closed this issue 5 months ago

basteran commented 9 months ago

System Info

Who can help?

I think the person in charge of Kosmos-2 is @ydshieh

Information

Tasks

Reproduction

This issue refers to another issue reported on the official Kosmos repository!

Hello everyone, thank you very much for your contributions. I appreciate the effort and consistency that go into uploading the code for so many models and maintaining this repository.

I saw Kosmos-2 and quickly thought I could fine-tune it on my downstream task, but I couldn't find any example of how to do it. The official Kosmos repository has a short "guide" for training the model, but I don't know whether it refers to pre-training or further fine-tuning; I'm interested in the latter.

So I tried to implement it myself using the transformers library, but I'm getting errors during the fine-tuning procedure.

from PIL import Image
from datasets import Dataset, load_dataset
from transformers import AutoModelForVision2Seq, AutoProcessor, Trainer, TrainingArguments

model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224", device_map="auto")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")  # processors are device-agnostic

# load dummy datasets from json files (load_dataset returns a DatasetDict keyed by split)
train_data = load_dataset("json", data_files=tmp_train_file_name)["train"]
val_data = load_dataset("json", data_files=tmp_val_file_name)["train"]

# process the inputs, i.e. images and texts
def kosmos2_collate_fn(examples):
    images, texts = [], []
    for example in examples:
        image = Image.open(example['image_path'])
        images.append(image)
        texts.append(example['input_text'])

    # the processor returns pixel_values, input_ids, attention_mask and
    # image_embeds_position_mask, but no labels
    inputs = processor(text=texts, images=images, return_tensors="pt").to(model.device)
    return Dataset.from_dict(inputs)

new_train_data = kosmos2_collate_fn(train_data)
new_val_data = kosmos2_collate_fn(val_data)
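# (aside: I suspect Dataset.from_dict turns the tensors back into plain Python
# lists; if the Trainer ever complains about input types, re-enabling torch
# formatting should help. This is my guess, not something I have verified.)
# new_train_data = new_train_data.with_format("torch")
# new_val_data = new_val_data.with_format("torch")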

training_arguments = TrainingArguments(
    remove_unused_columns=False, 
    per_device_train_batch_size=MICRO_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    warmup_ratio=0,
    num_train_epochs=EPOCHS,
    learning_rate=LEARNING_RATE,
    logging_strategy="steps",
    logging_steps=1,
    optim="adamw_torch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    output_dir=OUTPUT_DIR,
    save_total_limit=1,
    load_best_model_at_end=True,
    label_names=["labels"]
)

trainer = Trainer(
    model=model,
    train_dataset=new_train_data,
    eval_dataset=new_val_data,
    args=training_arguments,
)

trainer.train()

and the resulting error:

Generating train split: 40 examples [00:00, 8627.15 examples/s]
Generating train split: 6 examples [00:00, 2428.20 examples/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
  0%|          | 0/10 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/user/kosmos2/train.py", line 193, in <module>
    trainer.train()
  File "/home/user/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/user/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/user/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/user/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2776, in compute_loss
    raise ValueError(
ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values,image_embeds,projection_attentions,vision_model_output. For reference, the inputs it received are pixel_values,input_ids,attention_mask,image_embeds_position_mask.
  0%|          | 0/10 [00:03<?, ?it/s] 

I can't figure out the issue. The error says the model did not return a loss, i.e. it never computed one. It looks like the processor does not return any labels, so the Trainer has nothing to compute the loss against...
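If I had to guess, the usual causal language-modeling recipe would be to clone the input_ids as labels before handing the batch to the Trainer. A rough sketch of what I mean (the padding flag and the -100 masking are my assumptions; I don't know whether Kosmos-2 expects exactly this):

inputs = processor(text=texts, images=images, padding=True, return_tensors="pt")
labels = inputs["input_ids"].clone()
labels[inputs["attention_mask"] == 0] = -100  # ignore padding positions in the loss
inputs["labels"] = labels                     # with labels present, the model should return a loss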

Expected behavior

I would expect to train the model on my data, i.e. to compute the loss, perform gradient updates, etc.

ydshieh commented 9 months ago

Hi, see my comment

https://github.com/microsoft/unilm/issues/1429#issuecomment-1900139771

(I just saw you also opened an issue here before I replied there)

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

basteran commented 8 months ago

Hi, I am still working on this issue with @ydshieh; we will update it whenever we have news!

ydshieh commented 6 months ago

Hi @basteran

Please see https://discuss.huggingface.co/t/kosmos-2-fine-tuning/75691/32?u=ydshieh
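
The short version, in case the link goes stale: build the labels inside a data collator and let the Trainer call it batch by batch, instead of materializing a processed Dataset up front. Roughly (this is a sketch of the general approach, not the exact script from that thread):

def collate_fn(examples):
    # open the raw images and texts for this batch only
    images = [Image.open(e["image_path"]).convert("RGB") for e in examples]
    texts = [e["input_text"] for e in examples]
    batch = processor(text=texts, images=images, padding=True, return_tensors="pt")
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # no loss on padding
    batch["labels"] = labels
    return batch

trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=train_data,  # the raw json split; the collator does the processing
    eval_dataset=val_data,
    data_collator=collate_fn,
)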

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.