huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Error when defining a custom_metric in the Trainer #132

Closed · samir-souza closed this issue 20 hours ago

samir-souza commented 1 year ago

The Trainer fails when you define a custom metric (compute_metrics), even for small models and small batch sizes. I tried different metrics as well, but it always fails; if you remove the custom metric, everything works well. My suspicion is that it uses gRPC under the hood and that the message buffer is limited to 4194304 bytes. If that is the case, maybe you could provide a way to override this value, e.g. via an environment variable?

 File "/home/ubuntu/Optimum/optimvenv/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 1224, in mesh_reduce
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:377 : **Failed to meet rendezvous 'nested_gather': Received message larger than max (250042240 vs. 4194304) (8)**
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:377 : Failed to meet rendezvous 'nested_gather': Received message larger than max (250042240 vs. 4194304) (8)
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:377 : Failed to meet rendezvous 'nested_gather': Received message larger than max (250042240 vs. 4194304) (8)
    xdata = rendezvous(tag, bio.getvalue())
  File "/home/ubuntu/Optimum/optimvenv/lib/python3.8/site-packages/torch_xla/core/xla_model.py", line 1176, in rendezvous

Code to reproduce the problem:

import os
import math
import torch
from evaluate import load
from datasets import load_from_disk

import transformers
from transformers import AutoTokenizer
from transformers import DataCollatorForLanguageModeling
from transformers import AutoModelForCausalLM, TrainingArguments
from optimum.neuron import TrainiumTrainer as Trainer

bf16=True
num_epochs=1
block_size = 128
output_dir='output'
learning_rate=5e-5
per_device_eval_batch_size=2
per_device_train_batch_size=2
model_id='bert-base-uncased'
dataset_path='datasets/eli5'

perplexity = load("perplexity", module_type="metric")

def compute_metrics(eval_pred):
    # eval_pred.predictions holds the full logits gathered from every worker,
    # which is what produces the oversized rendezvous payload shown above.
    predictions, labels = eval_pred
    predictions = predictions[:, 0]
    return perplexity.compute(predictions=predictions, references=labels)

if __name__=='__main__':
    os.environ['TOKENIZERS_PARALLELISM'] = 'false'

    train_dataset=load_from_disk(os.path.join(dataset_path, 'train'))
    eval_dataset=load_from_disk(os.path.join(dataset_path, 'eval'))

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token_id = 0

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    training_args = TrainingArguments(
        evaluation_strategy="epoch",
        learning_rate=learning_rate,
        weight_decay=0.01,
        bf16=bf16,
        num_train_epochs=num_epochs,
        output_dir=output_dir,
        overwrite_output_dir=True,
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        logging_dir=f"{output_dir}/logs",
        logging_strategy="steps",
        logging_steps=500,
        save_strategy="epoch",
        save_total_limit=2,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )

    trainer.train()

    eval_results = trainer.evaluate()
    print(f"Loss: {eval_results['eval_loss']}")
HuggingFaceDocBuilderDev commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!

github-actions[bot] commented 6 days ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 20 hours ago

This issue was closed because it has been stalled for 5 days with no activity.