huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Trainer doesn't save evaluation metrics. #33733

Open filbeofITK opened 2 days ago

filbeofITK commented 2 days ago

System Info

Who can help?

@muellerzr @SunMarc

Information

Tasks

Reproduction

I'm trying to log the evaluation metrics of my model to TensorBoard so that I can monitor training. My compute_metrics function looks like this:

import evaluate
from numpy import argmax

# Metrics for the output are only loaded once
acc = evaluate.load("accuracy")
metrics = evaluate.combine(['precision', 'recall', 'f1'])

# Function to calculate metrics for evaluation
def compute_metrics(eval_pred):
    # Convert logits to predictions
    predictions = argmax(eval_pred.predictions, axis=-1)
    results = metrics.compute(predictions=predictions, references=eval_pred.label_ids, average='micro')
    results['accuracy'] = acc.compute(predictions=predictions, references=eval_pred.label_ids)['accuracy']
    return results
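For what it's worth, the function itself returns a populated dict when called directly; here is a minimal sanity check outside the Trainer (the dummy logits and labels below are made up for illustration):

import numpy as np
from transformers import EvalPrediction

# Made-up logits for 4 examples over 3 classes, plus matching labels
dummy_logits = np.array([[2.0, 0.1, 0.3],
                         [0.2, 1.5, 0.1],
                         [0.1, 0.2, 3.0],
                         [1.2, 0.3, 0.4]])
dummy_labels = np.array([0, 1, 2, 1])

print(compute_metrics(EvalPrediction(predictions=dummy_logits, label_ids=dummy_labels)))
# -> dict with 'precision', 'recall', 'f1' and 'accuracy' keys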

These are my training arguments:

training_args = TrainingArguments(
    torch_compile=True,
    torch_compile_mode="default",
    fp16=True,
    output_dir=os.path.abspath('./checkpoints'),  # Output directory for checkpoints
    num_train_epochs=EPOCHS,  # Total number of training epochs
    per_device_train_batch_size=BATCH_SIZE,  # Batch size per device during training
    per_device_eval_batch_size=BATCH_SIZE,  # Batch size for evaluation
    logging_dir='./logs',
    report_to='tensorboard',
    logging_strategy='steps',
    log_level='debug',
    logging_steps=100,
    gradient_accumulation_steps=1,
    do_eval=True,  # Force evaluation, otherwise it might not work
    eval_strategy='steps',  # Evaluate at regular step intervals
    eval_steps=500,  # Evaluate every 500 steps
    save_strategy='steps',
    save_steps=500,
    save_total_limit=10
)
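For completeness, this is roughly how everything is wired into the Trainer; model, train_ds and eval_ds below are placeholders for the actual model and tokenized datasets, not the exact code:

from transformers import Trainer

# Rough sketch of the training setup; `model`, `train_ds` and `eval_ds` stand in
# for the real model and tokenized datasets used in the script.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,  # these metrics should end up in TensorBoard
)
trainer.train()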

In the TensorBoard logs I cannot find anything related to the eval metric cards if I pass "max-autotune" as the compile mode. With "reduce-overhead" and with no compilation, the eval cards for speed, number of eval samples, etc. are there, but the metrics themselves are always missing.

A few things to note: the Trainer does log training metrics such as the loss correctly, so it can see the TensorBoard instance. The metrics do get calculated but are then discarded.
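A possible stopgap (not a fix) would be to intercept the metrics dict in a callback and write it to TensorBoard manually; a rough sketch, where the class name and log_dir are mine:

from torch.utils.tensorboard import SummaryWriter
from transformers import TrainerCallback

class EvalMetricsLogger(TrainerCallback):
    """Writes the metrics dict produced by evaluation straight to TensorBoard."""

    def __init__(self, log_dir='./logs'):
        self.writer = SummaryWriter(log_dir=log_dir)

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # `metrics` is the dict returned by the evaluation loop, including
        # whatever compute_metrics produced (keys are prefixed with 'eval_')
        if metrics:
            for name, value in metrics.items():
                if isinstance(value, (int, float)):
                    self.writer.add_scalar(name, value, state.global_step)

# attached with trainer.add_callback(EvalMetricsLogger()) before trainer.train()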

Expected behavior

I would expect the evaluation metrics returned by compute_metrics to show up in the TensorBoard logs.

LysandreJik commented 1 day ago

I added it to the Trainer issue tracker; if you have an idea about how to fix the problem, please feel free to go ahead and offer a PR! Thanks :hugs: