Using distributed or parallel set-up in script?: Yes, part of the model is wrapped in torch.nn.DataParallel
Using GPU in script?: Yes
GPU type: Quadro RTX 6000
Who can help?
@muellerzr @SunMarc
Information
[ ] The official example scripts
[X] My own modified scripts
Tasks
[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)
Reproduction
I'm trying to log my model's evaluation metrics to TensorBoard so that I can monitor training.
My compute_metrics function looks like this:
```python
import evaluate
import numpy as np

# Metrics for the output are only loaded once
acc = evaluate.load("accuracy")
metrics = evaluate.combine(['precision', 'recall', 'f1'])

# Function to calculate metrics for evaluation
def compute_metrics(eval_pred):
    # Convert logits to predictions
    predictions = np.argmax(eval_pred.predictions, axis=-1)
    results = metrics.compute(predictions=predictions, references=eval_pred.label_ids, average='micro')
    results['accuracy'] = acc.compute(predictions=predictions, references=eval_pred.label_ids)['accuracy']
    return results
```
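The function itself works when called directly; here is a minimal sanity-check sketch with a dummy EvalPrediction (the shapes and values are purely illustrative):

```python
from transformers import EvalPrediction
import numpy as np

# Illustrative check: 4 samples, 3 classes of fake logits and labels
dummy = EvalPrediction(
    predictions=np.random.randn(4, 3),  # fake logits
    label_ids=np.array([0, 1, 2, 0]),   # fake labels
)
print(compute_metrics(dummy))
# prints a dict with 'precision', 'recall', 'f1', 'accuracy' keys
```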
These are my training arguments:
```python
training_args = TrainingArguments(
    torch_compile=True,
    torch_compile_mode="default",
    fp16=True,
    output_dir=os.path.abspath('./checkpoints'),  # Output directory for checkpoints
    num_train_epochs=EPOCHS,                      # Total number of training epochs
    per_device_train_batch_size=BATCH_SIZE,       # Batch size per device during training
    per_device_eval_batch_size=BATCH_SIZE,        # Batch size for evaluation
    logging_dir='./logs',
    report_to='tensorboard',
    logging_strategy='steps',
    log_level='debug',
    logging_steps=100,
    gradient_accumulation_steps=1,
    do_eval=True,                                 # Force evaluation, otherwise it might not work
    eval_strategy='steps',                        # Evaluate at regular step intervals
    eval_steps=500,                               # Evaluate every 500 steps
    save_strategy='steps',
    save_steps=500,
    save_total_limit=10
)
```
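For completeness, the Trainer is wired up in the usual way; in this sketch, model, train_dataset, and eval_dataset stand in for my own model and data:

```python
from transformers import Trainer

# `model`, `train_dataset`, and `eval_dataset` are placeholders
# for my own setup.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,  # the function shown above
)
trainer.train()
```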
In the TensorBoard logs I cannot find anything related to the eval metric cards if I pass "max-autotune" as the compile mode. With "reduce-overhead" and with no compilation, the eval cards corresponding to speed, number of eval samples, ... are there, but the metrics themselves are always missing.
A few things to note:
The trainer does log training metrics such as the loss correctly, so the TensorBoard integration itself is working.
The eval metrics do get calculated, but are then discarded.
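One way I confirmed the second point is with a minimal debugging callback that prints whatever reaches on_evaluate; this is a sketch using the standard TrainerCallback API, not part of my original script:

```python
from transformers import TrainerCallback

class PrintEvalMetrics(TrainerCallback):
    """Debugging aid: print the metrics dict the Trainer produces at eval time."""
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # `metrics` is the dict of evaluation results, with keys
        # prefixed by "eval_" (e.g. "eval_f1", "eval_accuracy").
        print(f"step {state.global_step}: {metrics}")

# Attached via:
# trainer = Trainer(..., callbacks=[PrintEvalMetrics()])
```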
Expected behavior
I would like the evaluation metrics to be logged to TensorBoard.