microsoft / superbenchmark

A validation and profiling tool for AI infrastructure
https://aka.ms/superbench
MIT License

What is the proper way to write a rule.yaml for result summary? #666

Open DK-DARKmatter opened 5 days ago

DK-DARKmatter commented 5 days ago

What's the issue, what's expected?: I am trying to write a result summary rule file for the resnet101 raw data generated from the tutorial, but I get the following warning:

RuleBase: get metrics failed - model-benchmarks

Here's the summary_rule.yaml:

version: v0.11
superbench:
  rules:
    resnet:
      statistics:
        - mean
        - p90
        - min
        - max
      aggregate: False
      categories: Models
      metrics:
        - model-benchmarks/pytorch-resnet101/float16_train_step_time

How to reproduce it?:

sb result summary --data-file results-summary.jsonl --rule-file summary_rule.yaml --output-file-format md --output-dir ${something}

Log message or snapshot?: [rule_base.py:75][WARNING] RuleBase: get metrics failed - model-benchmarks

Additional information:

DK-DARKmatter commented 5 days ago

OK, it turns out the correct metric name is resnet_models/pytorch-resnet101/fp16_train_step_time. If I have multiple nodes and GPUs, can I get a result summary for each individual node and each individual GPU?
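
For anyone hitting the same warning: one way to check which metric names the raw data actually contains, before writing the rule file, is to list the keys in the data file. This is only a rough sketch, not part of SuperBench. It assumes each non-empty line of results-summary.jsonl is a JSON object whose keys are the metric names (plus possibly other fields such as a node identifier), and list_metric_names is a hypothetical helper name.

```python
import json


def list_metric_names(jsonl_path, substring=""):
    """Return the sorted, de-duplicated keys found across all JSON lines.

    Assumption: every non-empty line is a JSON object keyed by metric name.
    Only keys containing `substring` are kept (the default keeps all keys).
    """
    names = set()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            names.update(key for key in record if substring in key)
    return sorted(names)
```

Calling list_metric_names("results-summary.jsonl", "resnet101") would then show the exact names to copy into the metrics list of the rule file.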