intel / torch-xpu-ops


[E2E] Torchbench pyhpc_turbulent_kinetic_energy training accuracy failed #723

Open mengfei25 opened 1 month ago

mengfei25 commented 1 month ago

🐛 Describe the bug

torchbench_amp_bf16_training xpu train pyhpc_turbulent_kinetic_energy
Traceback (most recent call last):
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 4626, in run
    ) = runner.load_model(
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 302, in load_model
    benchmark = benchmark_cls(
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/benchmark/torchbenchmark/util/model.py", line 39, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/benchmark/torchbenchmark/models/pyhpc_turbulent_kinetic_energy/__init__.py", line 132, in __init__
    super().__init__(test=test, device=device, batch_size=batch_size, extra_args=extra_args)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/benchmark/torchbenchmark/util/model.py", line 137, in __init__
    self._determine_batch_size(batch_size)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/benchmark/torchbenchmark/util/model.py", line 262, in _determine_batch_size
    raise NotImplementedError(
NotImplementedError: Model's DEFAULT_TRAIN_BSIZE is not implemented.

model_fail_to_load

loading model: 0it [00:00, ?it/s]
loading model: 0it [00:07, ?it/s]

Versions

torch-xpu-ops: https://github.com/intel/torch-xpu-ops/commit/1d70431c072db889d9a47ea4956049fe340a426d
pytorch: d224857b3af5c9d5a3c7a48401475c09d90db296
device: pvc 1100, bundle: 0.5.3, driver: 803.61

chuanqi129 commented 1 month ago

This is a model script issue, not an XPU issue. @mengfei25, please check whether the A100 has the same issue.
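
To illustrate why the traceback points at the model script rather than the device backend, here is a minimal sketch of the failure mode. Only the `DEFAULT_TRAIN_BSIZE` attribute name, the `_determine_batch_size` method name, and the error message come from the traceback above. The class names, the `DEFAULT_EVAL_BSIZE` attribute, and the `try_load` helper are hypothetical stand-ins, not the actual torchbenchmark code.

```python
from typing import Optional


class BenchmarkModelSketch:
    """Hypothetical stand-in for a benchmark model base class."""

    # Subclasses are expected to override these; None means "not provided".
    DEFAULT_TRAIN_BSIZE: Optional[int] = None
    DEFAULT_EVAL_BSIZE: Optional[int] = None  # assumed counterpart for eval

    def __init__(self, test: str, device: str, batch_size: Optional[int] = None):
        self.test = test
        self.device = device  # stored only; never used to pick a batch size
        self.batch_size = self._determine_batch_size(batch_size)

    def _determine_batch_size(self, batch_size: Optional[int]) -> int:
        # An explicit batch size always wins over the class default.
        if batch_size is not None:
            return batch_size
        attr = "DEFAULT_TRAIN_BSIZE" if self.test == "train" else "DEFAULT_EVAL_BSIZE"
        default = getattr(self, attr)
        if default is None:
            raise NotImplementedError(f"Model's {attr} is not implemented.")
        return default


class EvalOnlyModel(BenchmarkModelSketch):
    """Hypothetical model script that defines no training default."""

    DEFAULT_EVAL_BSIZE = 1


def try_load(test: str, device: str) -> str:
    """Hypothetical helper: report whether the model can be constructed."""
    try:
        EvalOnlyModel(test=test, device=device)
        return "ok"
    except NotImplementedError as exc:
        return f"model_fail_to_load: {exc}"


if __name__ == "__main__":
    for device in ("xpu", "cuda"):
        print(device, try_load("train", device))
```

In this sketch the device string is never consulted before the exception is raised, so the `train` case fails identically for `xpu` and `cuda`. That is the kind of behavior a cross-check on the A100 would be expected to confirm if the real model script has the same gap.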

mengfei25 commented 1 month ago

The A100 has the same issue.