Delaunay opened this issue 6 days ago (status: Open)
* no training rate retrieved
* Error codes = 1, 1
* 1 exceptions found
* 1 x RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::268435456 (256)MB
| Traceback (most recent call last):
|   File "/homes/delaunap/hpu/results/venv/torch/bin/voir", line 8, in <module>
|     sys.exit(main())
|   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/voir/cli.py", line 128, in main
|     ov(sys.argv[1:] if argv is None else argv)
|   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/voir/phase.py", line 331, in __call__
|     self._run(*args, **kwargs)
|   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/voir/overseer.py", line 242, in _run
|     set_value(func())
|   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/voir/scriptutils.py", line 37, in <lambda>
|     return lambda: exec(mainsection, glb, glb)
|   File "/homes/delaunap/milabench/benchmarks/huggingface/bench/__main__.py", line 208, in <module>
|     main()
|   File "/homes/delaunap/milabench/benchmarks/huggingface/bench/__main__.py", line 204, in main
|     runner.train()
|   File "/homes/delaunap/milabench/benchmarks/huggingface/bench/__main__.py", line 120, in train
|     loss = self.step(data)
|   File "/homes/delaunap/milabench/benchmarks/huggingface/bench/__main__.py", line 96, in step
|     accelerator.mark_step()
|   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/habana_frameworks/torch/utils/internal.py", line 27, in wrapper
|     func(*args, **kwargs)
|   File "/homes/delaunap/hpu/results/venv/torch/lib/python3.10/site-packages/habana_frameworks/torch/core/step_closure.py", line 66, in mark_step
|     htcore._mark_step(device_str, sync)
| RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::268435456 (256)MB
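
For triage, the size in the error above can be pulled out of the message and checked: 268435456 bytes is 256 * 2**20, so the failed request was a single 256 MiB device allocation. A minimal sketch follows. The regular expression and the helper name `parse_failed_allocation` are hypothetical, inferred from this one message only, and are not part of milabench, voir, or habana_frameworks.

```python
import re

# Error line copied verbatim from the traceback above.
ERROR_LINE = (
    "RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM "
    "Allocation failed for size::268435456 (256)MB"
)

# Hypothetical pattern inferred from this single message; other
# PT_DEVMEM messages may be formatted differently.
PATTERN = re.compile(r"Allocation failed for size::(\d+) \((\d+)\)MB")


def parse_failed_allocation(line):
    """Return (size_in_bytes, size_in_mib) for a matching line, else None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    size_bytes = int(match.group(1))
    # Convert bytes to MiB (1 MiB = 2**20 bytes).
    return size_bytes, size_bytes / 2**20


size_bytes, size_mib = parse_failed_allocation(ERROR_LINE)
print(f"{size_bytes} bytes = {size_mib:g} MiB")
```

This only confirms the size of the request that failed. It does not show how much device memory was already in use when `mark_step()` ran.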
Eager Mode