MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
After switching ...adamw/jax/submission.py to ...adamw/pytorch/submission.py, the run fails at torch.compile.
To reproduce:
1) Build an image from the Dockerfile below:
FROM nvcr.io/nvidia/pytorch:24.03-py3
RUN git clone https://github.com/mlcommons/algorithmic-efficiency/ && cd algorithmic-efficiency/ && git checkout 5b4914ff18f2bb28a01c5669285b6a001ea84111
RUN cd algorithmic-efficiency/ && python3 -m pip install -e '.[jax_cpu]' && python3 -m pip install -e '.[pytorch_gpu]' -f 'https://download.pytorch.org/whl/cu121' && python3 -m pip install -e '.[full]'
2) Run the example command in the container:
docker run --rm --net host --ipc host --gpus all -v /home/lucas/algorithmic-efficiency:/opt/project -it 11ad40ed5330 bash -c 'export PYTHONPATH="/opt/project/:$PYTHONPATH" ; cd /opt/project/ ; python3 submission_runner.py --framework=pytorch --workload=mnist --experiment_dir=$HOME/experiments --experiment_name=my_first_experiment --submission_path=reference_algorithms/paper_baselines/adamw/pytorch/submission.py --tuning_search_space=reference_algorithms/paper_baselines/adamw/tuning_search_space.json'
3) Observe the following unmodified traceback:
=============
== PyTorch ==
=============
NVIDIA Release 24.03 (build 85286408)
PyTorch Version 2.3.0a0+40ec155e58
Container image Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2024 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 12.4 driver version 550.54.14 with kernel driver version 535.104.12.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
ERROR:root:Unable to import wandb.
Traceback (most recent call last):
File "/opt/project/algorithmic_efficiency/logger_utils.py", line 26, in <module>
import wandb # pylint: disable=g-import-not-at-top
ModuleNotFoundError: No module named 'wandb'
I0405 11:10:55.978553 140153789482816 logger_utils.py:76] Creating experiment directory at /root/experiments/my_first_experiment/mnist_pytorch.
I0405 11:10:56.270130 140153789482816 submission_runner.py:561] Using RNG seed 2489964499
I0405 11:10:56.270576 140153789482816 submission_runner.py:570] --- Tuning run 1/1 ---
I0405 11:10:56.270626 140153789482816 submission_runner.py:575] Creating tuning directory at /root/experiments/my_first_experiment/mnist_pytorch/trial_1.
I0405 11:10:56.270741 140153789482816 logger_utils.py:92] Saving hparams to /root/experiments/my_first_experiment/mnist_pytorch/trial_1/hparams.json.
I0405 11:10:56.423277 140153789482816 submission_runner.py:215] Initializing dataset.
I0405 11:10:56.423400 140153789482816 submission_runner.py:226] Initializing model.
I0405 11:10:56.609794 140153789482816 submission_runner.py:264] Performing `torch.compile`.
I0405 11:10:57.714589 140153789482816 submission_runner.py:268] Initializing optimizer.
I0405 11:10:57.715128 140153789482816 submission_runner.py:275] Initializing metrics bundle.
I0405 11:10:57.715188 140153789482816 submission_runner.py:293] Initializing checkpoint and logger.
I0405 11:10:57.715620 140153789482816 submission_runner.py:313] Saving meta data to /root/experiments/my_first_experiment/mnist_pytorch/trial_1/meta_data_0.json.
fatal: detected dubious ownership in repository at '/opt/project'
To add an exception for this directory, call:
git config --global --add safe.directory /opt/project
I0405 11:10:57.950806 140153789482816 logger_utils.py:220] Unable to record git information. Continuing without it.
I0405 11:10:58.229494 140153789482816 submission_runner.py:317] Saving flags to /root/experiments/my_first_experiment/mnist_pytorch/trial_1/flags_0.json.
I0405 11:10:58.273115 140153789482816 submission_runner.py:327] Starting training loop.
I0405 11:10:58.482898 140153789482816 dataset_info.py:736] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: mnist/3.0.1
I0405 11:10:58.719100 140153789482816 dataset_info.py:578] Load dataset info from /tmp/tmpco1rexddtfds
I0405 11:10:58.723723 140153789482816 dataset_info.py:669] Fields info.[citation, splits, supervised_keys, module_name] from disk and from code do not match. Keeping the one from code.
I0405 11:10:58.724064 140153789482816 dataset_builder.py:593] Generating dataset mnist (/root/data/mnist/3.0.1)
Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /root/data/mnist/3.0.1...
I0405 11:10:58.867788 140153789482816 dataset_builder.py:640] Dataset mnist is hosted on GCS. It will automatically be downloaded to your
local data directory. If you'd instead prefer to read directly from our public
GCS bucket (recommended if you're running on GCP), you can instead pass
`try_gcs=True` to `tfds.load` or set `data_dir=gs://tfds-data/datasets`.
Dl Completed...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 20.77 file/s]
I0405 11:10:59.174530 140153789482816 dataset_info.py:578] Load dataset info from /root/data/mnist/3.0.1.incompleteX9SJH5
I0405 11:10:59.176377 140153789482816 dataset_info.py:669] Fields info.[citation, splits, supervised_keys, module_name, file_format] from disk and from code do not match. Keeping the one from code.
Dataset mnist downloaded and prepared to /root/data/mnist/3.0.1. Subsequent calls will reuse this data.
I0405 11:10:59.241586 140153789482816 logging_logger.py:49] Constructing tf.data.Dataset mnist for split train[:50000], from /root/data/mnist/3.0.1
Traceback (most recent call last):
File "/opt/project/submission_runner.py", line 712, in <module>
app.run(main)
File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/opt/project/submission_runner.py", line 680, in main
score = score_submission_on_workload(
File "/opt/project/submission_runner.py", line 585, in score_submission_on_workload
timing, metrics = train_once(workload, workload_name,
File "/opt/project/submission_runner.py", line 349, in train_once
optimizer_state, model_params, model_state = update_params(
File "/opt/project/reference_algorithms/paper_baselines/adamw/pytorch/submission.py", line 74, in update_params
logits_batch, new_model_state = workload.model_fn(params=current_model,
File "/opt/project/algorithmic_efficiency/workloads/mnist/mnist_pytorch/workload.py", line 170, in model_fn
logits_batch = model(augmented_and_preprocessed_input_batch['inputs'])
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 641, in _convert_frame
result = inner_convert(frame, cache_size, hooks, frame_state)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 389, in _convert_frame_assert
return _compile(
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 569, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 491, in compile_inner
out_code = transform_code_object(code, transform)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
transformations(instructions, code_options)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 458, in transform
tracer.run()
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 2074, in run
super().run()
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
and self.step()
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
getattr(self, inst.opname)(inst)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
return inner_fn(self, inst)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 1155, in CALL_FUNCTION_EX
self.call_function(fn, argsvars.items, kwargsvars.items)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 562, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/nn_module.py", line 302, in call_function
return wrap_fx_proxy(
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/builder.py", line 1187, in wrap_fx_proxy
return wrap_fx_proxy_cls(
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/builder.py", line 1274, in wrap_fx_proxy_cls
example_value = get_fake_value(proxy.node, tx)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1376, in get_fake_value
raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1337, in get_fake_value
return wrap_fake_exception(
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 916, in wrap_fake_exception
return fn()
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1338, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1410, in run_node
raise RuntimeError(fn_str + str(e)).with_traceback(e.__traceback__) from e
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1402, in run_node
return nnmodule(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 185, in forward
outputs = self.parallel_apply(replicas, inputs, module_kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 200, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/parallel_apply.py", line 110, in parallel_apply
output.reraise()
File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 694, in reraise
raise exception
torch._dynamo.exc.TorchRuntimeError: Failed running call_module fn(*(FakeTensor(..., device='cuda:0', size=(16, 1, 28, 28)),), **{}):
Caught AssertionError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/parallel_apply.py", line 85, in _worker
output = module(*input, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/project/algorithmic_efficiency/workloads/mnist/mnist_pytorch/workload.py", line 43, in forward
return self.net(x)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 215, in forward
input = module(input)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1113, in __torch_dispatch__
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 448, in __call__
return self._op(*args, **kwargs or {})
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1250, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1450, in dispatch
return decomposition_table[func](*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_prims_common/wrappers.py", line 229, in _fn
result = fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_decomp/decompositions.py", line 70, in inner
r = f(*tree_map(increase_prec, args), **tree_map(increase_prec, kwargs))
File "/usr/local/lib/python3.10/dist-packages/torch/_decomp/decompositions.py", line 1229, in addmm
out = alpha * torch.mm(mat1, mat2)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1250, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1540, in dispatch
with in_kernel_invocation_manager(self):
File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 914, in in_kernel_invocation_manager
assert meta_in_tls == prev_in_kernel, f"{meta_in_tls}, {prev_in_kernel}"
AssertionError: False, True
from user code:
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
The failure occurs when running the example command provided in the README.
Using --notorch_compile works as expected.
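As a minimal sketch of that workaround, the Python helper below assembles the same submission_runner.py invocation with --notorch_compile appended. The helper name build_command and its parameters are hypothetical and not part of the repository; the flags themselves are copied from the example command in this report. It only prints the command and does not run it.

```python
import os
import shlex
from typing import List


def build_command(experiment_dir: str = "", disable_compile: bool = True) -> List[str]:
    """Return the argv for the example run from this report.

    `build_command` is a hypothetical helper for illustration only.
    """
    # Default mirrors --experiment_dir=$HOME/experiments from the report.
    if not experiment_dir:
        experiment_dir = os.path.join(os.path.expanduser("~"), "experiments")

    args = [
        "python3",
        "submission_runner.py",
        "--framework=pytorch",
        "--workload=mnist",
        f"--experiment_dir={experiment_dir}",
        "--experiment_name=my_first_experiment",
        "--submission_path=reference_algorithms/paper_baselines/adamw/pytorch/submission.py",
        "--tuning_search_space=reference_algorithms/paper_baselines/adamw/tuning_search_space.json",
    ]
    if disable_compile:
        # Flag reported above to avoid the torch.compile failure.
        args.append("--notorch_compile")
    return args


if __name__ == "__main__":
    # Print a shell-quoted command line that can be pasted into the container.
    print(shlex.join(build_command()))
```

Calling build_command(disable_compile=False) yields the original failing command, which can help when comparing the two runs side by side.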