mlcommons / algorithmic-efficiency

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
https://mlcommons.org/en/groups/research-algorithms/
Apache License 2.0

Example fails at torch.compile #754

Open ClashLuke opened 4 months ago

ClashLuke commented 4 months ago

I ran the example command provided in the README:

python3 submission_runner.py \
    --framework=pytorch \
    --workload=mnist \
    --experiment_dir=$HOME/experiments \
    --experiment_name=my_first_experiment \
    --submission_path=reference_algorithms/paper_baselines/adamw/pytorch/submission.py \
    --tuning_search_space=reference_algorithms/paper_baselines/adamw/tuning_search_space.json

After switching ...adamw/jax/submission.py to ...adamw/pytorch/submission.py, the run fails at torch.compile.


To reproduce: 1) build an image from the Dockerfile below:

FROM nvcr.io/nvidia/pytorch:24.03-py3
RUN git clone https://github.com/mlcommons/algorithmic-efficiency/ && cd algorithmic-efficiency/ && git checkout 5b4914ff18f2bb28a01c5669285b6a001ea84111
RUN cd algorithmic-efficiency/ && python3 -m pip install -e '.[jax_cpu]' && python3 -m pip install -e '.[pytorch_gpu]' -f 'https://download.pytorch.org/whl/cu121' && python3 -m pip install -e '.[full]'

2) run the example command in the container

docker run --rm --net host --ipc host --gpus all -v /home/lucas/algorithmic-efficiency:/opt/project -it 11ad40ed5330 bash -c 'export PYTHONPATH="/opt/project/:$PYTHONPATH" ; cd /opt/project/ ; python3 submission_runner.py --framework=pytorch --workload=mnist --experiment_dir=$HOME/experiments --experiment_name=my_first_experiment --submission_path=reference_algorithms/paper_baselines/adamw/pytorch/submission.py --tuning_search_space=reference_algorithms/paper_baselines/adamw/tuning_search_space.json'

3) receive the following unmodified traceback

=============
== PyTorch ==
=============

NVIDIA Release 24.03 (build 85286408)
PyTorch Version 2.3.0a0+40ec155e58
Container image Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2024 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.4 driver version 550.54.14 with kernel driver version 535.104.12.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

ERROR:root:Unable to import wandb.
Traceback (most recent call last):
  File "/opt/project/algorithmic_efficiency/logger_utils.py", line 26, in <module>
    import wandb  # pylint: disable=g-import-not-at-top
ModuleNotFoundError: No module named 'wandb'
I0405 11:10:55.978553 140153789482816 logger_utils.py:76] Creating experiment directory at /root/experiments/my_first_experiment/mnist_pytorch.
I0405 11:10:56.270130 140153789482816 submission_runner.py:561] Using RNG seed 2489964499
I0405 11:10:56.270576 140153789482816 submission_runner.py:570] --- Tuning run 1/1 ---
I0405 11:10:56.270626 140153789482816 submission_runner.py:575] Creating tuning directory at /root/experiments/my_first_experiment/mnist_pytorch/trial_1.
I0405 11:10:56.270741 140153789482816 logger_utils.py:92] Saving hparams to /root/experiments/my_first_experiment/mnist_pytorch/trial_1/hparams.json.
I0405 11:10:56.423277 140153789482816 submission_runner.py:215] Initializing dataset.
I0405 11:10:56.423400 140153789482816 submission_runner.py:226] Initializing model.
I0405 11:10:56.609794 140153789482816 submission_runner.py:264] Performing `torch.compile`.
I0405 11:10:57.714589 140153789482816 submission_runner.py:268] Initializing optimizer.
I0405 11:10:57.715128 140153789482816 submission_runner.py:275] Initializing metrics bundle.
I0405 11:10:57.715188 140153789482816 submission_runner.py:293] Initializing checkpoint and logger.
I0405 11:10:57.715620 140153789482816 submission_runner.py:313] Saving meta data to /root/experiments/my_first_experiment/mnist_pytorch/trial_1/meta_data_0.json.
fatal: detected dubious ownership in repository at '/opt/project'
To add an exception for this directory, call:

    git config --global --add safe.directory /opt/project
I0405 11:10:57.950806 140153789482816 logger_utils.py:220] Unable to record git information. Continuing without it.
I0405 11:10:58.229494 140153789482816 submission_runner.py:317] Saving flags to /root/experiments/my_first_experiment/mnist_pytorch/trial_1/flags_0.json.
I0405 11:10:58.273115 140153789482816 submission_runner.py:327] Starting training loop.
I0405 11:10:58.482898 140153789482816 dataset_info.py:736] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: mnist/3.0.1
I0405 11:10:58.719100 140153789482816 dataset_info.py:578] Load dataset info from /tmp/tmpco1rexddtfds
I0405 11:10:58.723723 140153789482816 dataset_info.py:669] Fields info.[citation, splits, supervised_keys, module_name] from disk and from code do not match. Keeping the one from code.
I0405 11:10:58.724064 140153789482816 dataset_builder.py:593] Generating dataset mnist (/root/data/mnist/3.0.1)
Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /root/data/mnist/3.0.1...
I0405 11:10:58.867788 140153789482816 dataset_builder.py:640] Dataset mnist is hosted on GCS. It will automatically be downloaded to your
local data directory. If you'd instead prefer to read directly from our public
GCS bucket (recommended if you're running on GCP), you can instead pass
`try_gcs=True` to `tfds.load` or set `data_dir=gs://tfds-data/datasets`.

Dl Completed...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 20.77 file/s]
I0405 11:10:59.174530 140153789482816 dataset_info.py:578] Load dataset info from /root/data/mnist/3.0.1.incompleteX9SJH5
I0405 11:10:59.176377 140153789482816 dataset_info.py:669] Fields info.[citation, splits, supervised_keys, module_name, file_format] from disk and from code do not match. Keeping the one from code.
Dataset mnist downloaded and prepared to /root/data/mnist/3.0.1. Subsequent calls will reuse this data.
I0405 11:10:59.241586 140153789482816 logging_logger.py:49] Constructing tf.data.Dataset mnist for split train[:50000], from /root/data/mnist/3.0.1
Traceback (most recent call last):
  File "/opt/project/submission_runner.py", line 712, in <module>
    app.run(main)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/opt/project/submission_runner.py", line 680, in main
    score = score_submission_on_workload(
  File "/opt/project/submission_runner.py", line 585, in score_submission_on_workload
    timing, metrics = train_once(workload, workload_name,
  File "/opt/project/submission_runner.py", line 349, in train_once
    optimizer_state, model_params, model_state = update_params(
  File "/opt/project/reference_algorithms/paper_baselines/adamw/pytorch/submission.py", line 74, in update_params
    logits_batch, new_model_state = workload.model_fn(params=current_model,
  File "/opt/project/algorithmic_efficiency/workloads/mnist/mnist_pytorch/workload.py", line 170, in model_fn
    logits_batch = model(augmented_and_preprocessed_input_batch['inputs'])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 641, in _convert_frame
    result = inner_convert(frame, cache_size, hooks, frame_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 389, in _convert_frame_assert
    return _compile(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 569, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 491, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 458, in transform
    tracer.run()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 2074, in run
    super().run()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
    and self.step()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
    getattr(self, inst.opname)(inst)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
    return inner_fn(self, inst)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 1155, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars.items)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 562, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/nn_module.py", line 302, in call_function
    return wrap_fx_proxy(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/builder.py", line 1187, in wrap_fx_proxy
    return wrap_fx_proxy_cls(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/builder.py", line 1274, in wrap_fx_proxy_cls
    example_value = get_fake_value(proxy.node, tx)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1376, in get_fake_value
    raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1337, in get_fake_value
    return wrap_fake_exception(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 916, in wrap_fake_exception
    return fn()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1338, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1410, in run_node
    raise RuntimeError(fn_str + str(e)).with_traceback(e.__traceback__) from e
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1402, in run_node
    return nnmodule(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 185, in forward
    outputs = self.parallel_apply(replicas, inputs, module_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 200, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/parallel_apply.py", line 110, in parallel_apply
    output.reraise()
  File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 694, in reraise
    raise exception
torch._dynamo.exc.TorchRuntimeError: Failed running call_module fn(*(FakeTensor(..., device='cuda:0', size=(16, 1, 28, 28)),), **{}):
Caught AssertionError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/parallel_apply.py", line 85, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/project/algorithmic_efficiency/workloads/mnist/mnist_pytorch/workload.py", line 43, in forward
    return self.net(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1113, in __torch_dispatch__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 448, in __call__
    return self._op(*args, **kwargs or {})
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1250, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1450, in dispatch
    return decomposition_table[func](*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_prims_common/wrappers.py", line 229, in _fn
    result = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_decomp/decompositions.py", line 70, in inner
    r = f(*tree_map(increase_prec, args), **tree_map(increase_prec, kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/_decomp/decompositions.py", line 1229, in addmm
    out = alpha * torch.mm(mat1, mat2)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1250, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1540, in dispatch
    with in_kernel_invocation_manager(self):
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 914, in in_kernel_invocation_manager
    assert meta_in_tls == prev_in_kernel, f"{meta_in_tls}, {prev_in_kernel}"
AssertionError: False, True

from user code:
   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
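The traceback above is long, but only a few of its frames come from the repository itself; the rest are inside torch and absl. A minimal sketch for picking out the repository frames, assuming the project is mounted at /opt/project as in the docker command above. The names `FRAME_RE`, `project_frames`, and `SAMPLE` are hypothetical helpers for illustration and are not part of the benchmark code.

```python
import re

# Matches the standard CPython traceback frame header:
#   File "<path>", line <n>, in <function>
FRAME_RE = re.compile(
    r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)'
)


def project_frames(traceback_text, project_root="/opt/project/"):
    """Return (path, line, function) for frames located under project_root."""
    frames = []
    for match in FRAME_RE.finditer(traceback_text):
        if match.group("path").startswith(project_root):
            frames.append(
                (match.group("path"), int(match.group("line")), match.group("func"))
            )
    return frames


# A few frame headers copied from the traceback in this report.
SAMPLE = '''
  File "/opt/project/submission_runner.py", line 349, in train_once
  File "/opt/project/reference_algorithms/paper_baselines/adamw/pytorch/submission.py", line 74, in update_params
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 185, in forward
  File "/opt/project/algorithmic_efficiency/workloads/mnist/mnist_pytorch/workload.py", line 43, in forward
'''

for path, line, func in project_frames(SAMPLE):
    print(f"{path}:{line} in {func}")
```

On the sample, this keeps the three /opt/project frames and drops the two library frames. Note that the last repository frame (workload.py, `forward`) is reached through `torch/nn/parallel/data_parallel.py`, so the compiled call passes through a DataParallel wrapper in this run.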

Using --notorch_compile works as expected.
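For reference, the workaround can be sketched as a small helper that assembles the same command with or without the flag. This is only an illustration under the assumption that appending `--notorch_compile` is the only change needed, as reported above; `build_runner_command` is a hypothetical name, and the script only prints the command rather than running it.

```python
def build_runner_command(torch_compile=True):
    """Assemble the example command from this report as an argument list."""
    args = [
        "python3",
        "submission_runner.py",
        "--framework=pytorch",
        "--workload=mnist",
        "--experiment_dir=$HOME/experiments",
        "--experiment_name=my_first_experiment",
        "--submission_path=reference_algorithms/paper_baselines/adamw/pytorch/submission.py",
        "--tuning_search_space=reference_algorithms/paper_baselines/adamw/tuning_search_space.json",
    ]
    if not torch_compile:
        # Flag taken from the report: skips the torch.compile step.
        args.append("--notorch_compile")
    return args


# Print a line that can be pasted into a shell ($HOME is left for the shell to expand).
print(" ".join(build_runner_command(torch_compile=False)))
```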