[bug] regression tests failing on nightlies

felipemello1 commented 1 month ago

Our regression tests have been failing for a bit over a week: https://github.com/pytorch/torchtune/actions/workflows/regression_test.yaml

The reason is that the workflow runs against the PyTorch nightlies, where importing NF4Tensor from torchao fails:

torchtune/models/phi3/__init__.py:7: in <module>
    from ._component_builders import lora_phi3, phi3  # noqa
torchtune/models/phi3/_component_builders.py:13: in <module>
    from torchtune.modules import (
torchtune/modules/__init__.py:8: in <module>
    from .common_utils import reparametrize_as_dtype_state_dict_post_hook
torchtune/modules/common_utils.py:12: in <module>
    from torchao.dtypes.nf4tensor import NF4Tensor
3/envs/test/lib/python3.11/site-packages/torchao/__init__.py:31: in <module>
    from torchao.quantization import (
3/envs/test/lib/python3.11/site-packages/torchao/quantization/__init__.py:7: in <module>
    from .smoothquant import *  # noqa: F403
3/envs/test/lib/python3.11/site-packages/torchao/quantization/smoothquant.py:18: in <module>
    import torchao.quantization.quant_api as quant_api
3/envs/test/lib/python3.11/site-packages/torchao/quantization/quant_api.py:45: in <module>
    from .autoquant import autoquant, AutoQuantizableLinearWeight
3/envs/test/lib/python3.11/site-packages/torchao/quantization/autoquant.py:18: in <module>
    from torch._inductor.runtime.runtime_utils import do_bench
E   ImportError: cannot import name 'do_bench' from 'torch._inductor.runtime.runtime_utils' (/home/ec2-user/actions-runner/_work/torchtune/torchtune/3/envs/test/lib/python3.11/site-packages/torch/_inductor/runtime/runtime_utils.py)

The fix seems to be simple in torchao: replace

from torch._inductor.runtime.runtime_utils import do_bench

with

if torch.__version__ > pytorch_version_x:
    from torch._inductor.runtime.benchmarking import benchmarker
    do_bench = benchmarker.benchmark_gpu
else:
    from torch._inductor.runtime.runtime_utils import do_bench
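
One caveat with the snippet above: comparing `torch.__version__` as a string can misorder releases (for example, "2.10.0" sorts before "2.5.0" lexicographically). A possible alternative is to skip the version check and try each import location in turn. The sketch below is only an illustration of that idea, not what torchao actually does; `import_first_available` and `resolve_do_bench` are hypothetical helper names, and the two torch paths are the ones quoted in this issue.

```python
import importlib
from typing import Any, Iterable, Tuple


def import_first_available(candidates: Iterable[Tuple[str, str]]) -> Any:
    """Return the first object resolvable from (module name, dotted attribute) pairs.

    Hypothetical helper: each candidate is tried in order, and the first one
    whose module imports and whose attribute chain exists is returned.
    """
    errors = []
    for module_name, attr_path in candidates:
        try:
            obj = importlib.import_module(module_name)
            for part in attr_path.split("."):
                obj = getattr(obj, part)
            return obj
        except (ImportError, AttributeError) as exc:
            errors.append(f"{module_name}:{attr_path} -> {exc!r}")
    raise ImportError("no candidate could be imported: " + "; ".join(errors))


def resolve_do_bench() -> Any:
    """Try the newer location first, then the older one (paths as quoted in this issue)."""
    return import_first_available(
        [
            ("torch._inductor.runtime.benchmarking", "benchmarker.benchmark_gpu"),
            ("torch._inductor.runtime.runtime_utils", "do_bench"),
        ]
    )
```

Calling `resolve_do_bench()` lazily (rather than at import time) would also keep `import torchao` from failing outright when neither location exists.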

But I am not sure if I am missing something, e.g. should we pin versions or run regression on nightlies?

cc: @ebsmothers @msaroufim (tagging you in case you have an opinion from the torchao point of view)

tikikun commented 4 weeks ago

There is no do_bench function in torch._inductor.autotune_process.

felipemello1 commented 4 weeks ago

@tikikun, isn't this it? https://github.com/pytorch/pytorch/blob/efc6e8457a221c6e70265fe895f8bc418d73aa0f/torch/_inductor/autotune_process.py#L508

edit: oh, I didn't pay attention to the content. It raises NotImplementedError :/

So I guess we need to find the PR that removed do_bench and see what it replaced the function with, or check whether torchao can use something else instead.

tikikun commented 4 weeks ago

> @tikikun, isn't this it? https://github.com/pytorch/pytorch/blob/efc6e8457a221c6e70265fe895f8bc418d73aa0f/torch/_inductor/autotune_process.py#L508

That is a method on a class, not a module-level function.
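
A quick way to check this kind of thing without reading the source is to ask where a name is actually defined. The sketch below is only an illustration: `find_callable` is a hypothetical helper, and `toy_autotune` / `ToyRequest` are made-up stand-ins rather than real torch names.

```python
import inspect
import types


def find_callable(module, name):
    """List where `name` is defined: "module" for module level, or a class name."""
    hits = []
    if callable(getattr(module, name, None)):
        hits.append("module")
    for cls_name, cls in inspect.getmembers(module, inspect.isclass):
        # Only consider classes defined in this module, and only names
        # defined directly on the class (not inherited ones).
        if cls.__module__ == module.__name__ and name in vars(cls):
            hits.append(cls_name)
    return hits


# Toy module that mimics the situation: do_bench exists only as a method.
toy = types.ModuleType("toy_autotune")
exec(
    "class ToyRequest:\n"
    "    def do_bench(self):\n"
    "        raise NotImplementedError\n"
    "\n"
    "def helper():\n"
    "    return 1\n",
    toy.__dict__,
)

print(find_callable(toy, "do_bench"))  # defined on the class only
print(find_callable(toy, "helper"))    # defined at module level
```

With a real install, the same helper could be pointed at an imported module to see whether `from module import name` can work at all.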

felipemello1 commented 4 weeks ago

thanks for pointing it out. Here is the PR that changed it: https://github.com/pytorch/pytorch/pull/132827

[bug] regression tests failing on nightlies #1325