llvm / torch-mlir

The Torch-MLIR project aims to provide first class support from the PyTorch ecosystem to the MLIR ecosystem.

Issue generating heavydep tests #1057

Open gpetters94 opened 2 years ago

gpetters94 commented 2 years ago

I'm unable to generate the heavydep tests at the top of the main branch using the script in build_tools. It fails with the following output:

Traceback (most recent call last):
  File "/home/george/.pyenv/versions/3.8-dev-debug/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/george/.pyenv/versions/3.8-dev-debug/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/george/mlir-npcomp/build_tools/torchscript_e2e_heavydep_tests/main.py", line 12, in <module>
    from . import train_models
  File "/home/george/mlir-npcomp/build_tools/torchscript_e2e_heavydep_tests/train_models.py", line 143, in <module>
    neural_net_ts = generate_graph(neural_net_model, (input, ), training_fn)
  File "/home/george/mlir-npcomp/build_tools/torchscript_e2e_heavydep_tests/train_models.py", line 65, in generate_graph
    fx_g = make_fx(training_fn,
  File "/home/george/mlir-npcomp/heavy_venv/lib/python3.8/site-packages/torch/fx/experimental/proxy_tensor.py", line 285, in wrapped
    t = dispatch_trace(wrap_key(f, args), tracer=fx_tracer, concrete_args=tuple(phs))
  File "/home/george/mlir-npcomp/heavy_venv/lib/python3.8/site-packages/torch/fx/experimental/proxy_tensor.py", line 178, in dispatch_trace
    graph = tracer.trace(root, concrete_args)
  File "/home/george/mlir-npcomp/heavy_venv/lib/python3.8/site-packages/torch/fx/_symbolic_trace.py", line 714, in trace
    (self.create_arg(fn(*args)),),
  File "/home/george/mlir-npcomp/heavy_venv/lib/python3.8/site-packages/torch/fx/_symbolic_trace.py", line 549, in flatten_fn
    tree_out = root_fn(*tree_args)
  File "/home/george/mlir-npcomp/heavy_venv/lib/python3.8/site-packages/torch/fx/experimental/proxy_tensor.py", line 202, in wrapped
    out = f(*tree_args)
  File "/home/george/mlir-npcomp/build_tools/torchscript_e2e_heavydep_tests/train_models.py", line 131, in training_fn
    optim.step()
  File "/home/george/mlir-npcomp/heavy_venv/lib/python3.8/site-packages/torch/optim/optimizer.py", line 114, in wrapper
    return func(*args, **kwargs)
  File "/home/george/mlir-npcomp/heavy_venv/lib/python3.8/site-packages/torch/autograd/profiler.py", line 451, in __exit__
    torch.ops.profiler._record_function_exit(self.handle)
  File "/home/george/mlir-npcomp/heavy_venv/lib/python3.8/site-packages/torch/_ops.py", line 148, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: Expected temporary cpp type wrapper of type at::RecordFunction

At @makslevental's suggestion I ran sed -i.bak -E 's/if not hooked/if not True/g' /home/george/mlir-npcomp/heavy_venv/lib/python3.8/site-packages/torch/optim/optimizer.py, which fixes that error, but it then fails with the following:

Traceback (most recent call last):
  File "/home/george/.pyenv/versions/3.8-dev-debug/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/george/.pyenv/versions/3.8-dev-debug/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/george/mlir-npcomp/build_tools/torchscript_e2e_heavydep_tests/main.py", line 12, in <module>
    from . import train_models
  File "/home/george/mlir-npcomp/build_tools/torchscript_e2e_heavydep_tests/train_models.py", line 143, in <module>
    neural_net_ts = generate_graph(neural_net_model, (input, ), training_fn)
  File "/home/george/mlir-npcomp/build_tools/torchscript_e2e_heavydep_tests/train_models.py", line 76, in generate_graph
    ts_g = torch.jit.script(fx_g)
  File "/home/george/mlir-npcomp/heavy_venv/lib/python3.8/site-packages/torch/jit/_script.py", line 1286, in script
    return torch.jit._recursive.create_script_module(
  File "/home/george/mlir-npcomp/heavy_venv/lib/python3.8/site-packages/torch/jit/_recursive.py", line 476, in create_script_module
    return create_script_module_impl(nn_module, concrete_type, stubs_fn)
  File "/home/george/mlir-npcomp/heavy_venv/lib/python3.8/site-packages/torch/jit/_recursive.py", line 542, in create_script_module_impl
    create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
  File "/home/george/mlir-npcomp/heavy_venv/lib/python3.8/site-packages/torch/jit/_recursive.py", line 393, in create_methods_and_properties_from_stubs
    concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
RuntimeError: 
attribute lookup is not defined on builtin:
  File "<eval_with_key>.2", line 5
def forward(self, params_1, params_2, params_3, params_4, args_1):
    t_default = torch.ops.aten.t.default(params_1)
                ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    addmm_default = torch.ops.aten.addmm.default(params_2, args_1, t_default);  t_default = None
    relu_default = torch.ops.aten.relu.default(addmm_default);  addmm_default = None

To fix that I replaced ts_g = torch.jit.script(fx_g) in build_tools/torchscript_e2e_heavydep_tests/train_models.py with ts_g = torch.jit.trace(fx_g, inputs), but that fails with an argument mismatch:

forward() missing 4 required positional arguments: 'params_2', 'params_3', 'params_4', and 'args_1'
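
For what it's worth, the mismatch is expected: make_fx lifts the model parameters into explicit forward() arguments (params_1 through params_4) alongside args_1, so torch.jit.trace needs example values for all of them, not just the original input. A minimal, untested sketch of what that could look like with the names already used in train_models.py, assuming the lifted parameters follow the order of neural_net_model.parameters():

# Untested sketch: supply example values for the lifted parameters as well as
# the original input so that the traced forward() signature is satisfied.
# This only addresses the argument mismatch, nothing else.
example_inputs = tuple(neural_net_model.parameters()) + (input,)
ts_g = torch.jit.trace(fx_g, example_inputs)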

Needless to say, these band-aid solutions aren't working. Maks said this looked like an upstream bug, but I wanted to file this issue to get more eyes on it and advice on how to move forward.
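
As an aside on the optimizer.py workaround above: the first traceback shows that the Optimizer.step wrapper enters torch.autograd.profiler.record_function, whose __exit__ calls torch.ops.profiler._record_function_exit, and that op is what make_fx fails to dispatch. A less invasive variant than sed-editing site-packages, sketched below and untested, would be to stub record_function out from the test script before the optimizer is constructed. This assumes optimizer.py resolves the symbol through torch.autograd.profiler at call time; verify against the installed torch version.

# Untested sketch: make the profiler wrapper around Optimizer.step a no-op so
# that make_fx never has to trace torch.ops.profiler._record_function_exit.
import contextlib
import torch

torch.autograd.profiler.record_function = lambda name: contextlib.nullcontext()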

silvasean commented 2 years ago

Can you identify which test is failing and when it was added and if it ever worked?

gpetters94 commented 2 years ago

> Can you identify which test is failing and when it was added and if it ever worked?

From my testing it seems to be only the two tests in train_models.py, namely the basic NeuralNet training and the BERT training. Commenting them out fixes the issue.

gpetters94 commented 2 years ago

It was added in this commit. I haven't been able to make it work even after reverting to that commit, but I'll ask @pashu123 whether he verified it when he wrote it.

zincnode commented 2 years ago

Hi @gpetters94, @silvasean. I also ran into the "attribute lookup is not defined on builtin" problem; is there any solution or idea? I'm trying to export the forward and backward computation graphs for ResNet-50 (based on the torch dialect).
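
For reference, here is a hypothetical reconstruction (not the actual 004.py) of the pattern the log below points at: functorch's aot_module with forward/backward compiler callbacks that try to torch.jit.script the captured FX graphs.

import torch
import torchvision.models as models
from functorch.compile import aot_module

def print_graph(name):
    def f(fx_g, example_inputs):
        # Hypothetical reconstruction; the real script may differ in details.
        print(f"==== {name} graph ====")
        fx_g.graph.print_tabular()
        # This is the call that fails: the FX nodes target aten OpOverloads
        # such as torch.ops.aten.convolution.default, which torch.jit.script
        # cannot resolve ("attribute lookup is not defined on builtin").
        f_script = torch.jit.script(fx_g)
        print(f_script.graph)
        return fx_g
    return f

aot_module(models.resnet50(pretrained=True),
           print_graph("forward"),
           print_graph("backward"))(torch.randn(1, 3, 200, 200))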

log:

Traceback (most recent call last):
  File "004.py", line 37, in <module>
    aot_module(models.resnet50(pretrained=True), print_graph("forward"), print_graph("backward"))(torch.randn(1,3,200,200))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.local/lib/python3.8/site-packages/functorch/_src/aot_autograd.py", line 702, in forward
    return compiled_f(
  File "/root/.local/lib/python3.8/site-packages/functorch/_src/aot_autograd.py", line 621, in returned_function
    compiled_fn = create_aot_dispatcher_function(
  File "/root/.local/lib/python3.8/site-packages/functorch/_src/aot_autograd.py", line 355, in create_aot_dispatcher_function
    aot_dispatch_autograd(flat_fn, fake_flat_tensor_args, aot_config)
  File "/root/.local/lib/python3.8/site-packages/functorch/_src/aot_autograd.py", line 259, in aot_dispatch_autograd
    compiled_fw = aot_config.fw_compiler(fw_module, flat_args)
  File "004.py", line 17, in f
    f_script = torch.jit.script(fx_g)
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_script.py", line 1286, in script
    return torch.jit._recursive.create_script_module(
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_recursive.py", line 476, in create_script_module
    return create_script_module_impl(nn_module, concrete_type, stubs_fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_recursive.py", line 542, in create_script_module_impl
    create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_recursive.py", line 393, in create_methods_and_properties_from_stubs
    concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
RuntimeError:
attribute lookup is not defined on builtin:
  File "<eval_with_key>.1", line 5
def forward(self, primals_1, primals_2, primals_3, primals_4, primals_5, primals_6, primals_7, primals_8, primals_9, primals_10, primals_11, primals_12, primals_13, primals_14, primals_15, primals_16, primals_17, primals_18, primals_19, primals_20, primals_21, primals_22, primals_23, primals_24, primals_25, primals_26, primals_27, primals_28, primals_29, primals_30, primals_31, primals_32, primals_33, primals_34, primals_35, primals_36, primals_37, primals_38, primals_39, primals_40, primals_41, primals_42, primals_43, primals_44, primals_45, primals_46, primals_47, primals_48, primals_49, primals_50, primals_51, primals_52, primals_53, primals_54, primals_55, primals_56, primals_57, primals_58, primals_59, primals_60, primals_61, primals_62, primals_63, primals_64, primals_65, primals_66, primals_67, primals_68, primals_69, primals_70, primals_71, primals_72, primals_73, primals_74, primals_75, primals_76, primals_77, primals_78, primals_79, primals_80, primals_81, primals_82, primals_83, primals_84, primals_85, primals_86, primals_87, primals_88, primals_89, primals_90, primals_91, primals_92, primals_93, primals_94, primals_95, primals_96, primals_97, primals_98, primals_99, primals_100, primals_101, primals_102, primals_103, primals_104, primals_105, primals_106, primals_107, primals_108, primals_109, primals_110, primals_111, primals_112, primals_113, primals_114, primals_115, primals_116, primals_117, primals_118, primals_119, primals_120, primals_121, primals_122, primals_123, primals_124, primals_125, primals_126, primals_127, primals_128, primals_129, primals_130, primals_131, primals_132, primals_133, primals_134, primals_135, primals_136, primals_137, primals_138, primals_139, primals_140, primals_141, primals_142, primals_143, primals_144, primals_145, primals_146, primals_147, primals_148, primals_149, primals_150, primals_151, primals_152, primals_153, primals_154, primals_155, primals_156, primals_157, primals_158, primals_159, primals_160, primals_161, primals_162, primals_163, primals_164, primals_165, primals_166, primals_167, primals_168, primals_169, primals_170, primals_171, primals_172, primals_173, primals_174, primals_175, primals_176, primals_177, primals_178, primals_179, primals_180, primals_181, primals_182, primals_183, primals_184, primals_185, primals_186, primals_187, primals_188, primals_189, primals_190, primals_191, primals_192, primals_193, primals_194, primals_195, primals_196, primals_197, primals_198, primals_199, primals_200, primals_201, primals_202, primals_203, primals_204, primals_205, primals_206, primals_207, primals_208, primals_209, primals_210, primals_211, primals_212, primals_213, primals_214, primals_215, primals_216, primals_217, primals_218, primals_219, primals_220, primals_221, primals_222, primals_223, primals_224, primals_225, primals_226, primals_227, primals_228, primals_229, primals_230, primals_231, primals_232, primals_233, primals_234, primals_235, primals_236, primals_237, primals_238, primals_239, primals_240, primals_241, primals_242, primals_243, primals_244, primals_245, primals_246, primals_247, primals_248, primals_249, primals_250, primals_251, primals_252, primals_253, primals_254, primals_255, primals_256, primals_257, primals_258, primals_259, primals_260, primals_261, primals_262, primals_263, primals_264, primals_265, primals_266, primals_267, primals_268, primals_269, primals_270, primals_271, primals_272, primals_273, primals_274, primals_275, primals_276, primals_277, primals_278, primals_279, primals_280, 
primals_281, primals_282, primals_283, primals_284, primals_285, primals_286, primals_287, primals_288, primals_289, primals_290, primals_291, primals_292, primals_293, primals_294, primals_295, primals_296, primals_297, primals_298, primals_299, primals_300, primals_301, primals_302, primals_303, primals_304, primals_305, primals_306, primals_307, primals_308, primals_309, primals_310, primals_311, primals_312, primals_313, primals_314, primals_315, primals_316, primals_317, primals_318, primals_319, primals_320, primals_321):
    convolution_default = torch.ops.aten.convolution.default(primals_321, primals_1, None, [2, 2], [3, 3], [1, 1], False, [0, 0], 1)
                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    native_batch_norm_default = torch.ops.aten.native_batch_norm.default(convolution_default, primals_2, primals_3, primals_162, primals_163, True, 0.1, 1e-05);  primals_3 = None
    getitem = native_batch_norm_default[0]

silvasean commented 2 years ago

Does your code work if you do torch.ops.aten.convolution instead of torch.ops.aten.convolution.default?
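
If it is easier to rewrite the captured graph than to change how it is produced, one option (an untested sketch, not something from this thread) is to replace each OpOverload target such as torch.ops.aten.convolution.default with its OpOverloadPacket (torch.ops.aten.convolution) before scripting. The torch._ops.OpOverload and overloadpacket attributes used below exist in recent PyTorch releases but should be checked against the version in use.

import torch
import torch.fx

def strip_default_overloads(fx_g: torch.fx.GraphModule) -> torch.fx.GraphModule:
    # Untested sketch: rewrite OpOverload targets (e.g. aten.convolution.default)
    # to their OpOverloadPacket (aten.convolution), which torch.jit.script can
    # resolve, then regenerate the module's Python code.
    for node in fx_g.graph.nodes:
        if node.op == "call_function" and isinstance(node.target, torch._ops.OpOverload):
            node.target = node.target.overloadpacket
    fx_g.graph.lint()
    fx_g.recompile()
    return fx_g

# e.g. in the failing compiler callback:
#   f_script = torch.jit.script(strip_default_overloads(fx_g))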