Open kimishpatel opened 9 months ago
Hi Kimish, Should we use exir.capture or torch.export now that capture is deprecated?
I am having trouble with exir.capture.to_edge.
When I run it with my model as an argument to capture, it fails with the traceback listed below. It looks like the model fails to run to completion during capture, even though it runs to completion when executed outside of executorch.
I thought of adding constraints but I am not sure how to do that.
Please let me know if you need additional information
Thanks
Traceback (most recent call last):
File "/home/adonnini1/Development/ContextQSourceCode/NeuralNetworks/trajectory-prediction-transformers-master/executorch/exir/tracer.py", line 667, in dynamo_trace
return torchdynamo.export(
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 1213, in inner
result_traced = opt_f(*args, **kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 401, in _fn
return fn(*args, **kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 549, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 142, in _fn
return fn(*args, **kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 384, in _convert_frame_assert
return _compile(
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 570, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 221, in time_wrapper
r = func(*args, **kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 492, in compile_inner
out_code = transform_code_object(code, transform)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
transformations(instructions, code_options)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 462, in transform
tracer.run()
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2107, in run
super().run()
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 747, in run
and self.step()
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 710, in step
getattr(self, inst.opname)(inst)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 405, in wrapper
return inner_fn(self, inst)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 1143, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 582, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/variables/functions.py", line 307, in call_function
return super().call_function(tx, args, kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/variables/functions.py", line 261, in call_function
return super().call_function(tx, args, kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
return tx.inline_user_function_return(
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 618, in inline_user_function_return
result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2234, in inline_call
return cls.inline_call_(parent, func, args, kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2358, in inline_call_
tracer.run()
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 747, in run
and self.step()
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 710, in step
getattr(self, inst.opname)(inst)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 405, in wrapper
return inner_fn(self, inst)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 1143, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 582, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/variables/nn_module.py", line 309, in call_function
return wrap_fx_proxy(
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/variables/builder.py", line 1304, in wrap_fx_proxy
return wrap_fx_proxy_cls(
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/variables/builder.py", line 1391, in wrap_fx_proxy_cls
example_value = get_fake_value(proxy.node, tx)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1422, in get_fake_value
raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1383, in get_fake_value
return wrap_fake_exception(
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 952, in wrap_fake_exception
return fn()
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1384, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1483, in run_node
raise RuntimeError(fn_str + str(e)).with_traceback(e.__traceback__) from e
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1467, in run_node
return nnmodule(*args, **kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_subclasses/fake_tensor.py", line 1323, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_subclasses/fake_tensor.py", line 1529, in dispatch
return decomposition_table[func](*args, **kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_prims_common/wrappers.py", line 240, in _fn
result = fn(*args, **kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_decomp/decompositions.py", line 72, in inner
r = f(*tree_map(increase_prec, args), **tree_map(increase_prec, kwargs))
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_decomp/decompositions.py", line 1306, in addmm
out = alpha * torch.mm(mat1, mat2)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_subclasses/fake_tensor.py", line 1323, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_subclasses/fake_tensor.py", line 1621, in dispatch
r = func(*args, **kwargs)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_ops.py", line 516, in __call__
return self._op(*args, **kwargs or {})
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_meta_registrations.py", line 1891, in meta_mm
torch._check(
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/__init__.py", line 1028, in _check
_check_with(RuntimeError, cond, message)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/__init__.py", line 1011, in _check_with
raise error_type(message_evaluated)
torch._dynamo.exc.TorchRuntimeError: Failed running call_module L__self___encoder_embedding_linear_embd(*(FakeTensor(..., size=(0,), dtype=torch.float64),), **{}):
a and b must have same reduction dim, but got [1, 0] X [2, 512].
from user code:
File "/home/adonnini1/Development/ContextQSourceCode/NeuralNetworks/trajectory-prediction-transformers-master/model.py", line 473, in forward
enc_embed = self.encoder_embedding.forward(enc_input)
File "/home/adonnini1/Development/ContextQSourceCode/NeuralNetworks/trajectory-prediction-transformers-master/model.py", line 384, in forward
x = self.linear_embd(x) * math.sqrt(self.emb_size) # Shape = (B, N, C)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/adonnini1/Development/ContextQSourceCode/NeuralNetworks/trajectory-prediction-transformers-master/train.py", line 297, in <module>
print(exir.capture(m, (VAL_INPUT, DEC_INPUT, DEC_SOURCE_MASK, DEC_TARGET_MASK)).to_edge())
File "/home/adonnini1/Development/ContextQSourceCode/NeuralNetworks/trajectory-prediction-transformers-master/executorch/exir/capture/_capture.py", line 146, in capture
graph_module, _ = dynamo_trace(
File "/home/adonnini1/Development/ContextQSourceCode/NeuralNetworks/trajectory-prediction-transformers-master/executorch/exir/tracer.py", line 686, in dynamo_trace
raise InternalError(
executorch.exir.error.InternalError: torchdynamo internal error occured. Please see above stacktrace
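For readers following the traceback: the final error reports that linear_embd was traced with an input of size (0,), i.e. an empty tensor, while the [2, 512] operand implies the layer expects 2 input features. Below is a minimal plain-Python sketch of that shape rule. The helper name check_linear_input is hypothetical, not a PyTorch or ExecuTorch API. It could be run on the shapes of the example inputs before they are passed to exir.capture.

```python
def check_linear_input(input_shape, in_features):
    """Raise ValueError if a tensor of input_shape cannot feed a linear layer.

    Hypothetical helper for illustration only; it mirrors the rule that the
    last input dimension must equal the layer's number of input features.
    """
    if len(input_shape) == 0:
        raise ValueError("input must have at least one dimension")
    if input_shape[-1] != in_features:
        raise ValueError(
            f"last input dim is {input_shape[-1]}, "
            f"but the layer expects {in_features}"
        )


# Shapes taken from the error above: input of size (0,), layer mapping 2 -> 512.
try:
    check_linear_input((0,), in_features=2)
except ValueError as err:
    print(f"shape check failed: {err}")
```

If a check like this fails on VAL_INPUT, the example tensor handed to capture is probably empty at that point in train.py, which would be a data problem rather than a tracing problem.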
Hi, I produced an exported model with torch.export. If I save it using torch.export.save can I use the model file for inference in my Android application?
Put another way, what is the equivalent of open("tfmodel.pte", "wb").write(exir.capture(...).to_edge().to_executorch().buffer) when using torch.export?
Thanks
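A minimal sketch of what the torch.export-based equivalent might look like, assuming a recent ExecuTorch where executorch.exir exposes a free function to_edge that accepts the ExportedProgram returned by torch.export.export. The function name export_to_pte and the default path are placeholders. As far as I understand, a file written by torch.export.save is not the format the ExecuTorch runtime loads, so the program still has to be lowered and written as a .pte buffer.

```python
def export_to_pte(model, example_inputs, path="tfmodel.pte"):
    """Export an eager nn.Module with torch.export and write a .pte file.

    example_inputs is a tuple of example tensors, as for exir.capture.
    """
    # Imported lazily so this sketch can be loaded where torch or
    # executorch are not installed.
    import torch
    from executorch.exir import to_edge

    # 1. Capture the model graph as an ExportedProgram (ATen dialect).
    exported_program = torch.export.export(model, example_inputs)
    # 2. Lower the ATen dialect to the edge dialect.
    edge_program = to_edge(exported_program)
    # 3. Lower to the executorch dialect (out variants, memory planning, ...).
    executorch_program = edge_program.to_executorch()
    # 4. Serialize the buffer that the runtime loads.
    with open(path, "wb") as f:
        f.write(executorch_program.buffer)
    return path
```

With the names from this thread, the call would be export_to_pte(m, (VAL_INPUT, DEC_INPUT, DEC_SOURCE_MASK, DEC_TARGET_MASK)).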
@adonnini
Let me know if this answers your question.
Hi @kimishpatel thanks for the response. I did save the model exported with torch.export without any problems. And, I did read the examples and related tutorials (several times).
Unfortunately, exir.capture does not work for me, as you can see from the traceback I posted in my message yesterday (please see above).
I also tried to load a saved model (and its dictionary) and then use it as one of the arguments for exir.capture. It failed with the traceback below.
torch.export works with a saved model.
I cannot see from the traceback above what causes the exir.capture failure. What do you think? What should I do next?
File "/home/adonnini1/Development/ContextQSourceCode/NeuralNetworks/trajectory-prediction-transformers-master/train.py", line 334, in <module>
print(exir.capture(model_loaded, (VAL_INPUT, DEC_INPUT, DEC_SOURCE_MASK, DEC_TARGET_MASK)).to_edge())
File "/home/adonnini1/Development/ContextQSourceCode/NeuralNetworks/trajectory-prediction-transformers-master/executorch/exir/program/_program.py", line 168, in to_edge
return _to_edge(self, config)
File "/home/adonnini1/Development/ContextQSourceCode/NeuralNetworks/trajectory-prediction-transformers-master/executorch/exir/program/_program.py", line 283, in _to_edge
EXIRATenDialectVerifier()(ep.exported_program.graph_module)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_export/verifier.py", line 58, in __call__
self.check_valid(gm)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_export/verifier.py", line 117, in check_valid
self.check_valid_op(node.target)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/_export/verifier.py", line 166, in check_valid_op
raise SpecViolationError(
torch._export.verifier.SpecViolationError: Operator torch._ops.aten.detach.default is not Aten Canonical.
Oh, it seems aten.detach is not considered a core op. For now, set this, https://github.com/pytorch/executorch/blob/main/exir/capture/_config.py#L34, to False and try again. Note that you can also pass it via the config to to_edge. cc: @guangy10
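A sketch of passing the flag through the config rather than editing _config.py. It assumes the linked line is the _check_ir_validity field of EdgeCompileConfig. That name is an assumption and is not confirmed in this thread. It also uses the newer free function to_edge rather than exir.capture(...).to_edge(config).

```python
def to_edge_skipping_ir_check(exported_program):
    """Lower an ExportedProgram to the edge dialect without the IR check."""
    # Lazy import keeps the sketch loadable without executorch installed.
    from executorch.exir import EdgeCompileConfig, to_edge

    # Assumption: `_check_ir_validity` is the flag the link above points to.
    config = EdgeCompileConfig(_check_ir_validity=False)
    return to_edge(exported_program, compile_config=config)
```

Disabling the check only skips the verifier; ops such as aten.detach remain in the graph, so later passes can still reject them.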
I made the change you suggested. Now the code fails with the following error:
Traceback (most recent call last):
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/fx/passes/infra/pass_manager.py", line 270, in __call__
res = fn(module)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/fx/passes/infra/pass_base.py", line 41, in __call__
self.ensures(graph_module)
File "/home/adonnini1/Development/ContextQSourceCode/NeuralNetworks/trajectory-prediction-transformers-master/executorch/exir/passes/__init__.py", line 311, in ensures
raise RuntimeError(f"Missing out variants: {self.missing_out_vars}")
RuntimeError: Missing out variants: {'aten::alias'}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/adonnini1/Development/ContextQSourceCode/NeuralNetworks/trajectory-prediction-transformers-master/train.py", line 340, in <module>
open("tfmodel.pte", "wb").write(exir.capture(model_loaded, (VAL_INPUT, DEC_INPUT, DEC_SOURCE_MASK, DEC_TARGET_MASK)).to_edge().to_executorch().buffer)
File "/home/adonnini1/Development/ContextQSourceCode/NeuralNetworks/trajectory-prediction-transformers-master/executorch/exir/program/_program.py", line 181, in to_executorch
new_prog = ep._transform(*edge_to_executorch_passes(config))
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/export/exported_program.py", line 569, in _transform
res = pm(self.graph_module)
File "/home/adonnini1/anaconda3/lib/python3.9/site-packages/torch/fx/passes/infra/pass_manager.py", line 296, in __call__
raise Exception(msg) from e
Exception: An error occurred when running the 'ToOutVarPass' pass after the following passes: ['SpecPropPass', 'EdgeToBackendOpsPass', 'RemoveAssertAsyncPass', 'HintBasedSymShapeEvalPass']
Strange. @larryliu0820 can you take a look? alias doesn't have an out variant, but alias_copy does in native_functions.yaml. Not sure why functionalization is not generating alias_copy. Maybe @bdhirsh knows.
detach should be removed from the graph, not sure why it sticks. For alias, my understanding is functionalization may not replace it with an alias_copy if we never change the value of the alias result. @bdhirsh answered this: https://discuss.pytorch.org/t/aten-ir-and-mutation-in-place/172129/2
So I think alias can be removed because it's a no-op?
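If alias really is a no-op here, one way to drop it is a small graph rewrite that points every user of the alias node at the node's input and then erases it. The sketch below is hypothetical and not an existing ExecuTorch pass. It relies only on the torch.fx methods Node.replace_all_uses_with, Graph.erase_node and GraphModule.recompile, and it takes the set of no-op targets as a parameter.

```python
def remove_noop_nodes(graph_module, noop_targets):
    """Bypass and erase call_function nodes whose target is in noop_targets.

    Returns the number of nodes removed.
    """
    graph = graph_module.graph
    removed = 0
    # Copy the node list because nodes are erased while iterating.
    for node in list(graph.nodes):
        if node.op == "call_function" and node.target in noop_targets:
            # A no-op such as alias returns its first argument unchanged,
            # so every user can read that argument directly.
            node.replace_all_uses_with(node.args[0])
            graph.erase_node(node)
            removed += 1
    if removed:
        # Regenerate the module's Python code from the edited graph.
        graph_module.recompile()
    return removed
```

On a real graph this would be called as remove_noop_nodes(gm, {torch.ops.aten.alias.default}) before to_executorch. This is only safe under the assumption stated above, namely that nothing mutates the alias result.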
@kimishpatel , sorry to bother you. I was wondering when you think there will be an update on the issue https://github.com/pytorch/executorch/issues/1350 At this point, I am stuck as my app can load the model using the executorch runtime engine but cannot proceed further because of https://github.com/pytorch/executorch/issues/1350 Thanks
Apologies for the late response; I was on PTO. Let me follow up on the issue.
@kimishpatel sorry for coming back to you. I have received no response to the two issues that are blocking progress on my work.
https://github.com/pytorch/pytorch/issues/120219 and https://github.com/pytorch/executorch/issues/1350
I realize that we may still be during leave period. If that is the case, please let me know when I should touch base again.
Thanks for your patience and your help
@adonnini don't apologize. You have been very patient. Let me follow up and see what's happening.
Hi @kimishpatel, I hope you are well. I opened two issues, https://github.com/pytorch/executorch/issues/2204 and https://github.com/pytorch/executorch/issues/2163, both at the beginning of last week. To date, I have not received any feedback. I know your team is dealing with many issues (I can see the list of open issues getting longer). Would it be possible to let me know when someone will take a look at these two issues? I am getting closer to being able to run my models for inference from my Android application. I wish I could resolve these two problems, but I can't without your help.
@kimishpatel
@kimishpatel I don't know if I did something wrong, but issue https://github.com/pytorch/pytorch/issues/120219 has resurfaced even though I used the strict=False workaround.
Please note that for executorch I used the main branch, not release https://github.com/pytorch/executorch/releases/tag/v0.1.0
Yeah, I don't know the compatibility with v0.1.0. I
@kimishpatel Hi, in the next few weeks we will start test deployments of the Android application. I would love to have the model run-for-inference function using executorch running by then. I truly am always hesitant to contact you, knowing how much you have on your plate. I have not received any follow-up on these issues: https://github.com/pytorch/executorch/issues/2204 https://github.com/pytorch/executorch/issues/2163 https://github.com/pytorch/pytorch/issues/120219 It's been a few weeks for all of them. Please let me know if there is something I should be doing to help resolve these issues. Thanks
@kimishpatel I hope you are well. In a couple of weeks we will start deployment of my Android application. I would love to be able to include run for inference of the two models I am using to predict user location. Two issues, https://github.com/pytorch/executorch/issues/2163 and https://github.com/pytorch/pytorch/issues/120219, are blocking my progress. It's been a few weeks since I have heard anything about their resolution. @jansel tried to help me, suggesting that https://github.com/pytorch/pytorch/pull/123318 might also resolve https://github.com/pytorch/pytorch/issues/120219. Unfortunately, it did not. I realize that it's a matter of priorities and that the team is focusing on the upcoming executorch release. Please let me know if there is anything I can do. Any update would be greatly appreciated. Thanks for your patience as you keep receiving my messages.
@adonnini thanks for bringing this back. Let me raise it internally and see what traction we get. I truly appreciate how you have been trying to make this work.
@kimishpatel A quick update. @angelayi has been very helpful in trying to solve https://github.com/pytorch/pytorch/issues/120219. We are making progress. I have not heard back regarding https://github.com/pytorch/executorch/issues/2163. I am waiting for a response from @lucylq (@kirklandsign asked her to take a look). Thanks
OK, let me ping them again.
@kimishpatel Thanks for your help. I really appreciate it! I think https://github.com/pytorch/executorch/issues/2163 is pretty close to being resolved. @kirklandsign was very helpful. Resolving https://github.com/pytorch/executorch/issues/2163 brought up https://github.com/pytorch/executorch/issues/1350 again, which I have been waiting to hear about since February. Thanks
@adonnini no problem and thank you for your patience. I really appreciate it. I think @kirklandsign should be able to help you resolve it. If not, please bring it to my attention again. Thanks
Thanks! With regards to https://github.com/pytorch/executorch/issues/1350 I was communicating with @mcr229 who told me that the fix to the issue would be available by around the end of February.
@adonnini do you know if the error happens only with the Android app, or have you also tried running the model via a standalone binary, like executor runner, https://github.com/pytorch/executorch/tree/main/examples/portable/executor_runner?
@kimishpatel I am not familiar with executor runner. When I ran the model for inference outside of executorch it worked as expected. Is this what you were asking?
By the way, I ran the model for inference via my Android app lowering it using pytorch mobile (torchscript)
Thanks
Where are we?
Exporting a PyTorch model for the ExecuTorch runtime goes through multiple AoT (Ahead of Time) stages. At a high level there are 3 stages.
1. exir.capture: This captures the model’s graph using ATen IR.
2. to_edge: translate the ATen dialect into the edge dialect with dtype specialization.
3. to_executorch: translate the edge dialect to the executorch dialect, along with running various passes, e.g. out variant, memory planning etc., to make the model ready for the executorch runtime.
Two important stops in the model’s journey to the executorch runtime are: a) quantization and b) delegation.
Entry points for quantization are between step 1 and 2. Thus quantization APIs consume ATen IR and are not edge/executorch specific.
Entry points for delegation are between step 2 and 3. Thus delegation APIs consume edge dialect IR.
Need for the export API change.
Quantization workflow is built on top of exir.capture, which is built on top of the torch.export API. In order to support QAT, such exported models need to work with eager mode autograd. Current export, of step 1 above, emits ATen IR with core ATen ops. This is not autograd safe, meaning it is not safe to run such an exported model in eager mode (e.g. in Python) and expect the autograd engine to work. Thus training APIs, such as calculating loss on the output and calling backward on the loss, are not guaranteed to work with this IR.
It is important that quantization APIs, for QAT as well as PTQ, work on the same IR, because a) it provides better UX to the users and b) it provides a single IR that backend specific quantizers (read more here) can target.
For this reason we aligned on two stage export, that is rooted in the idea of progressive lowering. The two stages are:
Output of stage 1 is autograd safe and thus models exported at 1 can be trained via eager mode autograd engine.
New export API.
We are rolling out changes related to new export API in three stages.
Stage 1 (landed):
As shown in the figure below, exir.capture is broken down into:
capture_pre_autograd_graph
exir.capture
Example of exporting model without quantization:
Example of exporting model with quantization:
You can see these changes here and here for how quantization APIs fit in.
Stage 2 (coming soon):
We will deprecate exir.capture in favor of directly using torch.export. More updates on this will be posted soon.
Stage 3 (timeline is to be determined):
The two APIs listed in stage 1 will be renamed to:
torch.export
to_core_aten
torch.export will export graph with ATen IR, and full ATen opset, that is autograd safe, while to_core_aten will transform output of torch.export into core ATen IR that is NOT autograd safe.
Example of exporting model without quantization:
Example of exporting model with quantization:
Timeline for this is to be determined, but this will NOT happen before PyTorch conference on 10/16/2023.
Why this change?
There are a couple of reasons: This change aligns well with the long term state where capture_pre_autograd_graph is replaced with torch.export to obtain autograd safe aten IR, and the current use of exir.capture (or torch.export when replaced) will be replaced with to_core_aten to obtain ATen IR with core ATen opset.
In the long term, export for quantization won't be separate. Quantization will be an optional step, like delegation, in the export journey. Thus aligning with that in the short term helps because:
Why the change now?
To minimize the migration pain later and have better alignment with the long term changes.