Closed: davidberard98 closed this issue 2 years ago
cc @anijain2305 @xuzhao9 - I recall having to handle the non-standard outputs from bert at the benchmark layer, but I don't remember having to handle them in the dynamo/fx layer. Are we calling/compiling the model differently now such that this shows up, or has something regressed?
@davidberard98 For your minimal repro, the error is because you missed super().__init__() in MyModule's constructor. Adding that makes it work.
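For illustration, a minimal sketch of the fix, assuming a simple module body (only the missing super().__init__() call comes from this thread; the layer and shapes are invented):

```python
import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()  # this was the missing call; nn.Module needs it before submodules are assigned
        self.linear = nn.Linear(8, 8)  # hypothetical layer

    def forward(self, x):
        return self.linear(x)

# Without super().__init__(), assigning self.linear raises
# "AttributeError: cannot assign module before Module.__init__() call".
print(MyModule()(torch.randn(2, 8)).shape)
```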
Can we close this one, @davidberard98?
Yes, I don't think this is an issue anymore.
DistributedDataParallel is a torch.nn module, but it doesn't conform to some of the expectations for torch.nn modules (e.g. that the return value is a tensor type). The branch at https://github.com/pytorch/torchdynamo/blob/main/torchdynamo/variables/nn_module.py#L190 is taken because is_allowed(mod.__class__) is true. Dynamo then errors out because it expects a tensor type but gets something else.

Repro: run https://github.com/pytorch/benchmark/blob/wconstab/ddp_experiments/ddp_experiments.py on hf_Bert with 2 nodes and the inductor backend, with pytorch at https://github.com/pytorch/pytorch/pull/83333 and dynamo at https://github.com/pytorch/torchdynamo/pull/628. In addition, patch pytorch by replacing https://github.com/pytorch/pytorch/blob/d05f07494a9a32c63f9218c0e703764a02033bb9/torch/nn/parallel/distributed.py#L981 with a nullcontext (to work around pytorch/pytorch#93668).
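For reference, a minimal sketch of what that nullcontext workaround amounts to (run_forward and forward_fn are illustrative names, not PyTorch APIs):

```python
import contextlib

def run_forward(forward_fn):
    # The workaround swaps the context manager at distributed.py#L981 for a
    # no-op: the forward body runs unchanged, with no setup/teardown that
    # dynamo would otherwise have to trace through.
    with contextlib.nullcontext():
        return forward_fn()

print(run_forward(lambda: "forward ran"))
```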
Error: see the full log at https://gist.github.com/davidberard98/e5054d628c0855cb560837600cd35399
This is my best effort at a minimal repro, but it fails with a different error.
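The repro snippet itself no longer appears on this page; below is a hedged reconstruction of the kind of repro described in this thread, not the author's original code (torch.compile stands in for the older standalone torchdynamo entry point, and all shapes and names besides MyModule are invented):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process gloo setup so DDP can initialize on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

class MyModule(nn.Module):
    def __init__(self):
        # NOTE: per the follow-up comment in this thread, the original repro
        # omitted this super().__init__() call, which caused its error.
        super().__init__()
        self.linear = nn.Linear(8, 8)

    def forward(self, x):
        # Dict-shaped output, like hf_Bert's ModelOutput: not a plain tensor,
        # which is what the nn_module handling discussed above expected.
        return {"logits": self.linear(x)}

model = DDP(MyModule())
compiled = torch.compile(model)  # the standalone torchdynamo repo used torchdynamo.optimize(...) instead
out = compiled(torch.randn(2, 8))
print(out["logits"].shape)
dist.destroy_process_group()
```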