Mesh reduce was never meant to exchange huge amounts of data, but we can fix that by fixing the gRPC options. The issue with big data, especially on Colab, is that there is no ring-reduce algorithm behind it, so all the data comes to the master.
As for the gRPC limit, though, we do send GBs of data in one shot to TF, but TF's gRPC has a special configuration that our gRPC does not.
We need to fix that, as the options in __init__.py refer to the TF gRPC init.
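For intuition on why the master hits the limit, here is a back-of-the-envelope check. The per-core element count and the float32 dtype are assumptions for illustration, chosen to match the 5602816-byte payload reported later in this thread:

```python
# Illustrative only: with no ring reduce, every worker serializes its whole
# tensor to the master, so the payload scales with dataset size rather than
# staying under gRPC's 4 MiB default (4194304 bytes).
num_preds = 1_400_704   # assumed per-core prediction count
bytes_per_elem = 4      # assumed float32 payload
payload = num_preds * bytes_per_elem
print(payload, payload > 4 * 1024 * 1024)  # 5602816 True
```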
Alternatively, we could do some sharding, either in the run_glue_tpu.py script before syncing the tensors, or directly in xm.mesh_reduce? A sketch of the script-side variant follows.
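A minimal sketch of what that user-side sharding could look like; the helper name, shard count, and tag scheme are hypothetical, and it assumes NumPy arrays like the ones reduced in the report below:

```python
import numpy as np
import torch_xla.core.xla_model as xm

def sharded_mesh_reduce(tag, data, num_shards=8):
    # Split the payload so each rendezvous stays under the gRPC cap;
    # each shard needs its own tag so the rendezvous points don't collide.
    shards = np.array_split(data, num_shards)
    reduced = [
        xm.mesh_reduce(f"{tag}_shard_{i}", shard, np.concatenate)
        for i, shard in enumerate(shards)
    ]
    # Note: the result interleaves shards across cores, so apply the same
    # helper to both preds and labels to keep them aligned.
    return np.concatenate(reduced)
```

Usage would mirror the existing calls, e.g. `preds = sharded_mesh_reduce("eval_preds", preds)`.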
@jysohn23 No need to. This is a trivial fix in mesh_service.
I still see this error on the 1.6 release:
Received message larger than max (302809448 vs. 4194304) (8)
@davidel
Still the same error:
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:299 : Failed to meet rendezvous 'tokens_ttl': Received message larger than max (13266904 vs. 4194304) (8)
Update: adding grpc.max_send_message_length=1000000000,grpc.max_receive_message_length=1000000000 in torch_xla/__init__.py didn't help.
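For reference, setting the same options through the TF_GRPC_DEFAULT_OPTIONS environment variable before torch_xla is imported is another thing worth ruling out. This is a sketch only: depending on the wheel version, _setup_grpc() may overwrite the variable, in which case editing torch_xla/__init__.py itself is the only option.

```python
import os

# Must be set before `import torch_xla`; _setup_grpc() may overwrite it
# in some releases, so this is a best-effort check, not a guaranteed fix.
os.environ['TF_GRPC_DEFAULT_OPTIONS'] = (
    'grpc.max_send_message_length=1000000000,'
    'grpc.max_receive_message_length=1000000000')

import torch_xla  # picks up the options above, if the wheel honors them
```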
Can you reopen the ticket?
Our configuration is still there:
Can you try using nightly?
@davidel we are using these instructions: https://github.com/pytorch/xla#DockerImage

In the VM, all of the following have the config:
./usr/share/torch-xla-1.6/pytorch/xla/third_party/xla_client/mesh_service.cc
./usr/share/torch-xla-nightly/pytorch/xla/third_party/xla_client/mesh_service.cc
./usr/share/torch-xla-1.5/pytorch/xla/third_party/xla_client/mesh_service.cc

We are using the torch-xla-1.6 env, but still get Received message larger than max (13266904 vs. 4194304) (8)
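One way to narrow this down (this only checks the Python-side options, not what the prebuilt mesh_service.cc binary was compiled with) is to print the gRPC options the running process actually sees, since the sources under /usr/share are not necessarily what the installed wheel was built from:

```python
import os
import torch_xla  # triggers _setup_grpc() in __init__.py

# If the patched options are missing here, the running wheel is not the
# one built from the edited sources under /usr/share.
print(os.environ.get('TF_GRPC_DEFAULT_OPTIONS', '<unset>'))
```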
🐛 Bug
torch_xla.core.xla_model.mesh_reduce(...) results in a RuntimeError:
tensorflow/compiler/xla/xla_client/mesh_service.cc:243 : Failed to meet rendezvous 'eval_lguids': Received message larger than max (5602816 vs. 4194304) (8)
Note for context that I'm adapting @jysohn23's run_glue_tpu.py to MS-MARCO's passage-ranking dataset (much larger than the individual GLUE datasets). I believe the equivalent lines of code in run_glue_tpu.py are 271-272:
preds = xm.mesh_reduce("eval_preds", preds, np.concatenate)
out_label_ids = xm.mesh_reduce("eval_out_label_ids", out_label_ids, np.concatenate)
This probably has something to do with gRPC's max send and receive limits. Adding grpc.max_send_message_length=1000000000,grpc.max_receive_message_length=1000000000 to os.environ['TF_GRPC_DEFAULT_OPTIONS'] in _setup_grpc() in the torch_xla/__init__.py file might help.
Building from source doesn't work on Colab, so I wasn't able to test whether it does. I'm also unsure whether gRPC can even take limits as large as 1 GB (note that I used a subset of the dataset, which corresponds to the 5602816 bytes above, so a limit at least that large would be required by my calculations), or whether there is any other way to circumvent this issue. I do believe a 4 MB limit is a bit too small for larger datasets, though. Thanks!
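A sketch of what that change could look like inside _setup_grpc(); the keepalive options shown are assumed from the 1.5/1.6 sources, and the two max_*_message_length entries are the proposed, untested addition:

```python
import os

def _setup_grpc():
  # Setup gRPC options to correctly talk to TPU backends.
  options = [
      'grpc.keepalive_time_ms=60000',  # 1 min
      'grpc.keepalive_timeout_ms=14400000',  # 4 hrs
      'grpc.http2.max_pings_without_data=0',  # unlimited
      'grpc.http2.min_ping_interval_without_data_ms=300000',  # 5 min
      # Proposed addition (1 GB is the reporter's guess, not verified):
      'grpc.max_send_message_length=1000000000',
      'grpc.max_receive_message_length=1000000000',
  ]
  os.environ['TF_GRPC_DEFAULT_OPTIONS'] = ','.join(options)
```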
Environment