pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

xm.mesh_reduce results in RuntimeError concerning message size #1924

Closed ronakice closed 4 years ago

ronakice commented 4 years ago

๐Ÿ› Bug

torch_xla.core.xla_model.mesh_reduce(...) results in a RuntimeError:

tensorflow/compiler/xla/xla_client/mesh_service.cc:243 : Failed to meet rendezvous 'eval_lguids': Received message larger than max (5602816 vs. 4194304) (8)

For context, I'm adapting @jysohn23's run_glue_tpu.py to MS MARCO's passage ranking dataset (much larger than the individual GLUE datasets). I believe the equivalent lines of code in run_glue_tpu.py would be 271-272:

preds = xm.mesh_reduce("eval_preds", preds, np.concatenate)
out_label_ids = xm.mesh_reduce("eval_out_label_ids", out_label_ids, np.concatenate)

This probably has something to do with gRPC's max send and receive limits. Adding grpc.max_send_message_length=1000000000,grpc.max_receive_message_length=1000000000 to os.environ['TF_GRPC_DEFAULT_OPTIONS'] in _setup_grpc() in the torch_xla/__init__.py file might help.
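Roughly what I have in mind, as a sketch (I'm assuming TF_GRPC_DEFAULT_OPTIONS takes a comma-separated list of gRPC options and has to be set before torch_xla is imported, since _setup_grpc() runs at import time):

import os

# Hypothetical workaround: raise the gRPC message-size limits before torch_xla
# configures its channels. The exact option format expected here is an assumption.
os.environ['TF_GRPC_DEFAULT_OPTIONS'] = (
    'grpc.max_send_message_length=1000000000,'
    'grpc.max_receive_message_length=1000000000')

import torch_xla.core.xla_model as xm  # import only after setting the variable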

Building from source doesn't work on Colab, so I wasn't able to test whether this helps. I'm also unsure whether gRPC can even take limits as large as 1 GB (note that I used a subset of the dataset, which corresponds to the 5602816 bytes above, so a limit of at least that size would be required), or whether there is another way to circumvent this issue. Either way, I do believe a 4 MB limit is a bit too small for larger datasets. Thanks!

Environment

dlibenzi commented 4 years ago

Mesh reduce was never meant to exchange huge amounts of data, but we can fix that by fixing the GRPC options. The issue with big data, especially on Colab, is that there is no ring-reduce algorithm behind it, so all the data comes to the master.
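To make the scaling concrete (a sketch of the semantics only, with made-up numbers, not the mesh_service code): every core's payload funnels through the master in one rendezvous, so the message grows with both the per-core data size and the number of cores.

import numpy as np

def mesh_reduce_semantics(per_core_values, reduce_fn=np.concatenate):
    # What mesh_reduce does logically: gather every core's payload (all of it
    # passes through the master), then apply the reduce function to the list.
    gathered = list(per_core_values)
    return reduce_fn(gathered)

# Hypothetical example: 8 cores, 175k float32 predictions each.
per_core = [np.zeros(175_000, dtype=np.float32) for _ in range(8)]
print(mesh_reduce_semantics(per_core).nbytes)  # 5,600,000 bytes, past the 4 MB cap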

dlibenzi commented 4 years ago

As for the GRPC limit though, we do send GBs of data in one shot to TF, but TF's GRPC has special configuration that our GRPC does not. We need to fix that, as the __init__.py options refer to the TF GRPC init.

dlibenzi commented 4 years ago

Our changes need to go here:

https://github.com/pytorch/xla/blob/78299228ddcd9c4139b8a38a8054212f14c23cc8/third_party/xla_client/mesh_service.cc#L167

Using:

https://grpc.github.io/grpc/cpp/classgrpc__impl_1_1_server_builder.html#af2b6e41d7ea8654b87a5b51853edd6e1
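For illustration only (Python gRPC rather than the C++ grpc::ServerBuilder we actually use in mesh_service.cc), these are the kind of limits that need raising; -1 means unlimited:

import grpc
from concurrent import futures

# Illustrative only: the real fix sets the equivalent limits on the C++ server
# builder in mesh_service.cc. The address below is a placeholder.
options = [
    ('grpc.max_send_message_length', -1),
    ('grpc.max_receive_message_length', -1),
]
server = grpc.server(futures.ThreadPoolExecutor(max_workers=4), options=options)
channel = grpc.insecure_channel('localhost:49152', options=options)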

jysohn23 commented 4 years ago

Alternatively, we could do some sharding, either in the run_tpu_glue.py script before syncing the tensors, or directly inside xm.mesh_reduce?
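Roughly along these lines (an untested sketch; the helper name and shard count are arbitrary, and the concatenation order differs from a single mesh_reduce, though it stays consistent if preds and labels are sharded the same way):

import numpy as np
import torch_xla.core.xla_model as xm

def sharded_mesh_reduce(tag, data, num_shards=8):
    # Reduce the array in pieces so each rendezvous payload stays below the
    # 4 MB gRPC default, then stitch the reduced pieces back together.
    shards = np.array_split(data, num_shards)
    reduced = [xm.mesh_reduce('{}_shard_{}'.format(tag, i), shard, np.concatenate)
               for i, shard in enumerate(shards)]
    return np.concatenate(reduced)

# preds = sharded_mesh_reduce('eval_preds', preds)
# out_label_ids = sharded_mesh_reduce('eval_out_label_ids', out_label_ids)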

dlibenzi commented 4 years ago

@jysohn23 No need to. This is a trivial fix in mesh_service.

dlibenzi commented 4 years ago

https://github.com/pytorch/xla/pull/1925

world2vec commented 4 years ago

I still see this error on the 1.6 release: Received message larger than max (302809448 vs. 4194304) (8)

8key commented 4 years ago

@davidel Still the same error:

RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:299 : Failed to meet rendezvous 'tokens_ttl': Received message larger than max (13266904 vs. 4194304) (8)

Adding grpc.max_send_message_length=1000000000,grpc.max_receive_message_length=1000000000 in torch_xla/__init__.py didn't help.

Can you reopen the ticket?

davidel commented 4 years ago

Our configuration is still there:

https://github.com/pytorch/xla/blob/2cd4f0724a251f23ecf684faab1d129b806880fa/third_party/xla_client/mesh_service.cc#L263

Can you try using nightly?

8key commented 4 years ago

@davidel We are using these instructions: https://github.com/pytorch/xla#DockerImage. In the VM, all of the following have the config:

./usr/share/torch-xla-1.6/pytorch/xla/third_party/xla_client/mesh_service.cc
./usr/share/torch-xla-nightly/pytorch/xla/third_party/xla_client/mesh_service.cc
./usr/share/torch-xla-1.5/pytorch/xla/third_party/xla_client/mesh_service.cc

We are using the torch-xla-1.6 env, but still get: Received message larger than max (13266904 vs. 4194304) (8)