google-deepmind / reverb

Reverb is an efficient and easy-to-use data storage and transport system designed for machine learning research

Quickstart example deadlocks on cluster #103

Closed (simon-bachhuber closed this issue 2 years ago)

simon-bachhuber commented 2 years ago

When I try to run the quickstart example as shown in the README, the remote system deadlocks. The example runs just fine on my local PC. The remote system is part of the university cluster, and the machines all run Ubuntu. The following shows an IPython session executed on a node of the cluster.

[Image: Screenshot from 2022-06-09 09-20-51]

When executing client.insert(...), the system hangs. I could imagine this might be an issue with ports? But I have quite limited knowledge of this topic, and any pointers would be highly appreciated. Thanks :)

acassirer commented 2 years ago

Hey,

I think the problem here is a sneaky one. You can see in the logs that a checkpoint is loaded. When a checkpoint is loaded, the table uses the configuration of the original table (i.e. the original rate_limiter etc.), and I would guess that this table had a rate_limiter capable of blocking inserts. Now when you try to insert, the rate limiter blocks the insert forever (unless you sample concurrently from a different thread).
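To illustrate the blocking behaviour described above, here is a minimal toy sketch using only the Python standard library. It is not Reverb's implementation: the class `ToyRateLimitedTable` and its `max_ahead` parameter are hypothetical names, and the rule (an insert is allowed only while inserts are fewer than `max_ahead` ahead of samples) is a simplifying assumption chosen to show why an insert can wait until another thread samples.

```python
import threading


class ToyRateLimitedTable:
    """Toy model of a table whose rate limiter can block inserts.

    Assumption for illustration only: an insert proceeds while
    (inserts - samples) < max_ahead; otherwise it waits.
    """

    def __init__(self, max_ahead):
        self._max_ahead = max_ahead
        self._inserts = 0
        self._samples = 0
        self._items = []
        self._cond = threading.Condition()

    def insert(self, item, timeout=None):
        """Returns True if inserted, False if still blocked after `timeout`."""
        with self._cond:
            allowed = self._cond.wait_for(
                lambda: self._inserts - self._samples < self._max_ahead,
                timeout=timeout,
            )
            if not allowed:
                return False
            self._items.append(item)
            self._inserts += 1
            return True

    def sample(self):
        """Returns the most recent item and wakes up any waiting insert."""
        with self._cond:
            if not self._items:
                raise LookupError("table is empty")
            self._samples += 1
            self._cond.notify_all()
            return self._items[-1]


table = ToyRateLimitedTable(max_ahead=1)
first = table.insert("a", timeout=0.1)      # goes through immediately
blocked = table.insert("b", timeout=0.1)    # nobody samples, so this times out

# Sampling from a different thread lets the waiting insert proceed.
sampler = threading.Timer(0.05, table.sample)
sampler.start()
unblocked = table.insert("b", timeout=2.0)
sampler.join()
```

With `timeout=None` the second insert would wait indefinitely, which mirrors the hang seen from a single-threaded session.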

Some things to try:

simon-bachhuber commented 2 years ago

Thanks for your response!

I just deleted the /tmp/* folder but no luck. I also just gave the Server a Checkpoint with a new path, which also didn't work. Regarding client.server_info(): I cannot run this! As soon as I run it, the system hangs. It's like the client cannot communicate with the server.

qstanczyk commented 2 years ago

Could you check whether connecting to the server by specifying the IP address (not localhost) works? Maybe the server somehow listens on a different interface, and the "localhost loop" is prohibited?

simon-bachhuber commented 2 years ago

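This seems to always just work:

import socket

# Replace `localhost` in the client address with the machine's hostname.
hostname = socket.gethostname()
# e.g. reverb.Client(f'{hostname}:{server.port}') instead of f'localhost:{server.port}'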