talmolab / sleap

A deep learning framework for multi-animal pose tracking.
https://sleap.ai

ZMQ port hardcoded and not editable via the GUI #1774

Closed shrivaths16 closed 5 months ago

shrivaths16 commented 6 months ago

[WIP]

Currently there is no option to choose the ZMQ ports via the GUI; they are hardcoded to tcp://127.0.0.1:9000 for the controller address and tcp://127.0.0.1:9001 for the publish address. When multiple SLEAP instances are open and training at the same time, the second one cannot bind to these ports, leading to "ZMQError: Address already in use" as mentioned in discussion #1751.

In order to solve this, we need to make some changes as listed below:

talmo commented 6 months ago

The ZMQConfig is used here to set up the TrainingControllerZMQ and ProgressReporterZMQ:

https://github.com/talmolab/sleap/blob/18aad91d10508127cc4f8a398432c43e661546a9/sleap/nn/training.py#L396-L412

Which is called from setup_output_callbacks here:

https://github.com/talmolab/sleap/blob/18aad91d10508127cc4f8a398432c43e661546a9/sleap/nn/training.py#L497

Which is called from Trainer._setup_outputs here:

https://github.com/talmolab/sleap/blob/18aad91d10508127cc4f8a398432c43e661546a9/sleap/nn/training.py#L851-L853

It uses the TrainingJobConfig to specify the ZMQ address/port, which derives from the loaded config file.

When calling from the CLI, we already override some parts of the config with CLI-provided options, for example to enable or disable ZMQ entirely:

https://github.com/talmolab/sleap/blob/18aad91d10508127cc4f8a398432c43e661546a9/sleap/nn/training.py#L1924-L1928

Next to this block, we should also support specifying the ZMQ port explicitly and overwriting the appropriate config fields:

job_config.outputs.zmq.controller_address
job_config.outputs.zmq.publish_address
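A minimal sketch of what such an override could look like. The flag names `--controller_port` and `--publish_port`, the helper `apply_zmq_port_overrides`, and the `SimpleNamespace` stand-in for the loaded `TrainingJobConfig` are all hypothetical and only illustrate the idea; only the two config field names come from the discussion above.

```python
import argparse
from types import SimpleNamespace


def make_parser():
    parser = argparse.ArgumentParser()
    # Hypothetical flags; they are not part of the existing CLI.
    parser.add_argument("--controller_port", type=int, default=None)
    parser.add_argument("--publish_port", type=int, default=None)
    return parser


def apply_zmq_port_overrides(job_config, args, host="127.0.0.1"):
    """Overwrite the ZMQ addresses only for ports given on the command line."""
    if args.controller_port is not None:
        job_config.outputs.zmq.controller_address = (
            f"tcp://{host}:{args.controller_port}"
        )
    if args.publish_port is not None:
        job_config.outputs.zmq.publish_address = f"tcp://{host}:{args.publish_port}"
    return job_config


# Stand-in for the loaded config (assumption: the real object exposes
# the same attribute path, as described above).
job_config = SimpleNamespace(
    outputs=SimpleNamespace(
        zmq=SimpleNamespace(
            controller_address="tcp://127.0.0.1:9000",
            publish_address="tcp://127.0.0.1:9001",
        )
    )
)

args = make_parser().parse_args(["--controller_port", "9100"])
apply_zmq_port_overrides(job_config, args)
print(job_config.outputs.zmq.controller_address)  # tcp://127.0.0.1:9100
print(job_config.outputs.zmq.publish_address)  # tcp://127.0.0.1:9001
```

Leaving both flags defaulted to `None` keeps the values from the config file untouched unless the user explicitly asks for a different port.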

talmo commented 6 months ago

Another nice option could be to try to automatically detect a free port using Socket.bind_to_random_port().

We still need to know what the port is in order to pass it to the backend, so just calling this by itself wouldn't work, but we could use it to write a utility function to discover a free port, e.g.:


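import zmq


def find_free_port():
    """Return a TCP port on 127.0.0.1 that was free at the time of the call."""
    ctx = zmq.Context.instance()
    # socket() requires a socket type; any bindable type works for probing.
    socket = ctx.socket(zmq.REP)
    port = socket.bind_to_random_port("tcp://127.0.0.1")
    # Close (not disconnect) the socket to release the port for the backend.
    socket.close(linger=0)
    return port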
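Since both a controller and a publish address are needed, the same idea could be extended to reserve two distinct ports at once. The sketch below is a dependency-free variant that swaps in the standard library `socket` module (binding to port 0 lets the OS pick a free port) instead of pyzmq; `find_free_ports` and `make_zmq_addresses` are hypothetical helper names. Note that a port found this way is only known to be free at discovery time, so another process could still claim it before the backend binds.

```python
import socket


def find_free_ports(n, host="127.0.0.1"):
    """Return n distinct TCP ports that were free on host when probed."""
    socks = []
    try:
        for _ in range(n):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            # Port 0 asks the OS to assign any currently free port.
            s.bind((host, 0))
            socks.append(s)
        # All probe sockets are still open here, so the ports are distinct.
        return [s.getsockname()[1] for s in socks]
    finally:
        for s in socks:
            s.close()


def make_zmq_addresses(host="127.0.0.1"):
    """Build controller and publish addresses in the tcp://host:port format."""
    controller_port, publish_port = find_free_ports(2, host)
    return f"tcp://{host}:{controller_port}", f"tcp://{host}:{publish_port}"


controller_address, publish_address = make_zmq_addresses()
print(controller_address, publish_address)
```

The resulting strings could then be written to the two ZMQ config fields before launching training, so the GUI and the backend agree on the ports.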