rudra0713 opened this issue 2 years ago
Thanks for your interest in LightGBM. I'll be able to look at this and provide a response in the coming days.
To ensure that maintainers can quickly give you an answer, please provide the following additional information (some of which was requested in the issue template): the versions of dask and distributed you are using. Please also try to reduce the example code to the simplest possible example which reproduces this behavior. For example, networking issues are unlikely to be related to parameters like learning_rate or bagging_fraction.
@rudra0713 did you run this twice in a row? The ports take a bit of time to be released once used, so if you try to reuse them the binding will most likely fail. Take the following example:
```python
from urllib.parse import urlparse

import dask.array as da
import lightgbm as lgb
from dask.distributed import Client

client = Client(n_workers=2, threads_per_worker=1)
worker_addresses = client.scheduler_info()['workers'].keys()
one_worker_address = next(iter(worker_addresses))
host = urlparse(one_worker_address).hostname
machines = ','.join([f'{host}:{port}' for port in (12400, 12401)])

dX = da.random.random((1_000, 2), chunks=(250, 2))
dy = da.random.random(1_000, chunks=(250,))

lgb.DaskLGBMRegressor(n_estimators=5, machines=machines).fit(dX, dy)  # succeeds
lgb.DaskLGBMRegressor(n_estimators=5, machines=machines).fit(dX, dy)
# LightGBMError: Binding port 12401 failed
```
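One way to see whether the second `.fit()` will hit this error is to test whether the training ports can actually be bound again. This is a rough sketch (not part of LightGBM's API), using a plain bind attempt as the check:

```python
import socket

def port_is_free(host: str, port: int) -> bool:
    """Try to bind to (host, port); True only if the bind succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            # e.g. EADDRINUSE: another socket (possibly in TIME_WAIT) holds the port
            return False

# e.g. check both training ports before retrying the next .fit()
print(all(port_is_free("127.0.0.1", p) for p in (12400, 12401)))
```

A port held by a lingering socket will report as not free here, which matches the "Binding port ... failed" error above.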
No, I have run this code more than 5 times, and every time I restarted the cluster. Plus, there were significant delays between tries.
@jameslamb Sorry about the missing information. I have added environment details and simplified my code.
On a related note, I have sometimes seen messages like "Connecting to rank 7 failed, waiting for 120 ms" in the scheduler log. This originates from the linkers_socket.cpp file. Can you kindly tell me what "rank" means and what "connecting to rank failed" means?
Ranks are assigned to each machine in your cluster, so if you have 8 machines you'll have rank 0, rank 1, ..., rank 7. That message means the cluster couldn't connect to that machine; if those messages keep printing, it most likely means that machine died.
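As a rough illustration of that mapping (the addresses below are made up, and this is a sketch of the rank numbering rather than LightGBM's actual internal bookkeeping):

```python
# Hypothetical 8-machine cluster: each entry in the `machines` string
# corresponds to one rank, numbered by its position in the list.
hosts = [f"10.0.0.{i}" for i in range(1, 9)]            # 8 machines -> ranks 0..7
machines = ",".join(f"{host}:12400" for host in hosts)

for rank, entry in enumerate(machines.split(",")):
    print(f"rank {rank} -> {entry}")
# "Connecting to rank 7 failed" would point at the last machine in this list.
```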
Hi @jmoralez @jameslamb, is there any update on this? My goal is to use the LightGBM socket version and submit multiple sequential calls. For example:
```python
for i in range(100, 150):
    dis_res = invoke_dis_lgbm(X, y, i, port_counter)
```
LightGBM uses random open ports for each trial, but in my production system I do not have the capability to open a large pool of ports just so LightGBM can choose from them. What I need to do is give LightGBM a fixed set of ports and make sure that each trial uses the same set of ports. That's why I was relying on the machines parameter, but as I have already shown, the log says binding the port failed. Am I incorrectly setting the machines parameter?
Hi @jmoralez @jameslamb, I think I have found the underlying problem for the socket binding error. As I mentioned already, I am using the machines param so that LightGBM uses a fixed set of ports for each sequential trial. However, after each trial finishes (which takes approximately 7-8 seconds in my case), it takes almost 60-70 seconds before all those ports are open again! From my runs, I saw that after trial 1, 2-3 ports are released almost instantly, but the rest stay in use for a long time. Does this look like a bug to you, or is it happening because of some communication issue between Dask and LightGBM?
> it takes almost 60-70 sec, before all those ports are open again.
Implementations of the TCP protocol often place a socket into a state called TIME_WAIT for some period of time (on the order of a few minutes) after the socket has been closed on either side of the connection. That is done to prevent accidental delivery of data meant for one process to another.
This isn't something I'm deeply familiar with, but there are many discussions about this on Stack Overflow and other sites, e.g. https://serverfault.com/a/329846.
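The effect can be reproduced with plain sockets, independent of LightGBM. This sketch sets up a loopback connection, closes the server side first (the side that closes first is the one that lingers in TIME_WAIT), and then immediately tries to re-bind the same port:

```python
import socket

# A server socket with a completed connection; closing the server side first
# leaves that connection lingering (typically in TIME_WAIT).
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))     # port 0: let the OS pick a free port for the demo
srv.listen(1)
port = srv.getsockname()[1]

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port))
conn, _ = srv.accept()

conn.close()   # server closes the connection first
cli.close()
srv.close()

# An immediate re-bind of the same port typically fails with EADDRINUSE,
# because the old connection still occupies the port.
retry = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    retry.bind(("127.0.0.1", port))
    print("bind succeeded")
except OSError as e:
    print(f"bind failed: {e}")
retry.close()
```

This is the same situation LightGBM runs into when a second training run tries to re-bind the ports from the previous run too quickly.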
> Does this look like a bug...?
It does not look like a bug to me. I don't believe current maintainers or the original authors of LightGBM's distributed training anticipated a use case like yours where you want to do many back-to-back training runs in very quick succession using a fixed set of machines + ports.
> Am I incorrectly setting the machines parameter?

You are correct that machines is the mechanism to use to limit LightGBM to a specific set of ports. It was designed for exactly the situation you've described, where limited ports are available based on your organization's firewall rules.

I don't know if you're setting machines correctly, as you haven't shared with us how you are setting machines yet. All the examples you've provided so far have been using a single local Dask cluster on 127.0.0.1.
> > Does this look like a bug...?
>
> It does not look like a bug to me. I don't believe current maintainers or the original authors of LightGBM's distributed training anticipated a use case like yours where you want to do many back-to-back training runs in very quick succession using a fixed set of machines + ports.
For example, let's say I am doing hyperparameter optimization. I have some number of LightGBM configurations (where the number of estimators or the max depth of a tree is varied). I want to submit all these LightGBM calls, in parallel or sequentially in a loop, while using a fixed set of machines + ports. My dataset is quite large, so I need the support of dask/distributed. What's your suggestion, then, on how to proceed?
> I don't know if you're setting machines correctly, as you haven't shared with us how you are setting machines yet. All the examples you've provided so far have been using a single local Dask cluster on 127.0.0.1.
This is how I started using the machines parameter. port_counter is set to a known open port on my node, like 10001 or 5501. Basically, additional_ports holds a list of ports that I know to be open on my machine(s).
```python
client_lgbm = get_client()

# extract the host from each worker address, e.g. 'tcp://10.0.0.1:35001' -> '10.0.0.1'
hosts = [key.split('//')[1].split(":")[0] for key in client_lgbm.has_what()]

# assign one fixed port per worker, starting at port_counter
additional_ports = range(port_counter, port_counter + num_of_workers)
machines_list = [host + ":" + str(port) for host, port in zip(hosts, additional_ports)]
machines = ','.join(machines_list)

lgbm_cls = lgb.DaskLGBMClassifier(client=client_lgbm, objective='binary',
                                  n_estimators=number_of_estimators,
                                  machines=machines)
```
Are you familiar with TCP sockets and C++?
I'd support a change to make LightGBM's distributed training more resilient to the situation where it tries to bind to a port that has an existing socket on it in a status like TIME_WAIT.

For example, LightGBM could wait longer before raising that exception. Alternatively, LightGBM could be more aggressive about forcibly closing sockets and not leaving them in TIME_WAIT (see, for example, this Stack Overflow answer).
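Both ideas (waiting longer before giving up, and tolerating TIME_WAIT sockets via SO_REUSEADDR) can be sketched in a few lines. This is an illustration in Python, not LightGBM's actual C++ code; the helper name and retry parameters are made up:

```python
import socket
import time

def bind_with_retry(host, port, retries=5, delay=1.0):
    """Sketch of the two mitigations discussed above (hypothetical helper):
    set SO_REUSEADDR so sockets lingering in TIME_WAIT don't block the bind,
    and retry a few times before propagating the error."""
    for attempt in range(retries):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # allow binding over a connection that is only lingering in TIME_WAIT
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return s
        except OSError:
            s.close()
            if attempt == retries - 1:
                raise  # out of retries: surface the bind failure
            time.sleep(delay)

sock = bind_with_retry("127.0.0.1", 0)  # port 0 = any free port, for the demo
sock.close()
```

Note that SO_REUSEADDR only helps against TIME_WAIT leftovers; it does not allow binding over a port that another process is actively listening on.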
If you're not comfortable attempting such contributions, then we could take that as a feature request.
Some things you could try that would work without any code changes in LightGBM:

- change local_listen_port on each training run (if you only have one worker process per machine, e.g. nprocs=1 in your Dask settings, then this could work): the first run uses [12400, 12401, 12402], the next run uses [12403, 12404, 12405], the next run uses [12400, 12401, 12402] again (with hopefully enough available ports + training time per run such that by the time you try re-using the first group again, those ports are available again)
- add time.sleep(240) or similar after each training call
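The port-rotation idea above can be sketched like this (the hostnames are hypothetical, and the commented-out call stands in for the actual training call):

```python
from itertools import cycle

# Alternate between non-overlapping groups of ports, so each group has time
# to leave TIME_WAIT before it is reused.
port_groups = cycle([[12400, 12401, 12402], [12403, 12404, 12405]])

hosts = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical worker hosts

for trial in range(4):
    ports = next(port_groups)
    machines = ",".join(f"{h}:{p}" for h, p in zip(hosts, ports))
    print(f"trial {trial}: machines={machines}")
    # lgb.DaskLGBMRegressor(..., machines=machines).fit(dX, dy)
```

Trials 0 and 2 share one port group and trials 1 and 3 share the other, so each group gets roughly one full training run's worth of time to free up.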
Description
On a single machine with 8 workers, I am trying to run distributed LightGBM in order to understand the role of the machines parameter. I have created a machines list with 8 new ports (all of these ports are open). However, every time I pass the list of machines, I get the error "Binding Socket Failed" for all of the newly added ports. I am trying to figure out what I am doing wrong here.
Reproducible example
Here is the full log:
In the scheduler log, I can see that port binding has failed for all of the ports.
Environment Information:

- OS: Linux
- LightGBM: 3.3.2.99 (built from source, but I have also tried the pip install)
- dask: 2021.9.1
- distributed: 2021.9.1
- Python: 3.8.6