@ffineis I've assigned this to myself because I'm actively working on it right now. If you have ideas of stuff to try, leave a comment here.
@StrikerRUS I know you're somewhat of a LightGBM historian :joy: . If you remember similar issues with other LightGBM parallel training, please link them here.
> I know you're somewhat of a LightGBM historian 😂 .

🤣

> If you remember similar issues with other LightGBM parallel training, please link them here.

Unfortunately I don't... Probably @imatiach-msft has faced something similar.
As a side note, I'm not sure, but I guess --depth=1 will not allow executing git checkout ${LIGHTGBM_COMMIT} with a LIGHTGBM_COMMIT other than the latest commit in the repo at the time of cloning.
> As a side note, I'm not sure, but I guess --depth=1 will not allow executing git checkout ${LIGHTGBM_COMMIT} with a LIGHTGBM_COMMIT other than the latest commit in the repo at the time of cloning.
ha oh yeah, guess I got lucky with the commit I chose. Fixed.
This morning, I tested the theory that LightGBM just isn't cleaning up the network correctly. I tried to test this by changing local_listen_port to a new value on each training run, so that each run used a different, non-overlapping range of ports. But that did not work.
I added this to the beginning of each test iteration
local_listen_port += n_workers
And then ran 10 times in a row again.
# imports assumed by this snippet (not shown in the original post)
import dask.array as da
from dask.distributed import Client, LocalCluster, wait
from lightgbm.dask import DaskLGBMRegressor  # import path assumed for the lightgbm master branch referenced above

n_workers = 3
local_listen_port = 12400

...  # cluster / client setup elided; `client` is the dask.distributed Client

# run on each test iteration
client.restart()

# move to a fresh, non-overlapping port range for this run
local_listen_port += n_workers
print(f"local_listen_port: {local_listen_port}")

num_rows = 1_000_000
num_features = 100
num_partitions = 10
rows_per_chunk = num_rows // num_partitions

# random training data, chunked into num_partitions pieces spread across the workers
data = da.random.random((num_rows, num_features), chunks=(rows_per_chunk, num_features))
labels = da.random.random((num_rows, 1), chunks=(rows_per_chunk, 1))

data = data.persist()
labels = labels.persist()
_ = wait(data)
_ = wait(labels)

dask_reg = DaskLGBMRegressor(
    silent=False,
    max_depth=5,
    random_state=708,
    objective="regression_l2",
    learning_rate=0.1,
    tree_learner="data",
    n_estimators=10,
    min_child_samples=1,
    n_jobs=-1,
    local_listen_port=local_listen_port
)
dask_reg.fit(client=client, X=data, y=labels)
NOTE: I'm going to keep posting debugging findings here, but please feel free to unsubscribe from this issue. I'll open a PR or @ people when I think I've found a solution.
Summary
I tried the MPI version of LightGBM and found that this issue doesn't occur, but only because lightgbm.dask doesn't correctly use the MPI version of LightGBM. That can be a separate issue.
@jameslamb I think this is all expected; I don't see anything wrong with lightgbm itself - ports could already be in use for a variety of reasons. The problem is with the dask-lightgbm code, since it doesn't try different ports. In mmlspark, we have special retry logic that tries a specific port and, if it doesn't work, tries the next port.
@jameslamb this is written in scala but it might be useful:
Note how we try different ports, starting from a user-specified default listen port, up to the number of tasks on the machine - we don't assume that some port range will just always work. Once we have the ports we are interested in, we send them back to the driver, which aggregates them and then sends them back to all of the workers, which then call network init and start the parallel training.
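Roughly, in Python, that retry idea looks something like the sketch below (my own illustration, not the mmlspark code; the function name is made up): start from the user-supplied base port and keep moving to the next port until bind() succeeds, up to some limit.

import socket

def find_open_port(base_port, max_tries=1000):
    """Return the first port >= base_port that can currently be bound."""
    for port in range(base_port, base_port + max_tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("", port))
                return port
            except OSError:
                # port already in use (or lingering in TIME_WAIT); try the next one
                continue
    raise RuntimeError(f"no open port found in [{base_port}, {base_port + max_tries})")

As described above, the ports actually found on each machine would then need to be sent back to the scheduler, aggregated, and redistributed to every worker before network init.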
Nice! Then it just sounds like the _build_network_params method in dask.py needs to be made a little more adaptable than the existing
addr_port_map = {addr: (local_listen_port + i) for i, addr in enumerate(worker_addresses)}
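One possible shape for that (a sketch only; the function and variable names here are hypothetical, not lightgbm.dask internals): instead of assuming local_listen_port + i is free on every worker, run a small probe on each worker via client.run() and build the mapping from whatever ports the workers report back.

import socket

def pick_free_port():
    # binding to port 0 asks the OS for any currently free port; note there is still
    # a small race window between closing this socket and LightGBM re-binding the port
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

def build_addr_port_map(client, worker_addresses):
    # client.run() executes the probe in every worker process and returns
    # {worker_address: result}, which is exactly the shape addr_port_map needs
    found_ports = client.run(pick_free_port, workers=list(worker_addresses))
    return {addr: found_ports[addr] for addr in worker_addresses}

# usage: addr_port_map = build_addr_port_map(client, worker_addresses)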
If/when the fix for finding available ports gets developed, it'd be worth keeping in mind a related issue I've been coming across when worker memory reaches a critical mass and Dask restarts the worker. Perhaps this issue could be tackled with the port-trying in the same go?

When Dask restarts a worker, the new worker has a different port number, so the call to get_worker().address in _train_part isn't keeping up with the list of worker addresses defined in train, and we get a KeyError (the old worker address no longer refers to a current Dask worker).
This issue was called out in dask-lightgbm just prior to 3515. Just mentioning this because it seems related to the issue at hand here.
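A stripped-down illustration of that failure mode (addresses and names below are made up, not lightgbm.dask internals): the address-to-port map is frozen when training starts, but get_worker().address can change if Dask restarts a worker, so the later lookup fails.

# map built once, from the worker addresses known when training started
addr_port_map = {
    "tcp://127.0.0.1:39001": 12400,
    "tcp://127.0.0.1:39002": 12401,
}

# a restarted worker comes back on a different address
current_address = "tcp://127.0.0.1:40115"

try:
    port = addr_port_map[current_address]
except KeyError:
    # this is the KeyError described above: the stale address list no longer
    # matches the workers that actually hold the data
    port = None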
Thanks for noting that, and to @imatiach-msft for the ideas! I think one issue I'm seeing is that lightgbm.dask is returning before it has cleaned up the sockets it created.
I tried running training with my example above, and in a shell I ran the following
netstat -an | grep 124
During training on a LocalCluster with three workers, I get a result like this:
tcp 99154 0 127.0.0.1:55612 127.0.0.1:12402 ESTABLISHED
tcp 0 198308 127.0.0.1:12402 127.0.0.1:55612 ESTABLISHED
tcp 0 0 127.0.0.1:39462 127.0.0.1:12401 ESTABLISHED
tcp 0 39434 127.0.0.1:12401 127.0.0.1:39462 ESTABLISHED
tcp 412 0 127.0.0.1:55614 127.0.0.1:12402 ESTABLISHED
tcp 0 412 127.0.0.1:12402 127.0.0.1:55614 ESTABLISHED
Once training ends, that same command returns this:
tcp 0 0 127.0.0.1:12402 127.0.0.1:55612 TIME_WAIT
tcp 0 0 127.0.0.1:12401 127.0.0.1:39462 TIME_WAIT
tcp 0 0 127.0.0.1:12402 127.0.0.1:55614 TIME_WAIT
I ran that every few seconds and it returned that result for about 2 minutes, then returned nothing. Once I saw that that command didn't return anything, re-running training succeeded.
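To automate that check, a small helper like the following could poll until the port becomes bindable again (my own sketch, assuming a failed bind() is an adequate stand-in for the netstat check):

import socket
import time

def wait_until_bindable(port, timeout=300, interval=5):
    """Poll every `interval` seconds until `port` can be bound or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
                return True
            except OSError:
                pass
        time.sleep(interval)
    return False

# e.g. wait_until_bindable(12402) before kicking off the next training run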
So I think there are two issues:

1. Sockets created during training are left in the TIME_WAIT state for a few minutes after training ends.
2. lightgbm.dask is trying a fixed set of ports.

I think the "look for available ports first" solution is going to fix this. Hopefully I'll have a PR up shortly :grinning:
I'm not that familiar with the low-level details of TCP, but this blog post seems relevant to what's happening: http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html.
> The reason that TIME_WAIT can affect system scalability is that one socket in a TCP connection that is shut down cleanly will stay in the TIME_WAIT state for around 4 minutes. If many connections are being opened and closed quickly then socket's in TIME_WAIT may begin to accumulate on a system; you can view sockets in TIME_WAIT using netstat.
@jameslamb I agree with your findings - it would be great to find a cleaner way to close ports if possible. However, the current implementation is wrong anyway because it assumes all ports in those ranges are available. In practice, any application could be running and using some of those ports already - and the only way to find out if they are already used is actually to just try and bind to them. So even if the ports are closed in a cleaner way, it would still be better to find open ports first before assuming they are already open.
> it would still be better to find open ports first before assuming they are already open.
totally agree, thanks for pointing me in the right direction!! I almost have something working right now, will open a PR soon.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
How are you using LightGBM?
LightGBM component: Python package
Environment info
Operating System: Ubuntu 18.04
C++ compiler version: 9.3.0
CMake version: 3.16.3
Python version: 3.8.5
LightGBM version or commit hash: master (https://github.com/microsoft/LightGBM/commit/78d31d9ae34c9c8ab2b0f8704c6962c8d3510062)

Error message and / or logs
Training with DaskLGBMRegressor often fails with an error like this:

The error doesn't ALWAYS happen, and training sometimes succeeds. It also doesn't always reference port 12402. I've found that the first time I call DaskLGBMRegressor.fit(), I don't see this error. After that, subsequent tries often result in the error.

Here's the result of 10 calls of .fit(), with client.restart() run after each one.

Steps to reproduce
This is based on https://github.com/jameslamb/talks/tree/main/recent-developments-in-lightgbm, but copying in the steps in case that repo changes in the future.
Create a Dockerfile with the following contents:

That should succeed, and if you click the printed link, you should see the Dask diagnostic dashboard.

Run client.restart(), which clears the memory on all worker processes and removes any work from the scheduler.

I expect that you'll see a similar pattern as the one noted above. Training will sometimes succeed, but often fail with an error like "cannot bind port XXXX".
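Since the Dockerfile contents and shell commands referenced above are not reproduced here, the following is only a rough sketch of the fit/restart loop being described, with the cluster size and hyperparameters trimmed down (the DaskLGBMRegressor import path assumes lightgbm master with the Dask interface):

import dask.array as da
from dask.distributed import Client, LocalCluster, wait
from lightgbm.dask import DaskLGBMRegressor

client = Client(LocalCluster(n_workers=3))

for i in range(10):
    data = da.random.random((1_000_000, 100), chunks=(100_000, 100)).persist()
    labels = da.random.random((1_000_000, 1), chunks=(100_000, 1)).persist()
    _ = wait([data, labels])

    dask_reg = DaskLGBMRegressor(objective="regression_l2", tree_learner="data", n_estimators=10)
    try:
        dask_reg.fit(client=client, X=data, y=labels)
        print(f"run {i}: succeeded")
    except Exception as err:
        print(f"run {i}: failed with {err!r}")

    # clear worker memory and scheduler state before the next attempt
    client.restart()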
Other information
I've noticed that often when this happens, it seems like maybe some of the worker processes were killed and restarted. I don't see messages about that in the logs, but the memory utilization for the workers is really uneven.
I've observed this behavior on FargateClusters from dask-cloudprovider and on the dask-kubernetes clusters from Saturn Cloud. So I don't think this issue is specific to the Dask docker image I used in the example, or to the use of LocalCluster.

I've also observed this behavior using dask-lightgbm built from current master, with LightGBM 3.0.0.

Given all of that, my best guess is that there is some race condition where workers join the LightGBM cluster in a nondeterministic order, or maybe where two workers claim the same rank.