jameslamb opened this issue 1 year ago
Linking possible source: https://github.com/dask/distributed/pull/6714
Great find, @jmoralez!!!
I just read through that PR, totally agree with you that it looks very relevant and is probably the root cause for this.
I wonder if the way that the failing test leaves the cluster with an error on one worker results in a situation where the worker can't shut down cleanly.
I believe the breaking change in that PR you linked is that `restart()` will now, by default, block until all workers shut down and restart successfully.
Based on https://docs.dask.org/en/stable/_modules/distributed/client.html#Client.restart, I guess we could try something like the following to recover the old behavior?
```python
client.restart(wait_for_workers=False)
client.wait_for_workers(n_workers=1)
```
Or, alternatively, we can try to figure out why raising an error the way that this test intentionally does creates issues for `Client.restart()`.
@jmoralez do you have time to look into this this week? I'd be happy to back off and let you take it if you're interested.
@gjoseph92, also tagging you since you were the author of https://github.com/dask/distributed/pull/6714, and maybe whatever we're doing here in LightGBM is weird/unexpected and might be informative to you as you continue working on `distributed`.
Sorry about that. Yeah, `client.restart` is now stricter and verifies that the restart was actually successful.

I don't understand what you're doing with the socket, but presumably you're doing something that makes the worker block during `Worker.close`. I'm kinda curious what the motivation for that is / what you're trying to test (and if this is something LightGBM users would encounter and have issues with like this test does)? Another important question: what happens to the stuck worker after the test? Does it ever actually shut down, or is the process leaked and just keeps running?
Regardless, what @jameslamb said is what you'll want to match the previous behavior here:
```python
client.restart(wait_for_workers=False)
client.wait_for_workers(n_workers=1)
```
Also, at some point soon-ish, I'm hoping to finish up and merge https://github.com/dask/distributed/pull/6427. This might break your test yet again.
Currently, the `Nanny.kill` process goes:

1. `Worker.close` is requested, asking the worker to shut down gracefully.
2. If that doesn't finish in time, a SIGTERM is sent. If you were running `dask-worker` (like users would be), that would just trigger `Worker.close` again, so effectively a no-op.

After that PR:

1. `Worker.close` is requested, as before.
2. If the worker still hasn't shut down, a SIGKILL is sent.

The key difference is that maybe what you have right now is managing to block/ignore the SIGTERM, preventing the stuck worker from shutting down. Since you can't block a SIGKILL, after this PR is merged I would expect (and hope) that `client.restart` would result in both workers restarting successfully, not just 1. If you're counting on the stuck worker to survive the restart, I don't think it would (it's a bug that it's able to right now).
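As an aside on why that difference matters, here's a small standalone illustration (plain Python on Linux, nothing dask-specific) of the fact that a process can ignore SIGTERM but can never ignore SIGKILL:

```python
import signal

# a process can install a handler for (or ignore) SIGTERM...
signal.signal(signal.SIGTERM, signal.SIG_IGN)

# ...but SIGKILL can never be caught, blocked, or ignored;
# trying to install a handler for it raises an error
try:
    signal.signal(signal.SIGKILL, signal.SIG_IGN)
except OSError as exc:
    print(f"cannot ignore SIGKILL: {exc}")
```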
Thanks very much for that @gjoseph92! And no need to apologize... my `@` wasn't a passive-aggressive "hey why did you break our tests", I promise 😂
> I'm kinda curious what the motivation for that is / what you're trying to test
LightGBM has its own distributed computing setup (e.g. to perform collective operations across multiple processes) that `lightgbm.dask` just wraps. In that setup, worker processes communicate over TCP sockets. For users in highly-restricted environments (e.g. behind a strict firewall), we provide the option to explicitly list the ports used for those sockets.
This test checks that the expected error is raised in the situation where someone specifies a port that is already in use at the time that training starts.
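For concreteness, a rough sketch of that scenario (the `machines` parameter and `DaskLGBMClassifier` come from LightGBM's Dask API, but this setup is illustrative rather than the test's actual code):

```python
import socket

import lightgbm as lgb

# occupy a port, so LightGBM's network setup can't bind to it
blocker = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
blocker.bind(("127.0.0.1", 0))
occupied_port = blocker.getsockname()[1]

# tell distributed training to use exactly that address:port ...
clf = lgb.DaskLGBMClassifier(machines=f"127.0.0.1:{occupied_port}")
# ... fitting on a Dask cluster should then fail with a "port in use"-style error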
> what happens to the stuck worker after the test? Does it ever actually shut down, or is the process leaked and just keeps running?
I suspect (and have suspected for a while) that the way we raise errors in `lightgbm.dask` could be improved, and might result in leaving workers in a bad state.
I wouldn't be surprised to learn that the process is leaked and keeps running. Since our only use of `Client.restart()` is in a test, and since that test runs only on remote machines during CI, it would be easy for us to miss a leaked process.
> If you're counting on the stuck worker to survive the restart
We definitely are not, and the new behavior you've described sounds like a nice improvement to `Client.restart()` 😁
> Or, alternatively, we can try to figure out why raising an error the way that this test intentionally does creates issues for `Client.restart()`
It's not the error itself; it's actually the other worker that gets stuck. Since the distributed training expects all machines to be ready, when one fails to bind the port, the other ones just wait forever. That's what I thought `restart` fixed, but I'm not sure anymore haha. I think one better fix would be to add a timeout to the distributed training, so that if one of the machines fails to connect, the process is interrupted in the rest of them. I'll give that a shot and let you know.
> add a timeout to the distributed training, so that if one of the machines fails to connect, the process is interrupted in the rest of them
oh interesting! LightGBM does support that by the way --> https://lightgbm.readthedocs.io/en/v3.3.2/Parameters.html#time_out
but it defaults to 120 minutes 😬
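For reference, a sketch of overriding that default through the Dask interface (assuming `time_out`, in minutes, passes through like any other LightGBM parameter):

```python
import lightgbm as lgb

# lower LightGBM's network-setup socket timeout from its 120-minute default
clf = lgb.DaskLGBMClassifier(time_out=2)
```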
Nice! That should work, I'll try it.
It seems that timeout is for individual sockets, we need to have one for the whole network setup. This is the relevant part: https://github.com/microsoft/LightGBM/blob/f94050a4cc94909e10a0064baff11cec795eb250/src/network/linkers_socket.cpp#L196-L216
So it doesn't wait forever; it has a finite number of retries and waits longer each time. I added a log there to see how many retries it would make, and there were 20. However, after these 20 tries, instead of failing the setup, it assumed everything was OK and then segfaulted.
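For intuition, that retry loop behaves roughly like the following Python paraphrase (names, delays, and structure are mine, not the actual C++ implementation):

```python
import time

def connect_with_backoff(try_connect, max_retries=20, base_delay_seconds=0.1):
    """Retry a connection attempt, waiting longer after each failure."""
    delay = base_delay_seconds
    for _ in range(max_retries):
        if try_connect():
            return True
        time.sleep(delay)
        delay *= 2  # back off: wait longer before the next attempt
    # per the observation above, the real code apparently did NOT fail
    # here after exhausting its retries, which led to a segfault downstream
    return False
```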
I'll keep investigating this
Had no luck haha. I removed the `restart` from the test and raised an exception when the network setup took too long, but the test gets stuck because there are pending tasks and the worker seems to be dead.
Current `master` fails in the `Linux regular` and `Linux bdist` jobs after 1h:
```text
##[error]The job running on agent lgbubb5e0000SJI ran longer than the maximum time of 60 minutes. For more information, see https://go.microsoft.com/fwlink/?linkid=2077134
...
============================= test session starts ==============================
platform linux -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0
rootdir: /__w/1/s
collected 719 items
../tests/c_api_test/test_.py .. [ 0%]
../tests/python_package_test/test_basic.py ............................. [ 4%]
............................................ [ 10%]
../tests/python_package_test/test_callback.py ............ [ 12%]
../tests/python_package_test/test_consistency.py ...... [ 12%]
../tests/python_package_test/test_dask.py FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [ 17%]
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF...FFFFFFFFFFFFFFFFF [ 27%]
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [ 37%]
FFFFFFFFFFFFFFFF
##[error]The operation was canceled.
```
Just checked the diff between dependencies in the last successful CI run and the current failing one. The only difference is in `scipy`:

```text
scipy-1.8.1 | py39h8ba3f38_2 25.0 MB conda-forge
scipy-1.9.0 | py39h8ba3f38_0 26.2 MB conda-forge
```
Thanks for capturing that before the logs expire, @StrikerRUS!
Unless anyone gets to it sooner, I'll investigate this tonight.
It's quite strange that only `Linux regular` and `Linux bdist` are affected. `scipy` has been upgraded in `Linux mpi_source`, for example, as well...
Seems that pinning the `scipy` version works.
`scipy` 1.9.0 changelog: http://scipy.github.io/devdocs/release.1.9.0.html

The only item there that, in my opinion, can be related to our issue is this one. Specifically:

> The build defaults to using OpenBLAS
That could definitely be related, especially if something on the `dask`-specific code paths leads to conflicting shared-library loading (like some of the issues we've talked about in #5106).
I was able to reproduce this in Docker tonight, using the same image and environment variables as the `Linux bdist` job.
```shell
# Linux bdist
docker run \
    --rm \
    --env AZURE='true' \
    --env BUILD_DIRECTORY=/opt/LightGBM \
    --env CONDA_ENV='test-env' \
    --env LGB_VER=$(head -n 1 VERSION.txt) \
    --env PYTHON_VERSION='3.8' \
    --env TASK=bdist \
    --workdir=/opt/LightGBM \
    -v $(pwd):/opt/LightGBM \
    -it lightgbm/vsts-agent:ubuntu-14.04 \
    /bin/bash

# inside the container
export PATH="$CONDA/bin":${PATH}
${BUILD_DIRECTORY}/.ci/setup.sh
${BUILD_DIRECTORY}/.ci/test.sh
```
With those commands, as of latest `master`, `scipy==1.8.1` is installed and all Dask tests succeed.

Then, I upgraded to `scipy==1.9.0` and the Dask tests failed.
```shell
conda install \
    --yes \
    --name test-env \
    'scipy==1.9.0'

source activate test-env

# just target a single failing test, for faster feedback
pytest \
    -v \
    "$BUILD_DIRECTORY/tests/python_package_test/test_dask.py::test_classifier[data-rf-binary-classification-scipy_csr_matrix]"
```
Running it like that revealed the underlying error!
```text
>       self._handle = _dlopen(self._name, mode)
E       OSError: dlopen: cannot load any more object with static TLS
```
We've seen that issue in the Linux jobs on Ubuntu 14.04 before.
And as noted in https://github.com/conda-forge/conda-forge.github.io/issues/1551:

> This has some unfortunate side effects like how changing the import order between libraries makes the error appear / go away. Such kinds of accidents lead to the proliferation of unfortunate (because: randomly working or not) advice...
but, on the other hand, in https://github.com/pytorch/pytorch/issues/2575#issuecomment-523657178 (which is linked from that `conda-forge` issue):

> importing something that gets `libgomp` early enough should be the recommended workaround.

(supported by https://github.com/pytorch/pytorch/issues/2575#issuecomment-773693437)
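One common form of that workaround is to force-load `libgomp` before importing anything that would otherwise exhaust the static TLS slots. A sketch (assuming `libgomp.so.1`, the usual soname on Linux; adjust for your environment):

```python
import ctypes

# pre-load libgomp into the global symbol namespace before other heavy
# imports, so it grabs a static TLS slot while slots are still available
ctypes.CDLL("libgomp.so.1", mode=ctypes.RTLD_GLOBAL)

import lightgbm  # noqa: E402  (deliberately imported after the pre-load)
```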
It's suggested in those linked issues (and in @StrikerRUS's comment) that this error can be avoided by using a version of glibc > 2.21.
Unfortunately, it looks like the Ubuntu 14.04 image we have uses glibc 2.19.
```shell
docker run \
    --rm \
    -it lightgbm/vsts-agent:ubuntu-14.04 \
    ldd --version
```

```text
ldd (Ubuntu EGLIBC 2.19-0ubuntu6.15) 2.19
```
So given that... maybe #5022 "worked" for so long because it resulted in different versions of some dependencies, which resulted in a different library-loading order when importing them... and now the new `scipy` release has caused a slightly different library-loading order, which causes this issue to reoccur?
I tried running the tests with `LD_DEBUG=libs` set (see here for docs on that variable), to see what was being searched.
Like this.
```shell
LD_DEBUG=libs \
pytest \
    -v \
    "$BUILD_DIRECTORY/tests/python_package_test/test_dask.py::test_classifier[data-rf-binary-classification-scipy_csr_matrix]" \
    > ./out.txt 2>&1
```
Here's a sample of what is produced:
```text
2622: find library=libpthread.so.0 [0]; searching
2622:  search path=/opt/miniforge/envs/test-env/bin/../lib/tls/x86_64:/opt/miniforge/envs/test-env/bin/../lib/tls:/opt/miniforge/envs/test-env/bin/../lib/x86_64:/opt/miniforge/envs/test-env/bin/../lib (RPATH from file /opt/miniforge/envs/test-env/bin/python)
2622:   trying file=/opt/miniforge/envs/test-env/bin/../lib/tls/x86_64/libpthread.so.0
2622:   trying file=/opt/miniforge/envs/test-env/bin/../lib/tls/libpthread.so.0
2622:   trying file=/opt/miniforge/envs/test-env/bin/../lib/x86_64/libpthread.so.0
2622:   trying file=/opt/miniforge/envs/test-env/bin/../lib/libpthread.so.0
2622:  search cache=/etc/ld.so.cache
2622:   trying file=/lib/x86_64-linux-gnu/libpthread.so.0
2622:
2622: find library=libdl.so.2 [0]; searching
2622:  search path=/opt/miniforge/envs/test-env/bin/../lib (RPATH from file /opt/miniforge/envs/test-env/bin/python)
2622:   trying file=/opt/miniforge/envs/test-env/bin/../lib/libdl.so.2
2622:  search cache=/etc/ld.so.cache
2622:   trying file=/lib/x86_64-linux-gnu/libdl.so.2
```
Haven't had a chance to compare those outputs yet (they require a bit of modification to compare in a text differ), but wanted to put that command up here in case it's useful for others investigating this or finding this issue from search engines.
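For anyone attempting that comparison, one hypothetical way to do the "bit of modification": strip the leading PID column (which differs between runs) before diffing. The file names here are made up:

```python
import re

def normalize_ld_debug(path):
    """Drop the leading 'PID:' prefix from each LD_DEBUG=libs output line."""
    with open(path) as f:
        return [re.sub(r"^\s*\d+:\s*", "", line.rstrip("\n")) for line in f]

# e.g. feed these into difflib.unified_diff(), or write them back to files
old_lines = normalize_ld_debug("out_scipy181.txt")
new_lines = normalize_ld_debug("out_scipy190.txt")
```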
@jameslamb Thank you very much for your investigation!
Maybe it's time to bump the minimum required `glibc` version (`2.14` -> `2.23`) in LightGBM 4.0.0 and switch to an Ubuntu 16.04 image for artifact production?
https://en.wikipedia.org/wiki/Glibc#Version_history

UPD: Or, even better, to one of the standard `manylinux` PyPI images (linking https://github.com/microsoft/LightGBM/issues/4484#issuecomment-889045289 and https://github.com/microsoft/LightGBM/pull/3421#issuecomment-711559332).
@StrikerRUS sorry it took me so long to respond to this!
I agree with the proposal to bump the minimum `glibc` version and to start publishing wheels using one of the standard `manylinux` images.
> start publishing wheels using one of the standard `manylinux` images
@StrikerRUS I just returned from traveling and have some time to work on this.
But I saw that you've been pushing commits directly to `dev`: https://github.com/guolinke/lightgbm-ci-docker/compare/dev (noticed because of https://github.com/microsoft/LightGBM/pull/5252#issuecomment-1229541646).

Would you like me to wait until you're done with that effort to support the aarch64 wheels? And then would you like me to make a PR into `dev` in that repo with proposed changes to the image?
@jameslamb
> have some time to work on this.
Nice to hear! 🙂
> Would you like me to wait until you're done with that effort to support the aarch64 wheels?
It depends on how long it will take to merge #5252. I believe we need at least one more review there.
I think I can configure pushes to Docker Hub for a new branch, let's say `dev-2`, and you can test your ideas in that branch right now.
@jameslamb
Done! Feel free to do your experiments in the `dev2` branch.

You can check the diff for the `.github/workflows/publish_image.yml` file for how to make it possible to push multiple Dockerfiles to Docker Hub.
BTW, I guess we need to create a separate issue for discussing the migration to an official `manylinux` image.
> separate issue for discussing migration to official `manylinux` image
opened #5514
I'm working on this in #5597.
I've stopped working on this. Will post here if I start investigating it again in the future.
Just noting that even though this discussion is very old and references an old version of Dask, it shouldn't be closed until we've restored this unit test.
Contributions welcome if anyone with Dask knowledge wants to investigate that!
Description
Created from https://github.com/microsoft/LightGBM/pull/5388#issuecomment-1195848451.
All CUDA CI jobs and several Linux jobs are failing. The `client.restart()` calls in that test are resulting in the following traceback:
```text
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/client.py:3360: in restart
    return self.sync(
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:338: in sync
    return sync(
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:405: in sync
    raise exc.with_traceback(tb)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:378: in f
    result = yield future
/root/miniforge/envs/test-env/lib/python3.9/site-packages/tornado/gen.py:762: in run
    value = future.result()
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/client.py:3329: in _restart
    await self.scheduler.restart(timeout=timeout, wait_for_workers=wait_for_workers)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:1153: in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:943: in send_recv
    raise exc.with_traceback(tb)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:769: in _handle_comm
    result = await result
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:778: in wrapper
    return await func(*args, **kwargs)
```

It looks like those jobs are getting `dask` and `distributed` 2022.7.1, which hit `conda-forge` 3 days ago.

Reproducible example
Here's an example: https://github.com/microsoft/LightGBM/runs/7522939980?check_suite_focus=true
I don't believe the failure is related to anything specific on the PR that that failed build came from.
Additional Comments
Note that this should not be a concern for jobs using Python < 3.8, as `dask`/`distributed` have dropped support for those Python versions.

Logs from an example build on #5388 where I tried to pin to exactly `dask==2022.7.0` (build link):