microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[ci] [dask] CI jobs failing with Dask 2022.7.1 #5390

Open jameslamb opened 1 year ago

jameslamb commented 1 year ago

Description

Created from https://github.com/microsoft/LightGBM/pull/5388#issuecomment-1195848451.

All CUDA CI jobs and several Linux jobs are failing with the following.

FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[binary-classification]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[multiclass-classification]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[regression]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[ranking]
= 4 failed, 700 passed, 10 skipped, 2 xfailed, 395 warnings in 655.51s (0:10:55) =

client.restart() calls in that test are resulting in the following:

raise TimeoutError(f"{len(bad_nannies)}/{len(nannies)} nanny worker(s) did not shut down within {timeout}s") E asyncio.exceptions.TimeoutError: 1/2 nanny worker(s) did not shut down within 120s

traceback (click me)

```text
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/client.py:3360: in restart
    return self.sync(
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:338: in sync
    return sync(
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:405: in sync
    raise exc.with_traceback(tb)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:378: in f
    result = yield future
/root/miniforge/envs/test-env/lib/python3.9/site-packages/tornado/gen.py:762: in run
    value = future.result()
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/client.py:3329: in _restart
    await self.scheduler.restart(timeout=timeout, wait_for_workers=wait_for_workers)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:1153: in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:943: in send_recv
    raise exc.with_traceback(tb)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:769: in _handle_comm
    result = await result
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:778: in wrapper
    return await func(*args, **kwargs)
```

It looks like those jobs are getting dask and distributed 2022.7.1

    dask-2022.7.1              |     pyhd8ed1ab_0           5 KB  conda-forge
    dask-core-2022.7.1         |     pyhd8ed1ab_0         840 KB  conda-forge
    dbus-1.13.6                |       h5008d03_3         604 KB  conda-forge
    distributed-2022.7.1       |     pyhd8ed1ab_0         735 KB  conda-forge

which hit conda-forge 3 days ago.


Reproducible example

Here's an example: https://github.com/microsoft/LightGBM/runs/7522939980?check_suite_focus=true

I don't believe the failure is related to anything specific to the PR that the failed build came from.

Additional Comments

Note that this should not be a concern for jobs using Python < 3.8, as dask / distributed have dropped support for those Python versions.

Logs from an example build on #5388 where I tried to pin to exactly dask==2022.7.0 (build link):

UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

  - dask==2022.7.0 -> python[version='>=3.8']
  - distributed==2022.7.0 -> python[version='>=3.8']

jmoralez commented 1 year ago

Linking possible source: https://github.com/dask/distributed/pull/6714

jameslamb commented 1 year ago

Great find @jmoralez !!!

I just read through that PR, totally agree with you that it looks very relevant and is probably the root cause for this.

I wonder if the way that the failing test leaves the cluster with an error on one worker results in a situation where the worker can't shut down cleanly.

I believe the breaking change in that PR you linked is that restart() will now, by default, block until all workers shut down and restart successfully.

https://github.com/microsoft/LightGBM/blob/f94050a4cc94909e10a0064baff11cec795eb250/tests/python_package_test/test_dask.py#L1643-L1652

Based on https://docs.dask.org/en/stable/_modules/distributed/client.html#Client.restart, I guess we could try something like the following to recover the old behavior?

client.restart(wait_for_workers=False)
client.wait_for_workers(n_workers=1)

Or, alternatively, we can try to figure out why raising an error the way that this test intentionally does creates issues for Client.restart().

@jmoralez do you have time to look into this this week? I'd be happy to back off and let you take it if you're interested.


@gjoseph92 , also tagging you since you were the author of https://github.com/dask/distributed/pull/6714, and maybe whatever we're doing here in LightGBM is weird/unexpected and might be informative to you as you continue working on distributed.

gjoseph92 commented 1 year ago

Sorry about that. Yeah, client.restart is now stricter and verifies that the restart was actually successful.

I don't understand what you're doing with the socket, but presumably you're doing something that makes the worker block during Worker.close. I'm kinda curious what the motivation for that is / what you're trying to test (and if this is something LightGBM users would encounter and have issues with like this test does)? Another important question: what happens to the stuck worker after the test? Does it ever actually shut down, or is the process leaked and just keeps running?

Regardless, what @jameslamb said is what you'll want to match the previous behavior here:

client.restart(wait_for_workers=False)
client.wait_for_workers(n_workers=1)

gjoseph92 commented 1 year ago

Also, at some point soon-ish, I'm hoping to finish up and merge https://github.com/dask/distributed/pull/6427. This might break your test yet again.

Currently, the Nanny.kill process goes:

  1. Call Worker.close
  2. If that doesn't shut it down, send a SIGTERM to the worker. I think with your current test setup this will actually kill the worker process. But if you were running dask-worker (like users would be), it would just trigger Worker.close, so effectively a no-op.
  3. If it's still not closed after the timeout... 🤷

After that PR:

  1. Call Worker.close
  2. If that doesn't shut it down, send a SIGKILL to the worker. SIGKILL cannot be caught so the process will shut down nearly immediately (modulo it being suspended or currently waiting on a system call).
  3. If it's still not closed after the timeout, raise an error.

The key difference is that maybe what you have right now is managing to block/ignore the SIGTERM, preventing the stuck worker from shutting down. Since you can't block a SIGKILL, after this PR is merged I would expect (and hope) that client.restart would result in both workers restarting successfully, not just 1. If you're counting on the stuck worker to survive the restart, I don't think it would (it's a bug that it's able to right now).
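
To make that SIGTERM-vs-SIGKILL distinction concrete, here's a tiny standalone Python sketch (nothing LightGBM- or distributed-specific, purely illustrative): a process can install a handler that swallows SIGTERM, but there is no way for it to intercept SIGKILL.

```python
# illustrative only: a process that ignores SIGTERM but cannot survive SIGKILL
import os
import signal
import time


def stubborn_handler(signum, frame):
    # a worker stuck in a blocking call can end up behaving like this:
    # the signal arrives, but the process never actually exits
    print(f"received signal {signum}, ignoring it")


signal.signal(signal.SIGTERM, stubborn_handler)

if __name__ == "__main__":
    print(f"PID {os.getpid()}: `kill -TERM <pid>` is ignored, `kill -KILL <pid>` is always fatal")
    while True:
        time.sleep(1)
```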

jameslamb commented 1 year ago

Thanks very much for that @gjoseph92 ! And no need to apologize... my @ wasn't a passive-aggressive "hey why did you break our tests", I promise 😂

I'm kinda curious what the motivation for that is / what you're trying to test

LightGBM has its own distributed computing setup (e.g. to perform collective operations across multiple processes) that lightgbm.dask just wraps. In that setup, worker processes communicate over TCP sockets. For users in highly-restricted environments (e.g. behind a strict firewall), we provide the option to explicitly list the ports used for those sockets.

This test checks that the expected error is raised in the situation where someone specifies a port that is already in use at the time that training starts.
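
For context, the scenario that test exercises looks roughly like the sketch below. This is not the actual test code; the estimator arguments and the shape of the `machines` string are written from memory, so treat the details as assumptions. The idea is: occupy one of the ports LightGBM was told to use, then check that training fails with a `LightGBMError` instead of hanging.

```python
# rough sketch of the scenario, not the real test_machines_should_be_used_if_provided
import socket

import dask.array as da
import lightgbm as lgb
from distributed import Client, LocalCluster
from sklearn.datasets import make_blobs

if __name__ == "__main__":
    with LocalCluster(n_workers=2, threads_per_worker=1) as cluster, Client(cluster) as client:
        X, y = make_blobs(n_samples=1_000, centers=2, random_state=42)
        dX = da.from_array(X, chunks=(100, X.shape[1]))
        dy = da.from_array(y, chunks=100)

        # grab a port and keep it occupied for the duration of training
        blocker = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        blocker.bind(("127.0.0.1", 0))
        busy_port = blocker.getsockname()[1]

        # tell LightGBM to use that already-busy port for one of its workers
        machines = f"127.0.0.1:{busy_port},127.0.0.1:{busy_port + 1}"
        clf = lgb.DaskLGBMClassifier(client=client, n_estimators=5, machines=machines)
        try:
            clf.fit(dX, dy)
        except lgb.basic.LightGBMError as err:
            print(f"training failed as expected: {err}")
        finally:
            blocker.close()
```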

what happens to the stuck worker after the test? Does it ever actually shut down, or is the process leaked and just keeps running?

I suspect (and have suspected for a while) that the way we are raising errors on lightgbm.dask could be improved, and might result in leaving workers in a bad state.

https://github.com/microsoft/LightGBM/blob/f94050a4cc94909e10a0064baff11cec795eb250/python-package/lightgbm/dask.py#L664-L671

I wouldn't be surprised to learn that the process is leaked and keeps running. Since our only use of Client.restart() is in a test and since that happens only on remote machines during CI, it would be easy for us to miss a leaked process.

If you're counting on the stuck worker to survive the restart

We definitely are not, and the new behavior you've described sounds like a nice improvement to Client.restart() 😁

jmoralez commented 1 year ago

Or, alternatively, we can try to figure out why raising an error the way that this test intentionally does creates issues for Client.restart()

It's not the error itself, it's actually the other worker that gets stuck. Since the distributed training expects all machines to be ready, when one fails to bind the port the other ones just wait forever. That's what I thought restart fixed but I'm not sure anymore haha. I think one better fix would be to add a timeout to the distributed training, so that if one of the machines fails to connect, the process is interrupted in the rest of them. I'll give that a shot and let you know.

jameslamb commented 1 year ago

add a timeout to the distributed training, so that if one of the machines fails to connect, the process is interrupted in the rest of them

oh interesting! LightGBM does support that by the way --> https://lightgbm.readthedocs.io/en/v3.3.2/Parameters.html#time_out

but it defaults to 120 minutes 😬
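
For reference, since time_out is just a regular LightGBM parameter, something like this should pass it through the Dask estimator (illustrative only; I haven't tested whether it changes the behavior of the failing test):

```python
import lightgbm as lgb

# time_out is LightGBM's socket time-out, in minutes (defaults to 120)
clf = lgb.DaskLGBMClassifier(n_estimators=10, time_out=1)
```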

jmoralez commented 1 year ago

Nice! That should work, I'll try it.

jmoralez commented 1 year ago

It seems that timeout is for individual sockets; we need one for the whole network setup. This is the relevant part: https://github.com/microsoft/LightGBM/blob/f94050a4cc94909e10a0064baff11cec795eb250/src/network/linkers_socket.cpp#L196-L216

So it doesn't wait forever, it has a finite number of retries and waits longer each time. I added a log there to see how many retries it would make and there were 20. However, after those 20 tries, instead of failing the setup it assumed everything was ok and then segfaulted.

I'll keep investigating this
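
To spell out the kind of guard that seems to be missing there, here's a Python sketch (the real code is C++ in linkers_socket.cpp, and connect_with_retries is a hypothetical helper, so this is just the shape of the fix, not the fix itself): retry with growing waits, but fail loudly once the retries run out instead of carrying on as if the network were connected.

```python
# shape of the missing guard, sketched in Python rather than the actual C++
import time


def connect_with_retries(try_connect, num_retries=20, base_wait_seconds=0.1):
    """try_connect is any callable that returns True once the socket is connected."""
    wait = base_wait_seconds
    for _ in range(num_retries):
        if try_connect():
            return
        time.sleep(wait)
        wait *= 2  # wait a bit longer on every retry
    # this is the part that appears to be missing: give up instead of assuming success
    raise RuntimeError(f"could not connect after {num_retries} attempts")
```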

jmoralez commented 1 year ago

Had no luck haha. I removed the restart from the test and raised an exception when the network setup took too long, but the test gets stuck because there are pending tasks and the worker seems to be dead.

StrikerRUS commented 1 year ago

Current master fails in Linux regular and Linux bdist jobs after 1h:

##[error]The job running on agent lgbubb5e0000SJI ran longer than the maximum time of 60 minutes. For more information, see https://go.microsoft.com/fwlink/?linkid=2077134

...

============================= test session starts ==============================
platform linux -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0
rootdir: /__w/1/s
collected 719 items

../tests/c_api_test/test_.py ..                                          [  0%]
../tests/python_package_test/test_basic.py ............................. [  4%]
............................................                             [ 10%]
../tests/python_package_test/test_callback.py ............               [ 12%]
../tests/python_package_test/test_consistency.py ......                  [ 12%]
../tests/python_package_test/test_dask.py FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [ 17%]
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF...FFFFFFFFFFFFFFFFF [ 27%]
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [ 37%]
FFFFFFFFFFFFFFFF
##[error]The operation was canceled.

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=13193&view=logs&j=c28dceab-947a-5848-c21f-eef3695e5f11

StrikerRUS commented 1 year ago

Just checked the diff between dependencies in the last successful CI run and the current failing one. The only difference is in scipy:

    scipy-1.8.1                |   py39h8ba3f38_2        25.0 MB  conda-forge    (last successful run)
    scipy-1.9.0                |   py39h8ba3f38_0        26.2 MB  conda-forge    (current failing run)

jameslamb commented 1 year ago

Thanks for capturing that before the logs expire @StrikerRUS !

Unless anyone gets to it sooner, I'll investigate this tonight.

StrikerRUS commented 1 year ago

It's quite strange that only Linux regular and Linux bdist are affected. scipy has been upgraded in Linux mpi_source, for example, as well...

StrikerRUS commented 1 year ago

Seems that pinning the scipy version works.

StrikerRUS commented 1 year ago

scipy 1.9.0 changelog: http://scipy.github.io/devdocs/release.1.9.0.html

The only item there that in my opinion can be related to our issue is

SciPy switched to Meson as its build system

Specifically,

The build defaults to using OpenBLAS

jameslamb commented 1 year ago

That could definitely be related, especially if something on the dask-specific code paths leads to conflicting shared-library-loading (like some of the issues we've talked about in #5106).

jameslamb commented 1 year ago

I was able to reproduce this in Docker tonight, using the same image and environment variables as the Linux bdist job.

# Linux bdist
docker run \
    --rm \
    --env AZURE='true' \
    --env BUILD_DIRECTORY=/opt/LightGBM \
    --env CONDA_ENV='test-env' \
    --env LGB_VER=$(head -n 1 VERSION.txt) \
    --env PYTHON_VERSION='3.8' \
    --env TASK=bdist \
    --workdir=/opt/LightGBM \
    -v $(pwd):/opt/LightGBM \
    -it lightgbm/vsts-agent:ubuntu-14.04 \
    /bin/bash

export PATH="$CONDA/bin":${PATH}
${BUILD_DIRECTORY}/.ci/setup.sh
${BUILD_DIRECTORY}/.ci/test.sh

With those commands, as of latest master, scipy==1.8.1 is installed and all Dask tests succeed.

Then, I upgraded to scipy==1.9.0 and the Dask tests failed.

conda install \
    --yes \
    --name test-env \
    'scipy==1.9.0'
source activate test-env

# just target a single failing test, for faster feedback
pytest \
    -v \
    "$BUILD_DIRECTORY/tests/python_package_test/test_dask.py::test_classifier[data-rf-binary-classification-scipy_csr_matrix]"

Running it like that revealed the underlying error!

>   self._handle = _dlopen(self._name, mode)
E   OSError: dlopen: cannot load any more object with static TLS

We've seen that issue in the Linux jobs on Ubuntu 14.04 before.

And as noted in https://github.com/conda-forge/conda-forge.github.io/issues/1551

This has some unfortunate side effects like how changing the import order between libraries makes the error appear / go away. Such kinds of accidents lead to the proliferation of unfortunate (because: randomly working or not) advice...

but, on the other hand, in https://github.com/pytorch/pytorch/issues/2575#issuecomment-523657178 (which is linked from that conda-forge issue)

importing something that gets libgomp early enough should be the recommended workaround.

(supported by https://github.com/pytorch/pytorch/issues/2575#issuecomment-773693437)
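
In LightGBM's test environment, that "import early" workaround might look something like the snippet below. This is untested, and whether this exact import ordering actually helps on the Ubuntu 14.04 image is an open question; it's only meant to illustrate the idea from those threads.

```python
# conftest.py (untested sketch): load the libraries that bring their own OpenMP
# runtime before anything else in the test session, so the limited static-TLS
# slots on old glibc are claimed early rather than exhausted mid-run
import lightgbm  # noqa: F401
import sklearn  # noqa: F401
import scipy  # noqa: F401
```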

It's suggested in those linked issues (and in @StrikerRUS 's comment) that this error can be avoided by using a version of glibc > 2.21.

Unfortunately, it looks like the Ubuntu 14.04 image we have uses glibc 2.19.

docker run \
    --rm \
    -it lightgbm/vsts-agent:ubuntu-14.04 \
    ldd --version
ldd (Ubuntu EGLIBC 2.19-0ubuntu6.15) 2.19

So given that... maybe #5022 "worked" for so long because it pulled in different versions of some dependencies, which changed the library-loading order at import time... and now the new scipy release has shifted that loading order slightly, causing this issue to reoccur?

jameslamb commented 1 year ago

I tried running the tests with LD_DEBUG=libs set (see here for docs on that variable), to see what was being searched.

Like this.

LD_DEBUG=libs \
pytest \
    -v \
    "$BUILD_DIRECTORY/tests/python_package_test/test_dask.py::test_classifier[data-rf-binary-classification-scipy_csr_matrix]" \
> ./out.txt 2>&1

Here's a sample of what is produced:

      2622: find library=libpthread.so.0 [0]; searching
      2622:  search path=/opt/miniforge/envs/test-env/bin/../lib/tls/x86_64:/opt/miniforge/envs/test-env/bin/../lib/tls:/opt/miniforge/envs/test-env/bin/../lib/x86_64:/opt/miniforge/envs/test-env/bin/../lib      (RPATH from file /opt/miniforge/envs/test-env/bin/python)
      2622:   trying file=/opt/miniforge/envs/test-env/bin/../lib/tls/x86_64/libpthread.so.0
      2622:   trying file=/opt/miniforge/envs/test-env/bin/../lib/tls/libpthread.so.0
      2622:   trying file=/opt/miniforge/envs/test-env/bin/../lib/x86_64/libpthread.so.0
      2622:   trying file=/opt/miniforge/envs/test-env/bin/../lib/libpthread.so.0
      2622:  search cache=/etc/ld.so.cache
      2622:   trying file=/lib/x86_64-linux-gnu/libpthread.so.0
      2622: 
      2622: find library=libdl.so.2 [0]; searching
      2622:  search path=/opt/miniforge/envs/test-env/bin/../lib        (RPATH from file /opt/miniforge/envs/test-env/bin/python)
      2622:   trying file=/opt/miniforge/envs/test-env/bin/../lib/libdl.so.2
      2622:  search cache=/etc/ld.so.cache
      2622:   trying file=/lib/x86_64-linux-gnu/libdl.so.2

Haven't had a chance to compare those outputs yet (they require a bit of modification to compare in a text differ), but wanted to put that command up here in case it's useful for others investigating this or finding this issue from search engines.
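
For anyone who wants to do that comparison, a small filter along these lines covers the "bit of modification" I mean (assuming the leading PID column is the main noise): keep only the library-resolution lines and strip the prefix so two runs diff cleanly.

```python
# usage: python filter_ld_debug.py out.txt > libs_resolved.txt
import re
import sys

with open(sys.argv[1]) as f:
    for line in f:
        if "find library=" in line or "trying file=" in line:
            # drop the leading "  <pid>: " prefix so runs with different PIDs line up
            print(re.sub(r"^\s*\d+:\s*", "", line.rstrip()))
```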

StrikerRUS commented 1 year ago

@jameslamb Thank you very much for your investigation!

Maybe it's time to bump minimum required glibc version (2.14 -> 2.23) in LightGBM 4.0.0 and switch to Ubuntu 16.04 image for artifacts production? https://en.wikipedia.org/wiki/Glibc#Version_history

UPD: Or even better to one of the standard manylinux PyPI images (linking https://github.com/microsoft/LightGBM/issues/4484#issuecomment-889045289 and https://github.com/microsoft/LightGBM/pull/3421#issuecomment-711559332).

jameslamb commented 1 year ago

@StrikerRUS sorry it took me so long to respond to this!

I agree with the proposal to bump the minimum glibc version and to start publishing wheels using one of the standard manylinux images.

jameslamb commented 1 year ago

start publishing wheels using one of the standard manylinux images

@StrikerRUS I just returned from traveling and have some time to work on this.

But I saw that you've been pushing commits directly to dev: https://github.com/guolinke/lightgbm-ci-docker/compare/dev (noticed because of https://github.com/microsoft/LightGBM/pull/5252#issuecomment-1229541646).

Would you like me to wait until you're done with that effort to support the aarch64 wheels? And then would you like me to make a PR into dev in that repo with proposed changes to the image?

StrikerRUS commented 1 year ago

@jameslamb

have some time to work on this.

Nice to hear! 🙂

Would you like me to wait until you're done with that effort to support the aarch64 wheels?

It depends on how long it will take to merge #5252. I believe we need at least one more review there.

I think I can configure pushes to Docker Hub for a new branch, let's say dev-2, and you can test your ideas in that branch right now.

StrikerRUS commented 1 year ago

@jameslamb Done! Feel free to do your experiments in the dev2 branch.

You can check the diff for the .github/workflows/publish_image.yml file for how to make it possible to push multiple Dockerfiles into Docker Hub.

StrikerRUS commented 1 year ago

BTW, I guess we need to create a separate issue for discussing the migration to official manylinux image.

jameslamb commented 1 year ago

separate issue for discussing migration to official manylinux image

opened #5514

jameslamb commented 1 year ago

I'm working on this in #5597.

jameslamb commented 1 year ago

I've stopped working on this. Will post here if I start investigating it again in the future.

jameslamb commented 2 months ago

Just noting that even though this discussion is very old and references an old version of Dask, it shouldn't be closed until we've restored this unit test:

https://github.com/microsoft/LightGBM/blob/14435485bd7e9c4d72ab43bd269c33fc4230212f/tests/python_package_test/test_dask.py#L1339

Contributions welcome if anyone with Dask knowledge wants to investigate that!