ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Unable to use Ray within Replit (GLIBC issue) #38155

Open robertnishihara opened 1 year ago

robertnishihara commented 1 year ago

What happened + What you expected to happen

I tried running Ray inside of Replit.

Replit installed Ray 2.6.2.

Running ray.init() in a Python interpreter leads to the GCS failing to start:

$ cat /tmp/ray/session_2023-08-05_22-52-22_517985_13783/logs/gcs_server.err 
/home/runner/HightechUniqueProgrammingmacro/venv/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /nix/store/mdck89nsfisflwjv6xv8ydj7dj0sj2pn-gcc-11.3.0-lib/lib/libstdc++.so.6)
/home/runner/HightechUniqueProgrammingmacro/venv/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /nix/store/mdck89nsfisflwjv6xv8ydj7dj0sj2pn-gcc-11.3.0-lib/lib/libstdc++.so.6)
/home/runner/HightechUniqueProgrammingmacro/venv/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /nix/store/mdck89nsfisflwjv6xv8ydj7dj0sj2pn-gcc-11.3.0-lib/lib/libstdc++.so.6)
/home/runner/HightechUniqueProgrammingmacro/venv/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /nix/store/mdck89nsfisflwjv6xv8ydj7dj0sj2pn-gcc-11.3.0-lib/lib/libgcc_s.so.1)

Interestingly, starting the GCS server binary directly from the shell seems to work:

$ /home/runner/HightechUniqueProgrammingmacro/venv/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server
[2023-08-05 22:54:06,501 I 14610 14610] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-08-05 22:54:06,503 I 14610 14610] (gcs_server) gcs_server.cc:74: GCS storage type is StorageType::IN_MEMORY
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:44: Loading job table data.
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:56: Loading node table data.
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:68: Loading cluster resources table data.
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:95: Loading actor table data.
...

[full output in a subsequent comment, but probably not relevant]

However, launching the same binary from Python via subprocess fails with the same GLIBC errors:

~/HightechUniqueProgrammingmacro$ python
Python 3.10.8 (main, Oct 11 2022, 11:35:05) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> subprocess.check_call(['/home/runner/HightechUniqueProgrammingmacro/venv/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server'])
/home/runner/HightechUniqueProgrammingmacro/venv/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /nix/store/mdck89nsfisflwjv6xv8ydj7dj0sj2pn-gcc-11.3.0-lib/lib/libstdc++.so.6)
/home/runner/HightechUniqueProgrammingmacro/venv/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /nix/store/mdck89nsfisflwjv6xv8ydj7dj0sj2pn-gcc-11.3.0-lib/lib/libstdc++.so.6)
/home/runner/HightechUniqueProgrammingmacro/venv/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /nix/store/mdck89nsfisflwjv6xv8ydj7dj0sj2pn-gcc-11.3.0-lib/lib/libstdc++.so.6)
/home/runner/HightechUniqueProgrammingmacro/venv/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /nix/store/mdck89nsfisflwjv6xv8ydj7dj0sj2pn-gcc-11.3.0-lib/lib/libgcc_s.so.1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nix/store/hd4cc9rh83j291r5539hkf6qd8lgiikb-python3-3.10.8/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/runner/HightechUniqueProgrammingmacro/venv/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server']' returned non-zero exit status 1.
>>> 
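
Since the same binary starts from the shell but fails when spawned from Python, one thing that may be worth comparing is the dynamic-loader environment each process sees. The sketch below is only an illustration: `loader_env` and `env_without` are hypothetical helper names (not part of Ray), and treating `LD_LIBRARY_PATH` or `LD_PRELOAD` as suspects is an assumption, not a confirmed cause.

```python
import os


def loader_env(environ):
    """Return only the variables whose names start with LD_ (e.g. LD_LIBRARY_PATH)."""
    return {key: value for key, value in environ.items() if key.startswith("LD_")}


def env_without(environ, names):
    """Return a copy of the environment mapping with the given variable names removed."""
    dropped = set(names)
    return {key: value for key, value in environ.items() if key not in dropped}


if __name__ == "__main__":
    # Print what the Python process would hand to a child process by default.
    # Running `env | grep '^LD_'` in the shell gives the other side of the comparison.
    for key, value in sorted(loader_env(os.environ).items()):
        print(f"{key}={value}")
```

One could then pass `env=env_without(os.environ, ["LD_LIBRARY_PATH"])` to `subprocess.check_call` and see whether the GLIBC errors change. Whether they do has not been verified here.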

Versions / Dependencies

Ray 2.6.2, Python 3.10.8, Ubuntu 20.04.2

~/HightechUniqueProgrammingmacro$ ldd --version
ldd (GNU libc) 2.35
~/HightechUniqueProgrammingmacro$ ldd $PYTHONBIN
    linux-vdso.so.1 (0x00007f44f09a6000)
    libpython3.10.so.1.0 => /nix/store/hd4cc9rh83j291r5539hkf6qd8lgiikb-python3-3.10.8/lib/libpython3.10.so.1.0 (0x00007f44f05bb000)
    libcrypt.so.1 => /nix/store/xf0ssp8s6xjz710q33hspj5dphqhmmc1-libxcrypt-4.4.30/lib/libcrypt.so.1 (0x00007f44f0580000)
    libdl.so.2 => /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libdl.so.2 (0x00007f44f057b000)
    libm.so.6 => /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libm.so.6 (0x00007f44f049b000)
    libgcc_s.so.1 => /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libgcc_s.so.1 (0x00007f44f047f000)
    libc.so.6 => /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libc.so.6 (0x00007f44f0276000)
    /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/ld-linux-x86-64.so.2 => /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib64/ld-linux-x86-64.so.2 (0x00007f44f09a8000)
~/HightechUniqueProgrammingmacro$ ldd /home/runner/HightechUniqueProgrammingmacro/venv/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server
    linux-vdso.so.1 (0x00007ff720984000)
    libpthread.so.0 => /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libpthread.so.0 (0x00007ff720979000)
    libm.so.6 => /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libm.so.6 (0x00007ff720899000)
    librt.so.1 => /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/librt.so.1 (0x00007ff720894000)
    libstdc++.so.6 => not found
    libgcc_s.so.1 => /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libgcc_s.so.1 (0x00007ff72087a000)
    libc.so.6 => /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libc.so.6 (0x00007ff71f5f7000)
    /lib64/ld-linux-x86-64.so.2 => /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib64/ld-linux-x86-64.so.2 (0x00007ff720986000)
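
The `libstdc++.so.6 => not found` entry in the `ldd` output for `gcs_server` is easy to miss among the resolved libraries. A minimal sketch for pulling out unresolved libraries, assuming the usual `name => target` line format; `missing_libs` is a hypothetical helper, and the sample text is abbreviated from the output above.

```python
def missing_libs(ldd_output):
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Lines look like "libfoo.so.1 => /path/libfoo.so.1 (0x...)" or "libfoo.so.1 => not found".
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


# Three lines abbreviated from the ldd output of gcs_server shown above.
SAMPLE = """\
    librt.so.1 => /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/librt.so.1 (0x00007ff720894000)
    libstdc++.so.6 => not found
    libgcc_s.so.1 => /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libgcc_s.so.1 (0x00007ff72087a000)
"""

if __name__ == "__main__":
    print(missing_libs(SAMPLE))
```

In practice the input could come from `subprocess.run(["ldd", path], capture_output=True, text=True).stdout`, where `path` is the binary being inspected.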

Reproduction script

Just

import ray
ray.init()

Issue Severity

None

robertnishihara commented 1 year ago

Posting this since I ran into it, and in case we're doing something incorrectly here.

robertnishihara commented 1 year ago

Full output from running the GCS server directly:

$ /home/runner/HightechUniqueProgrammingmacro/venv/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server
[2023-08-05 22:54:06,501 I 14610 14610] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-08-05 22:54:06,503 I 14610 14610] (gcs_server) gcs_server.cc:74: GCS storage type is StorageType::IN_MEMORY
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:44: Loading job table data.
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:56: Loading node table data.
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:68: Loading cluster resources table data.
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:95: Loading actor table data.
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:108: Loading actor task spec table data.
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:81: Loading placement group table data.
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:48: Finished loading job table data, size = 0
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:60: Finished loading node table data, size = 0
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:72: Finished loading cluster resources table data, size = 0
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:99: Finished loading actor table data, size = 0
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:112: Finished loading actor task spec table data, size = 0
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:86: Finished loading placement group table data, size = 0
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_server.cc:164: No existing server cluster ID found. Generating new ID: 77040f672047927100a0eb82ccc1e2fd89eaeab9536593fcd89e37a2
[2023-08-05 22:54:06,505 I 14610 14610] (gcs_server) gcs_server.cc:658: Autoscaler V2 enabled: 0
[2023-08-05 22:54:06,507 I 14610 14610] (gcs_server) grpc_server.cc:129: GcsServer server started, listening on port 35893.
[2023-08-05 22:54:06,568 I 14610 14610] (gcs_server) gcs_server.cc:255: GcsNodeManager: 
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0
- GetInternalConfig request count: 0

GcsActorManager: 
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0

GcsResourceManager: 
- GetResources request count: 0
- GetAllAvailableResources request count0
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 0

GcsPlacementGroupManager: 
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0

GcsPublisher {}

[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:

GcsTaskManager: 
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Total num bytes of task event stored: 0MiB
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0

[2023-08-05 22:54:06,568 I 14610 14610] (gcs_server) gcs_server.cc:872: Event stats:

Global stats: 30 total (16 active)
Queueing time: mean = 6.392 ms, max = 63.767 ms, min = 2.712 us, total = 191.748 ms
Execution time:  mean = 2.130 ms, total = 63.912 ms
Event stats:
    GcsInMemoryStore.GetAll - 6 total (0 active), CPU time: mean = 14.297 us, total = 85.780 us
    InternalKVGcsService.grpc_client.InternalKVPut - 6 total (6 active), CPU time: mean = 0.000 s, total = 0.000 s
    InternalKVGcsService.grpc_server.InternalKVPut - 6 total (4 active), CPU time: mean = 3.946 us, total = 23.673 us
    GcsInMemoryStore.Put - 5 total (2 active), CPU time: mean = 12.756 ms, total = 63.779 ms
    PeriodicalRunner.RunFnPeriodically - 4 total (2 active, 1 running), CPU time: mean = 1.605 us, total = 6.422 us
    GcsInMemoryStore.Get - 1 total (0 active), CPU time: mean = 17.155 us, total = 17.155 us
    UNKNOWN - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    RayletLoadPulled - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s

[2023-08-05 22:54:06,568 I 14610 14610] (gcs_server) gcs_server.cc:873: GcsTaskManager Event stats:

Global stats: 0 total (0 active)
Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
Execution time:  mean = -nan s, total = 0.000 s
Event stats:

[2023-08-05 22:54:16,513 W 14610 14611] (gcs_server) metric_exporter.cc:212: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2023-08-05 22:55:06,569 I 14610 14610] (gcs_server) gcs_server.cc:255: GcsNodeManager: 
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0
- GetInternalConfig request count: 0

GcsActorManager: 
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0

GcsResourceManager: 
- GetResources request count: 0
- GetAllAvailableResources request count0
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 0

GcsPlacementGroupManager: 
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0

GcsPublisher {}

[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:

GcsTaskManager: 
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Total num bytes of task event stored: 0MiB
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0

[2023-08-05 22:55:06,569 I 14610 14610] (gcs_server) gcs_server.cc:872: Event stats:

Global stats: 317 total (4 active)
Queueing time: mean = 670.068 us, max = 63.767 ms, min = 2.712 us, total = 212.411 ms
Execution time:  mean = 215.844 us, total = 68.422 ms
Event stats:
    GcsInMemoryStore.Put - 75 total (0 active), CPU time: mean = 863.273 us, total = 64.745 ms
    InternalKVGcsService.grpc_client.InternalKVPut - 72 total (0 active), CPU time: mean = 11.649 us, total = 838.705 us
    InternalKVGcsService.grpc_server.InternalKVPut - 72 total (0 active), CPU time: mean = 10.209 us, total = 735.036 us
    RayletLoadPulled - 60 total (1 active), CPU time: mean = 6.621 us, total = 397.239 us
    UNKNOWN - 20 total (1 active), CPU time: mean = 7.575 us, total = 151.510 us
    GCSServer.deadline_timer.debug_state_dump - 6 total (1 active), CPU time: mean = 174.848 us, total = 1.049 ms
    GcsInMemoryStore.GetAll - 6 total (0 active), CPU time: mean = 14.297 us, total = 85.780 us
    PeriodicalRunner.RunFnPeriodically - 4 total (0 active), CPU time: mean = 100.618 us, total = 402.474 us
    GcsInMemoryStore.Get - 1 total (0 active), CPU time: mean = 17.155 us, total = 17.155 us
    GCSServer.deadline_timer.debug_state_event_stats_print - 1 total (1 active, 1 running), CPU time: mean = 0.000 s, total = 0.000 s

[2023-08-05 22:55:06,569 I 14610 14610] (gcs_server) gcs_server.cc:873: GcsTaskManager Event stats:

Global stats: 0 total (0 active)
Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
Execution time:  mean = -nan s, total = 0.000 s
Event stats:

anyscalesam commented 1 year ago

@jjyao are you picking this up for Ray 2.9?

anyscalesam commented 8 months ago

Reviewed: we weren't able to repro this. @jjyao to re-verify; if that holds, then we can close.

aslonnie commented 5 months ago

It now fails with a different error:

/home/runner/raytest/.pythonlibs/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server: error while loading shared libraries: /home/runner/raytest/.pythonlibs/lib/python3.10/site-packages/ray/core/libjemalloc.so: cannot allocate memory in static TLS block