modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

BUG: Behaviour differs if ray is imported and/or initialized manually. Without manual import it fails #7359

Closed Liquidmasl closed 3 months ago

Liquidmasl commented 3 months ago

Modin version checks

Reproducible Example

# I tried to get close to my actual data here, but the exact breaking point will depend on the hardware, I assume

import os
os.environ["MODIN_ENGINE"] = "ray"

# import ray
# ray.init()

import modin.pandas as pd
import numpy as np

def generate_dataframes(num_dataframes, num_rows, num_columns):
    dataframes = []
    for _ in range(num_dataframes):
        df = pd.DataFrame(np.random.rand(num_rows, num_columns))
        dataframes.append(df)
    return dataframes

# Parameters
num_dataframes = 1200
num_rows = 1000000
num_columns = 12
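# Rough estimate: 1200 DataFrames x 1,000,000 rows x 12 float64 columns
# is about 115 GB of raw data in total, far more than a laptop's RAM.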

dataframes = generate_dataframes(num_dataframes, num_rows, num_columns)

big_df = pd.concat(dataframes, ignore_index=True)

big_df.to_parquet('big_df.parquet')

Issue Description

I am writing this a second time because my computer bluescreened the first time (and not for the first time while troubleshooting this issue).

I am running Modin with Ray on my laptop, no cluster.

I am loading a file in batches, creating DataFrames from the batches, and concatenating them. Then I try to save the result using .to_parquet().

Depending on how and what I import, I get different results:

Importing only Modin:

import modin.pandas as pd

Concatenating the DataFrames takes a very long time and needs a lot of RAM. Calling .to_parquet() fills up my RAM and page file and then fails with a killed raylet.

Importing ray manually:

import ray
import modin.pandas as pd

Concatenating works just as before; trying to save to Parquet fills RAM and page file (violently) until my computer freezes or bluescreens. The freeze is so severe that the only thing I can still do is cut the power.

Importing ray manually AND calling ray.init():

import ray
ray.init()
import modin.pandas as pd

This sadly initializes 2 Ray instances, but it leads to WAY faster DataFrame concatenation, and saving with to_parquet() also works (most of the time). Both operations also barely fill up my RAM (the page file is chilling). I did manage to bluescreen here once as well, though that might be unrelated. It creates a .parquet folder containing 20 Parquet files of about 1.5 GB each.
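For reference, here is a minimal sketch of this third variant. The explicit object_store_memory cap and the include_dashboard flag are only illustrative additions showing where such limits would go; they are not values I actually passed:

```python
import os
os.environ["MODIN_ENGINE"] = "ray"

import ray

# Illustrative only: cap the plasma object store so Ray starts spilling to disk
# well before RAM and the page file are exhausted. Tune to the actual machine.
ray.init(
    object_store_memory=8 * 1024**3,  # 8 GiB, placeholder value
    include_dashboard=True,           # the dashboard I used for debugging
)

import modin.pandas as pd

# Sanity check: see whether Modin reuses the already-running Ray instance
# or ends up starting a second one.
print(ray.is_initialized())
print(ray.cluster_resources())
```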

What's going on here? I only found this out because I needed the Ray dashboard to debug, so I called init manually, and then the issue was gone.

Also, on a different note: I can't actually load the .parquet folder that I just saved; it instantly runs OOM, even though the same data in a single .parquet file loads flawlessly in a second. What's up with that? Related to https://github.com/modin-project/modin/issues/7020#issuecomment-2263358385 ?
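As a cross-check (a pyarrow sketch, not something from my original run), the folder Modin wrote can at least be inspected file by file without loading everything at once:

```python
import pyarrow.parquet as pq

# 'big_df.parquet' is the directory Modin wrote (20 files of ~1.5 GB each).
# Constructing the dataset only touches metadata, not the full data.
dataset = pq.ParquetDataset("big_df.parquet")
print(len(dataset.files))  # number of Parquet files in the folder
print(dataset.schema)      # combined schema

# Reading a single file stays well within laptop RAM,
# unlike loading the whole folder at once.
first_part = pq.read_table(dataset.files[0])
print(first_part.num_rows, first_part.num_columns)
```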

Expected Behavior

I would expect the default import to lead to good results. As it seems, something is not adding up.

Error Logs


Installed Versions

```
INSTALLED VERSIONS
------------------
commit                : c8bbca8e4e00c681370e3736b2f73bb0352408c3
python                : 3.11.8.final.0
python-bits           : 64
OS                    : Windows
OS-release            : 10
Version               : 10.0.22631
machine               : AMD64
processor             : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder             : little
LC_ALL                : None
LANG                  : None
LOCALE                : English_Austria.1252

Modin dependencies
------------------
modin                 : 0.31.0
ray                   : 2.34.0
dask                  : 2024.7.1
distributed           : 2024.7.1

pandas dependencies
-------------------
pandas                : 2.2.2
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.9.0.post0
setuptools            : 68.2.2
pip                   : 24.1.2
Cython                : 0.29.37
pytest                : 8.2.0
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.3
IPython               : 8.23.0
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : 2024.3.1
gcsfs                 : None
matplotlib            : 3.8.2
numba                 : 0.60.0
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : 15.0.2
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.13.0
sqlalchemy            : 2.0.29
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None
```
Liquidmasl commented 3 months ago

I somehow can't reproduce my success case anymore with real data. I don't understand what changed, but I can't save to Parquet anymore after all.

raylet.out tail:

``` [2024-08-05 17:47:19,270 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 20387 MiB, 11869 objects, read throughput 193 MiB/s [2024-08-05 17:47:23,156 I 9128 18496] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 71986 MiB, 24698 objects, write throughput 786 MiB/s. [2024-08-05 17:47:23,165 I 9128 32964] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000022AFDE70000, 4294967304) [2024-08-05 17:47:23,181 I 9128 18496] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-05 17:47:23,689 I 9128 18496] (raylet.exe) node_manager.cc:656: Sending Python GC request to 29 local workers to clean up Python cyclic references. [2024-08-05 17:47:25,362 I 9128 32964] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2221201143 [2024-08-05 17:47:25,369 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 20393 MiB, 11873 objects, read throughput 183 MiB/s [2024-08-05 17:47:26,306 I 9128 18496] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-05 17:47:26,370 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 21410 MiB, 12465 objects, read throughput 190 MiB/s [2024-08-05 17:47:26,525 I 9128 18496] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-05 17:47:27,030 I 9128 32964] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000022F45EC0000, 8589934600) [2024-08-05 17:47:27,461 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 21565 MiB, 12555 objects, read throughput 190 MiB/s [2024-08-05 17:47:28,464 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 22714 MiB, 13224 objects, read throughput 198 MiB/s [2024-08-05 17:47:29,468 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 23631 MiB, 13758 objects, read throughput 204 MiB/s [2024-08-05 17:47:30,471 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 24823 MiB, 14452 objects, read throughput 213 MiB/s [2024-08-05 17:47:31,476 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 25959 MiB, 15113 objects, read throughput 220 MiB/s [2024-08-05 17:47:32,298 I 9128 32964] (raylet.exe) store.cc:513: Plasma store at capacity ========== Plasma store: ================= Current usage: 12.5164 / 16.3946 GB - num bytes created total: 110237068996 1 pending objects of total size 2118MB - objects spillable: 1 - bytes spillable: 2221201143 - objects unsealed: 2 - bytes unsealed: 3602226 - objects in use: 5719 - bytes in use: 12516370955 - objects evictable: 0 - bytes evictable: 0 - objects created by worker: 3 - bytes created by worker: 2221211049 - objects restored: 5716 - bytes restored: 10295159906 - objects received: 0 - bytes received: 0 - objects errored: 0 - bytes errored: 0 [2024-08-05 17:47:33,749 I 9128 18496] (raylet.exe) node_manager.cc:656: Sending Python GC request to 29 local workers to clean up Python cyclic references. 
[2024-08-05 17:47:38,373 I 9128 32964] (raylet.exe) store.cc:564: ========== Plasma store: ================= Current usage: 12.5164 / 16.3946 GB - num bytes created total: 110237068996 6 pending objects of total size 4243MB - objects spillable: 1 - bytes spillable: 2221201143 - objects unsealed: 0 - bytes unsealed: 0 - objects in use: 5719 - bytes in use: 12516370955 - objects evictable: 0 - bytes evictable: 0 - objects created by worker: 3 - bytes created by worker: 2221211049 - objects restored: 5716 - bytes restored: 10295159906 - objects received: 0 - bytes received: 0 - objects errored: 0 - bytes errored: 0 [2024-08-05 17:47:38,533 I 9128 18496] (raylet.exe) node_manager.cc:525: [state-dump] NodeManager: [state-dump] Node ID: 769d8d10c33d0aa65d673cc59ae36aa2a415bc98bedc9abf9c87740d [state-dump] Node name: 127.0.0.1 [state-dump] InitialConfigResources: {node:127.0.0.1: 10000, GPU: 10000, CPU: 200000, accelerator_type:G: 10000, node:__internal_head__: 10000, object_store_memory: 163945525240000, memory: 327891050510000} [state-dump] ClusterTaskManager: [state-dump] ========== Node: 769d8d10c33d0aa65d673cc59ae36aa2a415bc98bedc9abf9c87740d ================= [state-dump] Infeasible queue length: 0 [state-dump] Schedule queue length: 0 [state-dump] Dispatch queue length: 0 [state-dump] num_waiting_for_resource: 0 [state-dump] num_waiting_for_plasma_memory: 0 [state-dump] num_waiting_for_remote_node_resources: 0 [state-dump] num_worker_not_started_by_job_config_not_exist: 0 [state-dump] num_worker_not_started_by_registration_timeout: 0 [state-dump] num_tasks_waiting_for_workers: 0 [state-dump] num_cancelled_tasks: 0 [state-dump] cluster_resource_scheduler state: [state-dump] Local id: 3413610774594749257 Local resources: {"total":{GPU: [10000], memory: [327891050510000], node:__internal_head__: [10000], node:127.0.0.1: [10000], accelerator_type:G: [10000], object_store_memory: [163945525240000], CPU: [200000]}}, "available": {memory: [327891050510000], GPU: [10000], node:__internal_head__: [10000], node:127.0.0.1: [10000], accelerator_type:G: [10000], object_store_memory: [141733447690000], CPU: [170000]}}, "labels":{"ray.io/node_id":"769d8d10c33d0aa65d673cc59ae36aa2a415bc98bedc9abf9c87740d",} is_draining: 0 is_idle: 0 Cluster resources: node id: 3413610774594749257{"total":{accelerator_type:G: 10000, object_store_memory: 163945525240000, memory: 327891050510000, GPU: 10000, node:127.0.0.1: 10000, node:__internal_head__: 10000, CPU: 200000}}, "available": {accelerator_type:G: 10000, object_store_memory: 141733447690000, node:127.0.0.1: 10000, node:__internal_head__: 10000, CPU: 170000, memory: 327891050510000, GPU: 10000}}, "labels":{"ray.io/node_id":"769d8d10c33d0aa65d673cc59ae36aa2a415bc98bedc9abf9c87740d",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1} { "placment group locations": [], "node to bundles": []} [state-dump] Waiting tasks size: 17 [state-dump] Number of executing tasks: 3 [state-dump] Number of pinned task arguments: 3704 [state-dump] Number of total spilled tasks: 0 [state-dump] Number of spilled waiting tasks: 0 [state-dump] Number of spilled unschedulable tasks: 0 [state-dump] Resource usage { [state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func pid=32748 worker_id=6ddb3c2d71dd94e0f0088bf37e3fcf1a670358cc7b455f4c1fee6b21): {CPU: 10000} [state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func pid=38700 worker_id=0ed61c2030588887b62701984884036d6394165acc00d7284cb41d07): {CPU: 10000} [state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func 
pid=26356 worker_id=10571ed2b2a722f937d05a922ea592292840df519e1ae3aa25732572): {CPU: 10000} [state-dump] } [state-dump] Running tasks by scheduling class: [state-dump] - {depth=1 function_descriptor={type=PythonFunctionDescriptor, module_name=modin.core.execution.ray.implementations.pandas_on_ray.partitioning.virtual_partition, class_name=, function_name=_deploy_ray_func, function_hash=2f63268387a243db90a5dac06c8ea5d7} scheduling_strategy=default_scheduling_strategy { [state-dump] } [state-dump] resource_set={CPU : 1, }}: 3/20 [state-dump] ================================================== [state-dump] [state-dump] ClusterResources: [state-dump] LocalObjectManager: [state-dump] - num pinned objects: 1 [state-dump] - pinned objects size: 6612 [state-dump] - num objects pending restore: 1688 [state-dump] - num objects pending spill: 1 [state-dump] - num bytes pending spill: 2221201143 [state-dump] - num bytes currently spilled: 62177759756 [state-dump] - cumulative spill requests: 24698 [state-dump] - cumulative restore requests: 15596 [state-dump] - spilled objects pending delete: 0 [state-dump] [state-dump] ObjectManager: [state-dump] - num local objects: 5719 [state-dump] - num unfulfilled push requests: 0 [state-dump] - num object pull requests: 9885 [state-dump] - num chunks received total: 0 [state-dump] - num chunks received failed (all): 0 [state-dump] - num chunks received failed / cancelled: 0 [state-dump] - num chunks received failed / plasma error: 0 [state-dump] Event stats: [state-dump] Global stats: 8 total (0 active) [state-dump] Queueing time: mean = 32.360 us, max = 80.258 us, min = 10.471 us, total = 258.883 us [state-dump] Execution time: mean = 4.537 us, total = 36.296 us [state-dump] Event stats: [state-dump] ObjectManager.FreeObjects - 8 total (0 active), Execution time: mean = 4.537 us, total = 36.296 us, Queueing time: mean = 32.360 us, max = 80.258 us, min = 10.471 us, total = 258.883 us [state-dump] PushManager: [state-dump] - num pushes in flight: 0 [state-dump] - num chunks in flight: 0 [state-dump] - num chunks remaining: 0 [state-dump] - max chunks allowed: 409 [state-dump] OwnershipBasedObjectDirectory: [state-dump] - num listeners: 9885 [state-dump] - cumulative location updates: 31436 [state-dump] - num location updates per second: 0.000 [state-dump] - num location lookups per second: 0.000 [state-dump] - num locations added per second: 0.000 [state-dump] - num locations removed per second: 0.000 [state-dump] BufferPool: [state-dump] - create buffer state map size: 0 [state-dump] PullManager: [state-dump] - num bytes available for pulled objects: 3878181569 [state-dump] - num bytes being pulled (all): 6667728944 [state-dump] - num bytes being pulled / pinned: 3627450788 [state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable} [state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable} [state-dump] - task request bundles: BundlePullRequestQueue{17 total, 3 active, 14 inactive, 0 unpullable} [state-dump] - first get request bundle: N/A [state-dump] - first wait request bundle: N/A [state-dump] - first task request bundle: 1236 objects, 2222582914 bytes (inactive, waiting for capacity) [state-dump] - num objects queued: 9885 [state-dump] - num objects actively pulled (all): 3704 [state-dump] - num objects actively pulled / pinned: 2016 [state-dump] - num bundles being pulled: 3 [state-dump] - num pull retries: 15509 [state-dump] - max timeout seconds: 40 [state-dump] - 
max timeout request is already processed. No entry. [state-dump] - example obj id pending pull: 00ffffffffffffffffffffffffffffffffffffff010000000b23f605 [state-dump] [state-dump] WorkerPool: [state-dump] - registered jobs: 1 [state-dump] - process_failed_job_config_missing: 0 [state-dump] - process_failed_rate_limited: 0 [state-dump] - process_failed_pending_registration: 0 [state-dump] - process_failed_runtime_env_setup_failed: 0 [state-dump] - num PYTHON workers: 28 [state-dump] - num PYTHON drivers: 1 [state-dump] - num object spill callbacks queued: 0 [state-dump] - num object restore queued: 1684 [state-dump] - num util functions queued: 0 [state-dump] - num idle workers: 17 [state-dump] TaskDependencyManager: [state-dump] - task deps map size: 17 [state-dump] - get req map size: 0 [state-dump] - wait req map size: 0 [state-dump] - local objects map size: 5719 [state-dump] WaitManager: [state-dump] - num active wait requests: 0 [state-dump] Subscriber: [state-dump] Channel WORKER_OBJECT_EVICTION [state-dump] - cumulative subscribe requests: 24702 [state-dump] - cumulative unsubscribe requests: 8 [state-dump] - active subscribed publishers: 1 [state-dump] - cumulative published messages: 8 [state-dump] - cumulative processed messages: 8 [state-dump] Channel WORKER_REF_REMOVED_CHANNEL [state-dump] - cumulative subscribe requests: 0 [state-dump] - cumulative unsubscribe requests: 0 [state-dump] - active subscribed publishers: 0 [state-dump] - cumulative published messages: 0 [state-dump] - cumulative processed messages: 0 [state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL [state-dump] - cumulative subscribe requests: 39517 [state-dump] - cumulative unsubscribe requests: 29632 [state-dump] - active subscribed publishers: 1 [state-dump] - cumulative published messages: 47483 [state-dump] - cumulative processed messages: 32677 [state-dump] num async plasma notifications: 0 [state-dump] Remote node managers: [state-dump] Event stats: [state-dump] Global stats: 353895 total (79 active) [state-dump] Queueing time: mean = 19.490 ms, max = 387.349 s, min = -0.000 s, total = 6897.369 s [state-dump] Execution time: mean = 9.490 ms, total = 3358.373 s [state-dump] Event stats: [state-dump] ObjectManager.ObjectAdded - 40319 total (0 active), Execution time: mean = 350.596 us, total = 14.136 s, Queueing time: mean = 3.771 ms, max = 927.645 ms, min = 1.599 us, total = 152.043 s [state-dump] ObjectManager.ObjectDeleted - 34600 total (0 active), Execution time: mean = 3.454 us, total = 119.514 ms, Queueing time: mean = 30.240 ms, max = 1.025 s, min = 8.823 us, total = 1046.301 s [state-dump] Subscriber.HandlePublishedMessage_WORKER_OBJECT_LOCATIONS_CHANNEL - 32677 total (0 active), Execution time: mean = 147.025 us, total = 4.804 s, Queueing time: mean = 75.026 ms, max = 712.568 ms, min = 63.484 us, total = 2451.636 s [state-dump] NodeManagerService.grpc_server.PinObjectIDs - 24702 total (0 active), Execution time: mean = 6.426 ms, total = 158.744 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.PinObjectIDs.HandleRequestImpl - 24702 total (0 active), Execution time: mean = 723.704 us, total = 17.877 s, Queueing time: mean = 5.482 ms, max = 927.276 ms, min = 2.264 us, total = 135.418 s [state-dump] CoreWorkerService.grpc_client.UpdateObjectLocationBatch.OnReplyReceived - 24315 total (0 active), Execution time: mean = 22.131 us, total = 538.125 ms, Queueing time: mean = 982.294 us, max = 1.025 s, min = 2.151 us, total 
= 23.884 s [state-dump] CoreWorkerService.grpc_client.UpdateObjectLocationBatch - 24315 total (0 active), Execution time: mean = 684.681 us, total = 16.648 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.PubsubCommandBatch - 24306 total (0 active), Execution time: mean = 638.580 us, total = 15.521 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.PubsubCommandBatch.OnReplyReceived - 24306 total (0 active), Execution time: mean = 26.304 us, total = 639.346 ms, Queueing time: mean = 432.775 us, max = 1.183 s, min = 1.579 us, total = 10.519 s [state-dump] CoreWorkerService.grpc_client.RestoreSpilledObjects - 15600 total (4 active), Execution time: mean = 28.853 ms, total = 450.106 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.RestoreSpilledObjects.OnReplyReceived - 15596 total (0 active), Execution time: mean = 170.114 us, total = 2.653 s, Queueing time: mean = 1.451 ms, max = 501.009 ms, min = 1.459 us, total = 22.631 s [state-dump] NodeManagerService.grpc_server.ReportWorkerBacklog - 11542 total (0 active), Execution time: mean = 16.816 ms, total = 194.095 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.ReportWorkerBacklog.HandleRequestImpl - 11542 total (0 active), Execution time: mean = 36.374 us, total = 419.833 ms, Queueing time: mean = 16.492 ms, max = 1.059 s, min = 1.762 us, total = 190.353 s [state-dump] NodeManager.SpillObjects - 6676 total (1 active), Execution time: mean = 188.136 us, total = 1.256 s, Queueing time: mean = 25.744 ms, max = 1.025 s, min = 2.328 us, total = 171.867 s [state-dump] NodeManager.GlobalGC - 6676 total (1 active), Execution time: mean = 697.991 ns, total = 4.660 ms, Queueing time: mean = 25.742 ms, max = 1.025 s, min = 4.080 us, total = 171.856 s [state-dump] NodeManager.CheckGC - 4216 total (1 active), Execution time: mean = 83.181 us, total = 350.690 ms, Queueing time: mean = 13.818 ms, max = 1.117 s, min = -0.000 s, total = 58.256 s [state-dump] ObjectManager.UpdateAvailableMemory - 4215 total (0 active), Execution time: mean = 2.301 ms, total = 9.700 s, Queueing time: mean = 2.061 ms, max = 911.750 ms, min = 2.216 us, total = 8.688 s [state-dump] RaySyncer.OnDemandBroadcasting - 4215 total (1 active), Execution time: mean = 428.190 us, total = 1.805 s, Queueing time: mean = 13.501 ms, max = 1.117 s, min = 82.643 us, total = 56.906 s [state-dump] RaySyncer.BroadcastMessage - 3531 total (0 active), Execution time: mean = 219.941 us, total = 776.612 ms, Queueing time: mean = 1.353 us, max = 511.349 us, min = 156.000 ns, total = 4.779 ms [state-dump] - 3531 total (1 active), Execution time: mean = 1.624 us, total = 5.735 ms, Queueing time: mean = 2.937 ms, max = 1.027 s, min = 2.589 us, total = 10.369 s [state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 2258 total (1 active), Execution time: mean = 16.172 us, total = 36.517 ms, Queueing time: mean = 12.684 ms, max = 1.053 s, min = -0.000 s, total = 28.642 s [state-dump] CoreWorkerService.grpc_client.GetCoreWorkerStats - 2220 total (5 active), Execution time: mean = 207.490 ms, total = 460.628 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] 
CoreWorkerService.grpc_client.GetCoreWorkerStats.OnReplyReceived - 2215 total (0 active), Execution time: mean = 17.258 us, total = 38.227 ms, Queueing time: mean = 10.347 ms, max = 967.538 ms, min = 1.467 us, total = 22.919 s [state-dump] CoreWorkerService.grpc_client.LocalGC - 823 total (5 active), Execution time: mean = 440.409 ms, total = 362.457 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.LocalGC.OnReplyReceived - 818 total (0 active), Execution time: mean = 30.099 us, total = 24.621 ms, Queueing time: mean = 5.146 ms, max = 308.246 ms, min = 7.799 us, total = 4.209 s [state-dump] NodeManagerService.grpc_server.GetResourceLoad.HandleRequestImpl - 475 total (0 active), Execution time: mean = 91.509 us, total = 43.467 ms, Queueing time: mean = 15.303 ms, max = 731.127 ms, min = 3.877 us, total = 7.269 s [state-dump] NodeManagerService.grpc_server.GetResourceLoad - 475 total (0 active), Execution time: mean = 15.682 ms, total = 7.449 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManager.ScheduleAndDispatchTasks - 469 total (1 active), Execution time: mean = 348.364 us, total = 163.383 ms, Queueing time: mean = 24.594 ms, max = 763.148 ms, min = -0.000 s, total = 11.535 s [state-dump] NodeManager.deadline_timer.spill_objects_when_over_threshold - 469 total (1 active), Execution time: mean = 391.263 us, total = 183.502 ms, Queueing time: mean = 24.543 ms, max = 762.978 ms, min = -0.000 s, total = 11.511 s [state-dump] NodeManager.deadline_timer.flush_free_objects - 468 total (1 active), Execution time: mean = 1.755 ms, total = 821.451 ms, Queueing time: mean = 24.152 ms, max = 762.977 ms, min = 110.743 us, total = 11.303 s [state-dump] ClusterResourceManager.ResetRemoteNodeView - 159 total (1 active), Execution time: mean = 8.141 us, total = 1.294 ms, Queueing time: mean = 25.493 ms, max = 1.070 s, min = 89.436 us, total = 4.053 s [state-dump] ClientConnection.async_read.ProcessMessageHeader - 148 total (29 active), Execution time: mean = 13.389 us, total = 1.982 ms, Queueing time: mean = 15.137 s, max = 387.349 s, min = 8.087 us, total = 2240.230 s [state-dump] ClientConnection.async_read.ProcessMessage - 119 total (0 active), Execution time: mean = 6.902 ms, total = 821.373 ms, Queueing time: mean = 16.078 ms, max = 871.154 ms, min = 2.864 us, total = 1.913 s [state-dump] CoreWorkerService.grpc_client.PubsubLongPolling - 114 total (1 active), Execution time: mean = 4.011 s, total = 457.201 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.PubsubLongPolling.OnReplyReceived - 113 total (0 active), Execution time: mean = 1.176 ms, total = 132.833 ms, Queueing time: mean = 89.132 ms, max = 1.178 s, min = 8.429 us, total = 10.072 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.CheckAlive - 96 total (0 active), Execution time: mean = 1.029 ms, total = 98.798 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.CheckAlive.OnReplyReceived - 96 total (0 active), Execution time: mean = 41.960 us, total = 4.028 ms, Queueing time: mean = 7.564 ms, max = 703.496 ms, min = 3.812 us, total = 726.177 ms [state-dump] NodeManager.GcsCheckAlive - 96 total (1 active), Execution time: mean = 261.187 us, total = 25.074 ms, Queueing time: mean = 41.860 ms, max = 722.902 ms, min = 
184.407 us, total = 4.019 s [state-dump] NodeManager.deadline_timer.record_metrics - 96 total (1 active), Execution time: mean = 354.789 us, total = 34.060 ms, Queueing time: mean = 41.782 ms, max = 722.885 ms, min = 242.923 us, total = 4.011 s [state-dump] NodeManagerService.grpc_server.GetNodeStats - 88 total (1 active), Execution time: mean = 1.710 s, total = 150.510 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.GetNodeStats.HandleRequestImpl - 88 total (0 active), Execution time: mean = 6.723 ms, total = 591.644 ms, Queueing time: mean = 11.016 ms, max = 281.735 ms, min = 7.991 us, total = 969.405 ms [state-dump] NodeManager.deadline_timer.debug_state_dump - 48 total (1 active), Execution time: mean = 3.090 ms, total = 148.338 ms, Queueing time: mean = 58.900 ms, max = 723.917 ms, min = -0.000 s, total = 2.827 s [state-dump] CoreWorkerService.grpc_client.SpillObjects - 43 total (1 active), Execution time: mean = 3.862 s, total = 166.045 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.SpillObjects.OnReplyReceived - 42 total (0 active), Execution time: mean = 111.446 ms, total = 4.681 s, Queueing time: mean = 62.842 ms, max = 387.449 ms, min = 8.059 us, total = 2.639 s [state-dump] NodeManagerService.grpc_server.RequestWorkerLease.HandleRequestImpl - 37 total (0 active), Execution time: mean = 11.567 ms, total = 427.961 ms, Queueing time: mean = 305.619 ms, max = 578.807 ms, min = 10.398 us, total = 11.308 s [state-dump] NodeManagerService.grpc_server.RequestWorkerLease - 37 total (17 active), Execution time: mean = 22.799 s, total = 843.559 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ClientConnection.async_write.DoAsyncWrites - 30 total (0 active), Execution time: mean = 1.728 us, total = 51.845 us, Queueing time: mean = 8.521 ms, max = 123.820 ms, min = 67.728 us, total = 255.616 ms [state-dump] NodeManagerService.grpc_server.GetSystemConfig - 29 total (0 active), Execution time: mean = 69.950 ms, total = 2.029 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.GetSystemConfig.HandleRequestImpl - 29 total (0 active), Execution time: mean = 65.678 us, total = 1.905 ms, Queueing time: mean = 69.595 ms, max = 545.721 ms, min = 12.926 us, total = 2.018 s [state-dump] WorkerPool.PopWorkerCallback - 20 total (0 active), Execution time: mean = 24.273 ms, total = 485.460 ms, Queueing time: mean = 87.041 ms, max = 824.407 ms, min = 50.721 us, total = 1.741 s [state-dump] NodeManagerService.grpc_server.ReturnWorker - 17 total (0 active), Execution time: mean = 161.319 ms, total = 2.742 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.ReturnWorker.HandleRequestImpl - 17 total (0 active), Execution time: mean = 60.500 ms, total = 1.029 s, Queueing time: mean = 100.602 ms, max = 466.690 ms, min = 10.003 us, total = 1.710 s [state-dump] PeriodicalRunner.RunFnPeriodically - 12 total (0 active), Execution time: mean = 314.958 us, total = 3.780 ms, Queueing time: mean = 21.744 ms, max = 113.592 ms, min = 40.100 us, total = 260.934 ms [state-dump] Subscriber.HandlePublishedMessage_WORKER_OBJECT_EVICTION - 8 total (0 active), Execution time: mean = 126.471 us, total = 1.012 ms, Queueing time: mean = 200.971 us, 
max = 570.158 us, min = 82.400 us, total = 1.608 ms [state-dump] NodeManager.deadline_timer.print_event_loop_stats - 8 total (1 active, 1 running), Execution time: mean = 2.299 ms, total = 18.390 ms, Queueing time: mean = 8.998 ms, max = 18.059 ms, min = 2.549 ms, total = 71.981 ms [state-dump] CoreWorkerService.grpc_client.DeleteSpilledObjects - 6 total (0 active), Execution time: mean = 243.350 ms, total = 1.460 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.DeleteSpilledObjects.OnReplyReceived - 6 total (0 active), Execution time: mean = 31.829 us, total = 190.973 us, Queueing time: mean = 69.500 ms, max = 416.933 ms, min = 11.997 us, total = 417.002 ms [state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 2 total (1 active), Execution time: mean = 2.070 s, total = 4.141 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] RaySyncerRegister - 2 total (0 active), Execution time: mean = 2.900 us, total = 5.800 us, Queueing time: mean = 750.000 ns, max = 1.200 us, min = 300.000 ns, total = 1.500 us [state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 2 total (0 active), Execution time: mean = 137.350 us, total = 274.700 us, Queueing time: mean = 1.840 ms, max = 3.445 ms, min = 235.500 us, total = 3.681 ms [state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), Execution time: mean = 538.150 us, total = 1.076 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), Execution time: mean = 558.200 us, total = 558.200 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), Execution time: mean = 646.700 us, total = 646.700 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 904.800 us, total = 904.800 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 127.421 ms, total = 127.421 ms, Queueing time: mean = 12.800 us, max = 12.800 us, min = 12.800 us, total = 12.800 us [state-dump] ray::rpc::JobInfoGcsService.grpc_client.AddJob - 1 total (0 active), Execution time: mean = 946.834 us, total = 946.834 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 975.800 us, total = 975.800 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll.OnReplyReceived - 1 total (0 active), Execution time: mean = 192.163 us, total = 192.163 us, Queueing time: mean = 11.301 us, max = 11.301 us, min = 11.301 us, total = 11.301 us [state-dump] ray::rpc::JobInfoGcsService.grpc_client.AddJob.OnReplyReceived - 1 total (0 active), Execution time: mean = 73.135 us, total = 73.135 us, Queueing time: mean = 185.466 us, max = 185.466 us, min 
= 185.466 us, total = 185.466 us [state-dump] NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 21.800 us, total = 21.800 us, Queueing time: mean = 18.000 us, max = 18.000 us, min = 18.000 us, total = 18.000 us [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 376.600 us, total = 376.600 us, Queueing time: mean = 79.500 us, max = 79.500 us, min = 79.500 us, total = 79.500 us [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 227.800 us, total = 227.800 us, Queueing time: mean = 20.000 us, max = 20.000 us, min = 20.000 us, total = 20.000 us [state-dump] Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 1 total (0 active), Execution time: mean = 50.844 us, total = 50.844 us, Queueing time: mean = 174.838 us, max = 174.838 us, min = 174.838 us, total = 174.838 us [state-dump] DebugString() time ms: 3 [state-dump] [state-dump] [2024-08-05 17:47:38,742 I 9128 18496] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 74104 MiB, 24699 objects, write throughput 756 MiB/s. [2024-08-05 17:47:38,746 I 9128 32964] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000022AFDE70000, 17179869192) [2024-08-05 17:47:38,746 I 9128 18496] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-05 17:47:40,993 I 9128 32964] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2221201141 [2024-08-05 17:47:41,002 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 26790 MiB, 15597 objects, read throughput 210 MiB/s [2024-08-05 17:47:41,437 I 9128 18496] (raylet.exe) client.cc:266: Erasing re-used mmap entry for fd 000000000000082C [2024-08-05 17:47:41,526 I 9128 18496] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-05 17:47:43,806 I 9128 18496] (raylet.exe) node_manager.cc:656: Sending Python GC request to 29 local workers to clean up Python cyclic references. [2024-08-05 17:47:46,401 I 9128 18496] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 76223 MiB, 24700 objects, write throughput 740 MiB/s. [2024-08-05 17:47:46,508 I 9128 18496] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-05 17:47:48,416 I 9128 32964] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2221201141 [2024-08-05 17:47:48,473 C 9128 32964] (raylet.exe) dlmalloc.cc:129: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1450 *** StackTrace Information *** unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown ```
Liquidmasl commented 3 months ago

This seems to be the issue: https://github.com/modin-project/modin/issues/7361