ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Check failed: *handle != nullptr CreateFileMapping() failed. Basically any operation that needs to spill a bunch of data #46990

Closed · Liquidmasl closed this issue 2 months ago

Liquidmasl commented 2 months ago

What happened + What you expected to happen

I am desperate, confused, and in need of help!

If I try to load a big dataset with Modin (read_parquet), apply a function to a column, save it with to_parquet, and so on, a raylet dies with this error:

[2024-08-06 20:26:18,117 C 3460 27664] (raylet.exe) dlmalloc.cc:129:  Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455

Smaller datasets work fine.
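
For reference, here is a minimal sketch of the kind of pipeline that triggers this, assuming Modin's pandas API on Ray. The file path, column name, and transform are placeholders I made up, not details from the report:

import ray
import modin.pandas as pd

ray.init()  # single Windows machine

# ~43 GB parquet file; reading it forces Ray to spill objects to disk
df = pd.read_parquet("big_dataset.parquet")

# any per-column transform is enough to churn the object store
df["some_column"] = df["some_column"].apply(lambda x: x)

# the raylet dies partway through operations like this
df.to_parquet("out.parquet")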

I work on a single Windows PC with 64 GB of RAM and 20 logical processors. A 43 GB file (12 columns, around 12 billion rows) leads to issues.

Setting _memory higher does not seem to change anything anymore (it was necessary to even get this far). Currently the logs suggest that data is spilled just fine once the object store is full, but at some point it simply stops. I am not positive about that, though, as the logs are quite complicated. Also, the dashboard seems to say 0 B of the storage is used.

The dashboard still shows 0/200 GB for Memory (I set _memory to 200 GB in ray.init()). I hit this issue earlier when my hard drive was full (fair enough); now I have around 600 GB free. I also set my virtual memory to 120 GB (twice my RAM), since GetLastError() = 1455 is Windows' ERROR_COMMITMENT_LIMIT ("the paging file is too small for this operation to complete"), but that did not help either.
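
For context, a sketch of the init settings being experimented with here: ray.init's _memory and object_store_memory parameters, plus the object_spilling_config system setting that redirects spills to a chosen disk. All concrete values and the spill directory below are illustrative assumptions, not settings from the report:

import json
import ray

ray.init(
    # memory available to tasks/actors (the "_memory" mentioned above)
    _memory=200 * 1024**3,
    # plasma object store size; the logs above show a ~13.9 GB store
    object_store_memory=14 * 1024**3,
    _system_config={
        # spill to a directory on a drive with plenty of free space
        # (hypothetical path)
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "D:/ray_spill"}}
        ),
    },
)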

I am still unsure whether this is a user/hardware error or a bug. I have also written a bunch of posts on Modin's issue board; the most relevant one to this question is: https://github.com/modin-project/modin/issues/7360

raylet.out tail:

[2024-08-06 20:25:08,655 I 3460 15984] (raylet.exe) node_manager.cc:525: [state-dump] NodeManager: [state-dump] Node ID: 91f33d1be5043cf02cbfa58ec1e53b8fca6bb778d09eeaf38edafed2 [state-dump] Node name: 127.0.0.1 [state-dump] InitialConfigResources: {accelerator_type:G: 10000, object_store_memory: 139284049920000, node:__internal_head__: 10000, CPU: 200000, memory: 2684354560000000, node:127.0.0.1: 10000, GPU: 10000} [state-dump] ClusterTaskManager: [state-dump] ========== Node: 91f33d1be5043cf02cbfa58ec1e53b8fca6bb778d09eeaf38edafed2 ================= [state-dump] Infeasible queue length: 0 [state-dump] Schedule queue length: 0 [state-dump] Dispatch queue length: 1 [state-dump] num_waiting_for_resource: 0 [state-dump] num_waiting_for_plasma_memory: 1 [state-dump] num_waiting_for_remote_node_resources: 0 [state-dump] num_worker_not_started_by_job_config_not_exist: 0 [state-dump] num_worker_not_started_by_registration_timeout: 0 [state-dump] num_tasks_waiting_for_workers: 0 [state-dump] num_cancelled_tasks: 0 [state-dump] cluster_resource_scheduler state: [state-dump] Local id: 4129259612629820943 Local resources: {"total":{CPU: [200000], memory: [2684354560000000], GPU: [10000], node:127.0.0.1: [10000], object_store_memory: [139284049920000], node:__internal_head__: [10000], accelerator_type:G: [10000]}}, "available": {memory: [2684354560000000], node:__internal_head__: [10000], node:127.0.0.1: [10000], GPU: [10000], accelerator_type:G: [10000], object_store_memory: [30163939040000], CPU: [160000]}}, "labels":{"ray.io/node_id":"91f33d1be5043cf02cbfa58ec1e53b8fca6bb778d09eeaf38edafed2",} is_draining: 0 is_idle: 0 Cluster resources: node id: 4129259612629820943{"total":{GPU: 10000, object_store_memory: 139284049920000, node:__internal_head__: 10000, CPU: 200000, memory: 2684354560000000, node:127.0.0.1: 10000, accelerator_type:G: 10000}}, "available": {CPU: 160000, memory: 2684354560000000, accelerator_type:G: 10000, object_store_memory: 30163939040000, node:__internal_head__: 10000, GPU: 10000, node:127.0.0.1: 10000}}, "labels":{"ray.io/node_id":"91f33d1be5043cf02cbfa58ec1e53b8fca6bb778d09eeaf38edafed2",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1} { "placment group locations": [], "node to bundles": []} [state-dump] Waiting tasks size: 8 [state-dump] Number of executing tasks: 4 [state-dump] Number of pinned task arguments: 497 [state-dump] Number of total spilled tasks: 0 [state-dump] Number of spilled waiting tasks: 0 [state-dump] Number of spilled unschedulable tasks: 0 [state-dump] Resource usage { [state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func pid=30852 worker_id=cc0e9927c76d054bdb68d9bae75bf2d21704b97ed261f5071d72cc5a): {CPU: 10000} [state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func pid=23724 worker_id=b167e0d0ccf3b4b8de788e26726ea9ace544bbd10c433e5a8f62fb1a): {CPU: 10000} [state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func pid=3504 worker_id=54672d39444ed2c82f6b338557a73ef66d362244fb6642f4a851f8fa): {CPU: 10000} [state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func pid=28632 worker_id=9e06050ca7e4ca6f2e1794c352073a04a21c60e40db7a338efc42e8a): {CPU: 10000} [state-dump] } [state-dump] Running tasks by scheduling class: [state-dump] - {depth=1 function_descriptor={type=PythonFunctionDescriptor, module_name=modin.core.execution.ray.implementations.pandas_on_ray.partitioning.virtual_partition, class_name=, function_name=_deploy_ray_func, function_hash=be419690f62648b68cb2b8f30b30d6c2} 
scheduling_strategy=default_scheduling_strategy { [state-dump] } [state-dump] resource_set={CPU : 1, }}: 4/20 [state-dump] ================================================== [state-dump] [state-dump] ClusterResources: [state-dump] LocalObjectManager: [state-dump] - num pinned objects: 1 [state-dump] - pinned objects size: 6612 [state-dump] - num objects pending restore: 0 [state-dump] - num objects pending spill: 4 [state-dump] - num bytes pending spill: 10912004476 [state-dump] - num bytes currently spilled: 55026642027 [state-dump] - cumulative spill requests: 2485 [state-dump] - cumulative restore requests: 1489 [state-dump] - spilled objects pending delete: 0 [state-dump] [state-dump] ObjectManager: [state-dump] - num local objects: 625 [state-dump] - num unfulfilled push requests: 0 [state-dump] - num object pull requests: 995 [state-dump] - num chunks received total: 0 [state-dump] - num chunks received failed (all): 0 [state-dump] - num chunks received failed / cancelled: 0 [state-dump] - num chunks received failed / plasma error: 0 [state-dump] Event stats: [state-dump] Global stats: 0 total (0 active) [state-dump] Queueing time: mean = -nan(ind) s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] Execution time: mean = -nan(ind) s, total = 0.000 s [state-dump] Event stats: [state-dump] PushManager: [state-dump] - num pushes in flight: 0 [state-dump] - num chunks in flight: 0 [state-dump] - num chunks remaining: 0 [state-dump] - max chunks allowed: 409 [state-dump] OwnershipBasedObjectDirectory: [state-dump] - num listeners: 995 [state-dump] - cumulative location updates: 3078 [state-dump] - num location updates per second: 0.000 [state-dump] - num location lookups per second: 0.000 [state-dump] - num locations added per second: 0.000 [state-dump] - num locations removed per second: 0.000 [state-dump] BufferPool: [state-dump] - create buffer state map size: 0 [state-dump] PullManager: [state-dump] - num bytes available for pulled objects: 0 [state-dump] - num bytes being pulled (all): 2232145226 [state-dump] - num bytes being pulled / pinned: 2232145226 [state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable} [state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable} [state-dump] - task request bundles: BundlePullRequestQueue{9 total, 1 active, 8 inactive, 0 unpullable} [state-dump] - first get request bundle: N/A [state-dump] - first wait request bundle: N/A [state-dump] - first task request bundle: 125 objects, 2232145226 bytes (inactive, waiting for capacity) [state-dump] - num objects queued: 995 [state-dump] - num objects actively pulled (all): 125 [state-dump] - num objects actively pulled / pinned: 125 [state-dump] - num bundles being pulled: 1 [state-dump] - num pull retries: 164 [state-dump] - max timeout seconds: 20 [state-dump] - max timeout request is already processed. No entry. 
[state-dump] [state-dump] WorkerPool: [state-dump] - registered jobs: 2 [state-dump] - process_failed_job_config_missing: 0 [state-dump] - process_failed_rate_limited: 0 [state-dump] - process_failed_pending_registration: 0 [state-dump] - process_failed_runtime_env_setup_failed: 0 [state-dump] - num PYTHON workers: 28 [state-dump] - num PYTHON drivers: 2 [state-dump] - num object spill callbacks queued: 0 [state-dump] - num object restore queued: 0 [state-dump] - num util functions queued: 0 [state-dump] - num idle workers: 16 [state-dump] TaskDependencyManager: [state-dump] - task deps map size: 9 [state-dump] - get req map size: 0 [state-dump] - wait req map size: 0 [state-dump] - local objects map size: 625 [state-dump] WaitManager: [state-dump] - num active wait requests: 0 [state-dump] Subscriber: [state-dump] Channel WORKER_REF_REMOVED_CHANNEL [state-dump] - cumulative subscribe requests: 0 [state-dump] - cumulative unsubscribe requests: 0 [state-dump] - active subscribed publishers: 0 [state-dump] - cumulative published messages: 0 [state-dump] - cumulative processed messages: 0 [state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL [state-dump] - cumulative subscribe requests: 3971 [state-dump] - cumulative unsubscribe requests: 2976 [state-dump] - active subscribed publishers: 1 [state-dump] - cumulative published messages: 4686 [state-dump] - cumulative processed messages: 3079 [state-dump] Channel WORKER_OBJECT_EVICTION [state-dump] - cumulative subscribe requests: 2490 [state-dump] - cumulative unsubscribe requests: 0 [state-dump] - active subscribed publishers: 1 [state-dump] - cumulative published messages: 0 [state-dump] - cumulative processed messages: 0 [state-dump] num async plasma notifications: 0 [state-dump] Remote node managers: [state-dump] Event stats: [state-dump] Global stats: 70554 total (64 active) [state-dump] Queueing time: mean = 17.896 ms, max = 210.608 s, min = -0.000 s, total = 1262.659 s [state-dump] Execution time: mean = 20.254 ms, total = 1429.012 s [state-dump] Event stats: [state-dump] NodeManagerService.grpc_server.ReportWorkerBacklog.HandleRequestImpl - 7605 total (0 active), Execution time: mean = 35.692 us, total = 271.439 ms, Queueing time: mean = 1.215 ms, max = 281.060 ms, min = 1.209 us, total = 9.239 s [state-dump] NodeManagerService.grpc_server.ReportWorkerBacklog - 7605 total (0 active), Execution time: mean = 1.441 ms, total = 10.962 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ObjectManager.ObjectAdded - 4001 total (0 active), Execution time: mean = 491.283 us, total = 1.966 s, Queueing time: mean = 806.685 us, max = 297.982 ms, min = 2.166 us, total = 3.228 s [state-dump] NodeManager.GlobalGC - 3630 total (0 active), Execution time: mean = 515.290 ns, total = 1.871 ms, Queueing time: mean = 447.539 us, max = 77.488 ms, min = 3.048 us, total = 1.625 s [state-dump] NodeManager.SpillObjects - 3630 total (0 active), Execution time: mean = 60.441 us, total = 219.400 ms, Queueing time: mean = 446.938 us, max = 77.492 ms, min = 2.429 us, total = 1.622 s [state-dump] ObjectManager.ObjectDeleted - 3376 total (0 active), Execution time: mean = 5.468 us, total = 18.459 ms, Queueing time: mean = 11.780 ms, max = 234.522 ms, min = 9.769 us, total = 39.769 s [state-dump] Subscriber.HandlePublishedMessage_WORKER_OBJECT_LOCATIONS_CHANNEL - 3079 total (0 active), Execution time: mean = 68.904 us, total = 212.155 ms, Queueing time: mean = 12.185 ms, max = 100.138 ms, min = 115.347 us, total = 37.518 
s [state-dump] NodeManager.CheckGC - 2732 total (1 active), Execution time: mean = 53.973 us, total = 147.453 ms, Queueing time: mean = 9.807 ms, max = 367.464 ms, min = -0.000 s, total = 26.793 s [state-dump] ObjectManager.UpdateAvailableMemory - 2732 total (0 active), Execution time: mean = 130.906 us, total = 357.636 ms, Queueing time: mean = 370.747 us, max = 123.836 ms, min = 2.025 us, total = 1.013 s [state-dump] RaySyncer.OnDemandBroadcasting - 2731 total (1 active), Execution time: mean = 165.315 us, total = 451.476 ms, Queueing time: mean = 9.736 ms, max = 367.246 ms, min = -0.000 s, total = 26.590 s [state-dump] CoreWorkerService.grpc_client.UpdateObjectLocationBatch.OnReplyReceived - 2671 total (0 active), Execution time: mean = 24.267 us, total = 64.816 ms, Queueing time: mean = 969.699 us, max = 421.228 ms, min = 2.361 us, total = 2.590 s [state-dump] CoreWorkerService.grpc_client.UpdateObjectLocationBatch - 2671 total (0 active), Execution time: mean = 928.049 us, total = 2.479 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.PubsubCommandBatch.OnReplyReceived - 2529 total (0 active), Execution time: mean = 25.200 us, total = 63.731 ms, Queueing time: mean = 132.194 us, max = 131.745 ms, min = 2.125 us, total = 334.318 ms [state-dump] CoreWorkerService.grpc_client.PubsubCommandBatch - 2529 total (0 active), Execution time: mean = 869.150 us, total = 2.198 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.PinObjectIDs - 2490 total (0 active), Execution time: mean = 2.300 ms, total = 5.726 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.PinObjectIDs.HandleRequestImpl - 2490 total (0 active), Execution time: mean = 768.744 us, total = 1.914 s, Queueing time: mean = 1.355 ms, max = 420.572 ms, min = 3.387 us, total = 3.375 s [state-dump] RaySyncer.BroadcastMessage - 1530 total (0 active), Execution time: mean = 186.959 us, total = 286.048 ms, Queueing time: mean = 901.692 ns, max = 22.566 us, min = 239.000 ns, total = 1.380 ms [state-dump] - 1530 total (0 active), Execution time: mean = 739.419 ns, total = 1.131 ms, Queueing time: mean = 214.433 us, max = 48.461 ms, min = 2.896 us, total = 328.083 ms [state-dump] CoreWorkerService.grpc_client.RestoreSpilledObjects - 1489 total (0 active), Execution time: mean = 104.063 ms, total = 154.951 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.RestoreSpilledObjects.OnReplyReceived - 1489 total (0 active), Execution time: mean = 125.665 us, total = 187.115 ms, Queueing time: mean = 837.597 us, max = 92.354 ms, min = 3.837 us, total = 1.247 s [state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 1457 total (1 active), Execution time: mean = 15.072 us, total = 21.959 ms, Queueing time: mean = 5.983 ms, max = 286.796 ms, min = -0.000 s, total = 8.718 s [state-dump] CoreWorkerService.grpc_client.GetCoreWorkerStats - 1392 total (4 active), Execution time: mean = 125.492 ms, total = 174.685 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.GetCoreWorkerStats.OnReplyReceived - 1388 total (0 active), Execution time: mean = 13.727 us, total = 19.053 ms, Queueing time: mean = 1.585 ms, max = 61.248 ms, min = 2.021 us, total 
= 2.200 s [state-dump] CoreWorkerService.grpc_client.LocalGC - 552 total (2 active), Execution time: mean = 285.815 ms, total = 157.770 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.LocalGC.OnReplyReceived - 550 total (0 active), Execution time: mean = 22.035 us, total = 12.119 ms, Queueing time: mean = 3.909 ms, max = 59.582 ms, min = 4.353 us, total = 2.150 s [state-dump] NodeManager.deadline_timer.spill_objects_when_over_threshold - 297 total (1 active), Execution time: mean = 206.715 us, total = 61.394 ms, Queueing time: mean = 10.208 ms, max = 114.737 ms, min = -0.000 s, total = 3.032 s [state-dump] NodeManagerService.grpc_server.GetResourceLoad - 297 total (0 active), Execution time: mean = 1.828 ms, total = 542.780 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManager.deadline_timer.flush_free_objects - 297 total (1 active), Execution time: mean = 5.019 us, total = 1.491 ms, Queueing time: mean = 10.513 ms, max = 114.777 ms, min = -0.000 s, total = 3.122 s [state-dump] NodeManagerService.grpc_server.GetResourceLoad.HandleRequestImpl - 297 total (0 active), Execution time: mean = 75.784 us, total = 22.508 ms, Queueing time: mean = 1.554 ms, max = 328.481 ms, min = 4.832 us, total = 461.671 ms [state-dump] NodeManager.ScheduleAndDispatchTasks - 297 total (1 active), Execution time: mean = 138.807 us, total = 41.226 ms, Queueing time: mean = 10.398 ms, max = 114.775 ms, min = -0.000 s, total = 3.088 s [state-dump] ClientConnection.async_read.ProcessMessageHeader - 136 total (30 active), Execution time: mean = 8.379 us, total = 1.140 ms, Queueing time: mean = 7.935 s, max = 210.608 s, min = 13.254 us, total = 1079.164 s [state-dump] ClientConnection.async_read.ProcessMessage - 106 total (0 active), Execution time: mean = 3.914 ms, total = 414.927 ms, Queueing time: mean = 207.368 us, max = 16.113 ms, min = 3.866 us, total = 21.981 ms [state-dump] ClusterResourceManager.ResetRemoteNodeView - 100 total (1 active), Execution time: mean = 8.010 us, total = 801.009 us, Queueing time: mean = 8.067 ms, max = 28.640 ms, min = 311.940 us, total = 806.669 ms [state-dump] CoreWorkerService.grpc_client.SpillObjects - 82 total (1 active), Execution time: mean = 1.122 s, total = 92.018 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.SpillObjects.OnReplyReceived - 81 total (0 active), Execution time: mean = 3.230 ms, total = 261.632 ms, Queueing time: mean = 2.911 ms, max = 232.712 ms, min = 4.917 us, total = 235.796 ms [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.CheckAlive.OnReplyReceived - 60 total (0 active), Execution time: mean = 30.239 us, total = 1.814 ms, Queueing time: mean = 672.951 us, max = 37.758 ms, min = 6.613 us, total = 40.377 ms [state-dump] NodeManager.GcsCheckAlive - 60 total (1 active), Execution time: mean = 232.566 us, total = 13.954 ms, Queueing time: mean = 10.661 ms, max = 74.457 ms, min = -0.000 s, total = 639.685 ms [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.CheckAlive - 60 total (0 active), Execution time: mean = 3.845 ms, total = 230.730 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManager.deadline_timer.record_metrics - 60 total (1 active), Execution time: mean = 376.123 us, total = 22.567 ms, Queueing time: mean = 10.543 ms, max = 74.247 ms, min = -0.000 s, 
total = 632.578 ms [state-dump] NodeManagerService.grpc_server.GetNodeStats - 54 total (4 active), Execution time: mean = 2.385 s, total = 128.789 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.GetNodeStats.HandleRequestImpl - 54 total (0 active), Execution time: mean = 3.416 ms, total = 184.490 ms, Queueing time: mean = 1.579 ms, max = 30.136 ms, min = 6.855 us, total = 85.269 ms [state-dump] CoreWorkerService.grpc_client.PubsubLongPolling - 52 total (1 active), Execution time: mean = 5.164 s, total = 268.512 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.PubsubLongPolling.OnReplyReceived - 51 total (0 active), Execution time: mean = 424.185 us, total = 21.633 ms, Queueing time: mean = 5.378 ms, max = 131.855 ms, min = 5.629 us, total = 274.282 ms [state-dump] ClientConnection.async_write.DoAsyncWrites - 32 total (0 active), Execution time: mean = 1.644 us, total = 52.623 us, Queueing time: mean = 96.355 us, max = 185.153 us, min = 39.571 us, total = 3.083 ms [state-dump] NodeManagerService.grpc_server.GetSystemConfig.HandleRequestImpl - 30 total (0 active), Execution time: mean = 59.800 us, total = 1.794 ms, Queueing time: mean = 32.822 us, max = 320.264 us, min = 7.032 us, total = 984.654 us [state-dump] NodeManager.deadline_timer.debug_state_dump - 30 total (1 active), Execution time: mean = 2.666 ms, total = 79.978 ms, Queueing time: mean = 15.239 ms, max = 55.986 ms, min = 147.296 us, total = 457.159 ms [state-dump] NodeManagerService.grpc_server.GetSystemConfig - 30 total (0 active), Execution time: mean = 301.116 us, total = 9.033 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.RequestWorkerLease.HandleRequestImpl - 21 total (0 active), Execution time: mean = 1.966 ms, total = 41.282 ms, Queueing time: mean = 94.438 ms, max = 130.977 ms, min = 11.763 us, total = 1.983 s [state-dump] NodeManagerService.grpc_server.RequestWorkerLease - 21 total (9 active), Execution time: mean = 19.362 s, total = 406.593 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] PeriodicalRunner.RunFnPeriodically - 12 total (0 active), Execution time: mean = 421.867 us, total = 5.062 ms, Queueing time: mean = 11.223 ms, max = 45.753 ms, min = 48.400 us, total = 134.671 ms [state-dump] WorkerPool.PopWorkerCallback - 12 total (0 active), Execution time: mean = 19.260 ms, total = 231.123 ms, Queueing time: mean = 4.521 ms, max = 37.652 ms, min = 13.397 us, total = 54.248 ms [state-dump] NodeManagerService.grpc_server.ReturnWorker.HandleRequestImpl - 8 total (0 active), Execution time: mean = 15.544 ms, total = 124.350 ms, Queueing time: mean = 4.338 ms, max = 24.837 ms, min = 9.385 us, total = 34.707 ms [state-dump] NodeManagerService.grpc_server.ReturnWorker - 8 total (0 active), Execution time: mean = 20.075 ms, total = 160.603 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManager.deadline_timer.print_event_loop_stats - 5 total (1 active, 1 running), Execution time: mean = 1.768 ms, total = 8.841 ms, Queueing time: mean = 8.129 ms, max = 13.249 ms, min = 7.500 ms, total = 40.644 ms [state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 3 total (1 active), Execution time: mean = 5.184 s, total = 15.553 s, Queueing 
time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll.OnReplyReceived - 2 total (0 active), Execution time: mean = 168.188 us, total = 336.375 us, Queueing time: mean = 12.877 us, max = 15.365 us, min = 10.388 us, total = 25.753 us [state-dump] ray::rpc::JobInfoGcsService.grpc_client.AddJob - 2 total (0 active), Execution time: mean = 788.298 us, total = 1.577 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), Execution time: mean = 1.472 ms, total = 2.943 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 2 total (0 active), Execution time: mean = 276.050 us, total = 552.100 us, Queueing time: mean = 1.588 ms, max = 2.726 ms, min = 449.900 us, total = 3.176 ms [state-dump] ray::rpc::JobInfoGcsService.grpc_client.AddJob.OnReplyReceived - 2 total (0 active), Execution time: mean = 78.681 us, total = 157.362 us, Queueing time: mean = 171.833 us, max = 197.168 us, min = 146.499 us, total = 343.667 us [state-dump] Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 2 total (0 active), Execution time: mean = 50.278 us, total = 100.556 us, Queueing time: mean = 188.075 us, max = 253.285 us, min = 122.864 us, total = 376.149 us [state-dump] RaySyncerRegister - 2 total (0 active), Execution time: mean = 5.250 us, total = 10.500 us, Queueing time: mean = 1.900 us, max = 3.500 us, min = 300.000 ns, total = 3.800 us [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 834.100 us, total = 834.100 us, Queueing time: mean = 135.700 us, max = 135.700 us, min = 135.700 us, total = 135.700 us [state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 10.700 us, total = 10.700 us, Queueing time: mean = 283.100 us, max = 283.100 us, min = 283.100 us, total = 283.100 us [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 1.176 ms, total = 1.176 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), Execution time: mean = 677.000 us, total = 677.000 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 952.700 us, total = 952.700 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 267.500 us, total = 267.500 us, Queueing time: mean = 313.800 us, max = 313.800 us, min = 313.800 us, total = 313.800 us [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), Execution time: mean = 1.305 ms, total = 1.305 ms, Queueing time: mean = 0.000 s, max = -0.000 s, 
min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 65.803 ms, total = 65.803 ms, Queueing time: mean = 113.400 us, max = 113.400 us, min = 113.400 us, total = 113.400 us [state-dump] DebugString() time ms: 1 [state-dump] [state-dump] [2024-08-06 20:25:09,290 I 3460 15984] (raylet.exe) node_manager.cc:656: Sending Python GC request to 30 local workers to clean up Python cyclic references. [2024-08-06 20:25:14,814 I 3460 15984] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 62883 MiB, 2489 objects, write throughput 690 MiB/s. [2024-08-06 20:25:14,820 I 3460 27664] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024BE4320000, 2734686216) [2024-08-06 20:25:14,836 I 3460 15984] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-06 20:25:15,068 I 3460 27664] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024B41710000, 2730491912) [2024-08-06 20:25:15,340 I 3460 27664] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024D2A340000, 2751463432) [2024-08-06 20:25:15,625 I 3460 27664] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024C87330000, 2734686216) [2024-08-06 20:25:17,935 I 3460 27664] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2728001119 [2024-08-06 20:25:17,935 I 3460 27664] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2728001119 [2024-08-06 20:25:17,935 I 3460 27664] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2728001119 [2024-08-06 20:25:17,936 I 3460 27664] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2728001119 [2024-08-06 20:25:17,976 I 3460 15984] (raylet.exe) local_object_manager.cc:490: Restored 25579 MiB, 1490 objects, read throughput 621 MiB/s [2024-08-06 20:25:18,831 I 3460 15984] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-06 20:25:18,998 I 3460 15984] (raylet.exe) local_object_manager.cc:490: Restored 27004 MiB, 1573 objects, read throughput 639 MiB/s [2024-08-06 20:25:19,389 I 3460 15984] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-06 20:25:19,390 I 3460 15984] (raylet.exe) node_manager.cc:656: Sending Python GC request to 30 local workers to clean up Python cyclic references. [2024-08-06 20:25:19,470 I 3460 15984] (raylet.exe) client.cc:266: Erasing re-used mmap entry for fd 0000000000000824 [2024-08-06 20:25:19,495 I 3460 15984] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. 
[2024-08-06 20:25:20,122 I 3460 15984] (raylet.exe) local_object_manager.cc:490: Restored 27707 MiB, 1614 objects, read throughput 645 MiB/s [2024-08-06 20:25:20,321 I 3460 27664] (raylet.exe) store.cc:513: Plasma store at capacity ========== Plasma store: ================= Current usage: 19.0485 / 13.9284 GB - num bytes created total: 106390492159 1 pending objects of total size 17MB - objects spillable: 4 - bytes spillable: 10912004476 - objects unsealed: 0 - bytes unsealed: 0 - objects in use: 457 - bytes in use: 19048516361 - objects evictable: 0 - bytes evictable: 0 - objects created by worker: 5 - bytes created by worker: 10912011088 - objects restored: 452 - bytes restored: 8136505273 - objects received: 0 - bytes received: 0 - objects errored: 0 - bytes errored: 0 [2024-08-06 20:25:21,124 I 3460 15984] (raylet.exe) local_object_manager.cc:490: Restored 29819 MiB, 1737 objects, read throughput 679 MiB/s [2024-08-06 20:25:22,148 I 3460 15984] (raylet.exe) local_object_manager.cc:490: Restored 31948 MiB, 1861 objects, read throughput 711 MiB/s [2024-08-06 20:25:23,197 I 3460 15984] (raylet.exe) local_object_manager.cc:490: Restored 33407 MiB, 1946 objects, read throughput 726 MiB/s [2024-08-06 20:25:23,375 I 3460 27664] (raylet.exe) store.cc:513: Plasma store at capacity ========== Plasma store: ================= Current usage: 21.6407 / 13.9284 GB - num bytes created total: 112150849871 1 pending objects of total size 17MB - objects spillable: 0 - bytes spillable: 0 - objects unsealed: 0 - bytes unsealed: 0 - objects in use: 601 - bytes in use: 21640677329 - objects evictable: 0 - bytes evictable: 0 - objects created by worker: 5 - bytes created by worker: 10912011088 - objects restored: 596 - bytes restored: 10728666241 - objects received: 0 - bytes received: 0 - objects errored: 0 - bytes errored: 0 [2024-08-06 20:25:26,993 I 3460 27664] (raylet.exe) store.cc:513: Plasma store at capacity ========== Plasma store: ================= Current usage: 22.0727 / 13.9284 GB - num bytes created total: 112582876697 1 pending objects of total size 2601MB - objects spillable: 0 - bytes spillable: 0 - objects unsealed: 0 - bytes unsealed: 0 - objects in use: 625 - bytes in use: 22072704155 - objects evictable: 0 - bytes evictable: 0 - objects created by worker: 5 - bytes created by worker: 10912011088 - objects restored: 620 - bytes restored: 11160693067 - objects received: 0 - bytes received: 0 - objects errored: 0 - bytes errored: 0 [2024-08-06 20:25:29,419 I 3460 15984] (raylet.exe) node_manager.cc:656: Sending Python GC request to 30 local workers to clean up Python cyclic references. [2024-08-06 20:25:39,528 I 3460 15984] (raylet.exe) node_manager.cc:656: Sending Python GC request to 30 local workers to clean up Python cyclic references. [2024-08-06 20:25:44,512 I 3460 15984] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 73290 MiB, 2493 objects, write throughput 635 MiB/s. [2024-08-06 20:25:44,515 I 3460 15984] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. 
[2024-08-06 20:25:44,515 I 3460 27664] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024B41710000, 2751463432) [2024-08-06 20:25:44,779 I 3460 27664] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024BE5720000, 2818572296) [2024-08-06 20:25:45,051 I 3460 27664] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024D3D740000, 3221225480) [2024-08-06 20:25:45,320 I 3460 27664] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024C8D730000, 2952790024) [2024-08-06 20:25:47,588 I 3460 27664] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2728001119 [2024-08-06 20:25:47,588 I 3460 27664] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2728001119 [2024-08-06 20:25:47,588 I 3460 27664] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2728001119 [2024-08-06 20:25:47,588 I 3460 27664] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2728001119 [2024-08-06 20:25:47,655 I 3460 15984] (raylet.exe) local_object_manager.cc:490: Restored 34094 MiB, 1986 objects, read throughput 701 MiB/s [2024-08-06 20:25:48,418 I 3460 15984] (raylet.exe) client.cc:266: Erasing re-used mmap entry for fd 0000000000000814 [2024-08-06 20:25:48,420 I 3460 15984] (raylet.exe) client.cc:266: Erasing re-used mmap entry for fd 000000000000083C [2024-08-06 20:25:48,466 I 3460 15984] (raylet.exe) client.cc:266: Erasing re-used mmap entry for fd 0000000000000834 [2024-08-06 20:25:48,494 I 3460 15984] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-06 20:25:48,684 I 3460 15984] (raylet.exe) local_object_manager.cc:490: Restored 34797 MiB, 2027 objects, read throughput 700 MiB/s [2024-08-06 20:25:49,626 I 3460 15984] (raylet.exe) node_manager.cc:656: Sending Python GC request to 30 local workers to clean up Python cyclic references. 
[2024-08-06 20:25:49,728 I 3460 15984] (raylet.exe) local_object_manager.cc:490: Restored 36291 MiB, 2114 objects, read throughput 715 MiB/s [2024-08-06 20:25:49,942 I 3460 27664] (raylet.exe) store.cc:513: Plasma store at capacity ========== Plasma store: ================= Current usage: 15.8803 / 13.9284 GB - num bytes created total: 126231051088 1 pending objects of total size 17MB - objects spillable: 4 - bytes spillable: 10912004476 - objects unsealed: 2 - bytes unsealed: 36002236 - objects in use: 281 - bytes in use: 15880319614 - objects evictable: 0 - bytes evictable: 0 - objects created by worker: 5 - bytes created by worker: 10912011088 - objects restored: 276 - bytes restored: 4968308526 - objects received: 0 - bytes received: 0 - objects errored: 0 - bytes errored: 0 [2024-08-06 20:25:50,764 I 3460 15984] (raylet.exe) local_object_manager.cc:490: Restored 38334 MiB, 2233 objects, read throughput 740 MiB/s [2024-08-06 20:25:51,786 I 3460 15984] (raylet.exe) local_object_manager.cc:490: Restored 40342 MiB, 2350 objects, read throughput 764 MiB/s [2024-08-06 20:25:51,832 I 3460 27664] (raylet.exe) store.cc:513: Plasma store at capacity ========== Plasma store: ================= Current usage: 19.8406 / 13.9284 GB - num bytes created total: 130191297154 1 pending objects of total size 2601MB - objects spillable: 0 - bytes spillable: 0 - objects unsealed: 0 - bytes unsealed: 0 - objects in use: 502 - bytes in use: 19840565680 - objects evictable: 0 - bytes evictable: 0 - objects created by worker: 5 - bytes created by worker: 10912011088 - objects restored: 497 - bytes restored: 8928554592 - objects received: 0 - bytes received: 0 - objects errored: 0 - bytes errored: 0 [2024-08-06 20:25:59,649 I 3460 15984] (raylet.exe) node_manager.cc:656: Sending Python GC request to 30 local workers to clean up Python cyclic references. 
[2024-08-06 20:26:08,599 I 3460 27664] (raylet.exe) store.cc:564: ========== Plasma store: ================= Current usage: 19.8406 / 13.9284 GB - num bytes created total: 130191297154 5 pending objects of total size 13008MB - objects spillable: 0 - bytes spillable: 0 - objects unsealed: 0 - bytes unsealed: 0 - objects in use: 502 - bytes in use: 19840565680 - objects evictable: 0 - bytes evictable: 0 - objects created by worker: 5 - bytes created by worker: 10912011088 - objects restored: 497 - bytes restored: 8928554592 - objects received: 0 - bytes received: 0 - objects errored: 0 - bytes errored: 0 [2024-08-06 20:26:08,670 I 3460 15984] (raylet.exe) node_manager.cc:525: [state-dump] NodeManager: [state-dump] Node ID: 91f33d1be5043cf02cbfa58ec1e53b8fca6bb778d09eeaf38edafed2 [state-dump] Node name: 127.0.0.1 [state-dump] InitialConfigResources: {accelerator_type:G: 10000, object_store_memory: 139284049920000, node:__internal_head__: 10000, CPU: 200000, memory: 2684354560000000, node:127.0.0.1: 10000, GPU: 10000} [state-dump] ClusterTaskManager: [state-dump] ========== Node: 91f33d1be5043cf02cbfa58ec1e53b8fca6bb778d09eeaf38edafed2 ================= [state-dump] Infeasible queue length: 0 [state-dump] Schedule queue length: 0 [state-dump] Dispatch queue length: 0 [state-dump] num_waiting_for_resource: 0 [state-dump] num_waiting_for_plasma_memory: 0 [state-dump] num_waiting_for_remote_node_resources: 0 [state-dump] num_worker_not_started_by_job_config_not_exist: 0 [state-dump] num_worker_not_started_by_registration_timeout: 0 [state-dump] num_tasks_waiting_for_workers: 0 [state-dump] num_cancelled_tasks: 0 [state-dump] cluster_resource_scheduler state: [state-dump] Local id: 4129259612629820943 Local resources: {"total":{CPU: [200000], memory: [2684354560000000], GPU: [10000], node:127.0.0.1: [10000], object_store_memory: [139284049920000], node:__internal_head__: [10000], accelerator_type:G: [10000]}}, "available": {memory: [2684354560000000], node:__internal_head__: [10000], node:127.0.0.1: [10000], GPU: [10000], accelerator_type:G: [10000], object_store_memory: [30163939040000], CPU: [160000]}}, "labels":{"ray.io/node_id":"91f33d1be5043cf02cbfa58ec1e53b8fca6bb778d09eeaf38edafed2",} is_draining: 0 is_idle: 0 Cluster resources: node id: 4129259612629820943{"total":{accelerator_type:G: 10000, GPU: 10000, node:127.0.0.1: 10000, object_store_memory: 139284049920000, CPU: 200000, memory: 2684354560000000, node:__internal_head__: 10000}}, "available": {CPU: 160000, node:127.0.0.1: 10000, memory: 2684354560000000, accelerator_type:G: 10000, object_store_memory: 30163939040000, node:__internal_head__: 10000, GPU: 10000}}, "labels":{"ray.io/node_id":"91f33d1be5043cf02cbfa58ec1e53b8fca6bb778d09eeaf38edafed2",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1} { "placment group locations": [], "node to bundles": []} [state-dump] Waiting tasks size: 1 [state-dump] Number of executing tasks: 4 [state-dump] Number of pinned task arguments: 497 [state-dump] Number of total spilled tasks: 0 [state-dump] Number of spilled waiting tasks: 0 [state-dump] Number of spilled unschedulable tasks: 0 [state-dump] Resource usage { [state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func pid=30852 worker_id=cc0e9927c76d054bdb68d9bae75bf2d21704b97ed261f5071d72cc5a): {CPU: 10000} [state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func pid=23724 worker_id=b167e0d0ccf3b4b8de788e26726ea9ace544bbd10c433e5a8f62fb1a): {CPU: 10000} [state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func pid=3504 
worker_id=54672d39444ed2c82f6b338557a73ef66d362244fb6642f4a851f8fa): {CPU: 10000} [state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func pid=28632 worker_id=9e06050ca7e4ca6f2e1794c352073a04a21c60e40db7a338efc42e8a): {CPU: 10000} [state-dump] } [state-dump] Running tasks by scheduling class: [state-dump] - {depth=1 function_descriptor={type=PythonFunctionDescriptor, module_name=modin.core.execution.ray.implementations.pandas_on_ray.partitioning.virtual_partition, class_name=, function_name=_deploy_ray_func, function_hash=be419690f62648b68cb2b8f30b30d6c2} scheduling_strategy=default_scheduling_strategy { [state-dump] } [state-dump] resource_set={CPU : 1, }}: 4/20 [state-dump] ================================================== [state-dump] [state-dump] ClusterResources: [state-dump] LocalObjectManager: [state-dump] - num pinned objects: 1 [state-dump] - pinned objects size: 6612 [state-dump] - num objects pending restore: 1 [state-dump] - num objects pending spill: 4 [state-dump] - num bytes pending spill: 10912004476 [state-dump] - num bytes currently spilled: 76850650979 [state-dump] - cumulative spill requests: 2493 [state-dump] - cumulative restore requests: 2358 [state-dump] - spilled objects pending delete: 0 [state-dump] [state-dump] ObjectManager: [state-dump] - num local objects: 502 [state-dump] - num unfulfilled push requests: 0 [state-dump] - num object pull requests: 2 [state-dump] - num chunks received total: 0 [state-dump] - num chunks received failed (all): 0 [state-dump] - num chunks received failed / cancelled: 0 [state-dump] - num chunks received failed / plasma error: 0 [state-dump] Event stats: [state-dump] Global stats: 0 total (0 active) [state-dump] Queueing time: mean = -nan(ind) s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] Execution time: mean = -nan(ind) s, total = 0.000 s [state-dump] Event stats: [state-dump] PushManager: [state-dump] - num pushes in flight: 0 [state-dump] - num chunks in flight: 0 [state-dump] - num chunks remaining: 0 [state-dump] - max chunks allowed: 409 [state-dump] OwnershipBasedObjectDirectory: [state-dump] - num listeners: 2 [state-dump] - cumulative location updates: 3078 [state-dump] - num location updates per second: 0.000 [state-dump] - num location lookups per second: 0.000 [state-dump] - num locations added per second: 0.000 [state-dump] - num locations removed per second: 0.000 [state-dump] BufferPool: [state-dump] - create buffer state map size: 0 [state-dump] PullManager: [state-dump] - num bytes available for pulled objects: 0 [state-dump] - num bytes being pulled (all): 2728001258 [state-dump] - num bytes being pulled / pinned: 139 [state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable} [state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable} [state-dump] - task request bundles: BundlePullRequestQueue{1 total, 1 active, 0 inactive, 0 unpullable} [state-dump] - first get request bundle: N/A [state-dump] - first wait request bundle: N/A [state-dump] - first task request bundle: 2 objects, 2728001258 bytes (active) [state-dump] - num objects queued: 2 [state-dump] - num objects actively pulled (all): 2 [state-dump] - num objects actively pulled / pinned: 1 [state-dump] - num bundles being pulled: 1 [state-dump] - num pull retries: 165 [state-dump] - max timeout seconds: 20 [state-dump] - max timeout request is already processed. No entry. 
[state-dump] - example obj id pending pull: c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000 [state-dump] [state-dump] WorkerPool: [state-dump] - registered jobs: 2 [state-dump] - process_failed_job_config_missing: 0 [state-dump] - process_failed_rate_limited: 0 [state-dump] - process_failed_pending_registration: 0 [state-dump] - process_failed_runtime_env_setup_failed: 0 [state-dump] - num PYTHON workers: 28 [state-dump] - num PYTHON drivers: 2 [state-dump] - num object spill callbacks queued: 0 [state-dump] - num object restore queued: 0 [state-dump] - num util functions queued: 0 [state-dump] - num idle workers: 16 [state-dump] TaskDependencyManager: [state-dump] - task deps map size: 1 [state-dump] - get req map size: 0 [state-dump] - wait req map size: 0 [state-dump] - local objects map size: 502 [state-dump] WaitManager: [state-dump] - num active wait requests: 0 [state-dump] Subscriber: [state-dump] Channel WORKER_REF_REMOVED_CHANNEL [state-dump] - cumulative subscribe requests: 0 [state-dump] - cumulative unsubscribe requests: 0 [state-dump] - active subscribed publishers: 0 [state-dump] - cumulative published messages: 0 [state-dump] - cumulative processed messages: 0 [state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL [state-dump] - cumulative subscribe requests: 4964 [state-dump] - cumulative unsubscribe requests: 4962 [state-dump] - active subscribed publishers: 1 [state-dump] - cumulative published messages: 5679 [state-dump] - cumulative processed messages: 3079 [state-dump] Channel WORKER_OBJECT_EVICTION [state-dump] - cumulative subscribe requests: 2498 [state-dump] - cumulative unsubscribe requests: 0 [state-dump] - active subscribed publishers: 1 [state-dump] - cumulative published messages: 0 [state-dump] - cumulative processed messages: 0 [state-dump] num async plasma notifications: 0 [state-dump] Remote node managers: [state-dump] Event stats: [state-dump] Global stats: 87622 total (58 active) [state-dump] Queueing time: mean = 17.792 ms, max = 210.608 s, min = -0.000 s, total = 1558.987 s [state-dump] Execution time: mean = 33.191 ms, total = 2908.279 s [state-dump] Event stats: [state-dump] NodeManagerService.grpc_server.ReportWorkerBacklog.HandleRequestImpl - 9311 total (0 active), Execution time: mean = 35.079 us, total = 326.620 ms, Queueing time: mean = 1.055 ms, max = 281.060 ms, min = 1.209 us, total = 9.821 s [state-dump] NodeManagerService.grpc_server.ReportWorkerBacklog - 9311 total (0 active), Execution time: mean = 1.278 ms, total = 11.898 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManager.GlobalGC - 6490 total (0 active), Execution time: mean = 514.741 ns, total = 3.341 ms, Queueing time: mean = 314.358 us, max = 77.488 ms, min = 3.048 us, total = 2.040 s [state-dump] NodeManager.SpillObjects - 6490 total (0 active), Execution time: mean = 54.088 us, total = 351.034 ms, Queueing time: mean = 313.489 us, max = 77.492 ms, min = 1.852 us, total = 2.035 s [state-dump] ObjectManager.ObjectAdded - 4878 total (0 active), Execution time: mean = 475.249 us, total = 2.318 s, Queueing time: mean = 754.876 us, max = 297.982 ms, min = 2.166 us, total = 3.682 s [state-dump] ObjectManager.ObjectDeleted - 4376 total (0 active), Execution time: mean = 5.417 us, total = 23.705 ms, Queueing time: mean = 9.153 ms, max = 234.522 ms, min = 9.769 us, total = 40.055 s [state-dump] NodeManager.CheckGC - 3280 total (1 active), Execution time: mean = 70.158 us, total = 230.119 ms, Queueing time: mean = 9.710 
ms, max = 367.464 ms, min = -0.000 s, total = 31.849 s [state-dump] ObjectManager.UpdateAvailableMemory - 3280 total (0 active), Execution time: mean = 141.910 us, total = 465.464 ms, Queueing time: mean = 570.519 us, max = 123.836 ms, min = 2.025 us, total = 1.871 s [state-dump] RaySyncer.OnDemandBroadcasting - 3279 total (1 active), Execution time: mean = 140.660 us, total = 461.225 ms, Queueing time: mean = 9.673 ms, max = 367.246 ms, min = -0.000 s, total = 31.718 s [state-dump] Subscriber.HandlePublishedMessage_WORKER_OBJECT_LOCATIONS_CHANNEL - 3079 total (0 active), Execution time: mean = 68.904 us, total = 212.155 ms, Queueing time: mean = 12.185 ms, max = 100.138 ms, min = 115.347 us, total = 37.518 s [state-dump] CoreWorkerService.grpc_client.UpdateObjectLocationBatch.OnReplyReceived - 2691 total (0 active), Execution time: mean = 24.326 us, total = 65.461 ms, Queueing time: mean = 996.138 us, max = 421.228 ms, min = 2.361 us, total = 2.681 s [state-dump] CoreWorkerService.grpc_client.UpdateObjectLocationBatch - 2691 total (0 active), Execution time: mean = 935.186 us, total = 2.517 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.PubsubCommandBatch.OnReplyReceived - 2568 total (0 active), Execution time: mean = 28.567 us, total = 73.360 ms, Queueing time: mean = 148.533 us, max = 131.745 ms, min = 2.125 us, total = 381.434 ms [state-dump] CoreWorkerService.grpc_client.PubsubCommandBatch - 2568 total (0 active), Execution time: mean = 878.241 us, total = 2.255 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.PinObjectIDs - 2498 total (0 active), Execution time: mean = 2.298 ms, total = 5.741 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.PinObjectIDs.HandleRequestImpl - 2498 total (0 active), Execution time: mean = 770.011 us, total = 1.923 s, Queueing time: mean = 1.352 ms, max = 420.572 ms, min = 3.387 us, total = 3.378 s [state-dump] CoreWorkerService.grpc_client.RestoreSpilledObjects - 2359 total (1 active), Execution time: mean = 88.143 ms, total = 207.930 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.RestoreSpilledObjects.OnReplyReceived - 2358 total (0 active), Execution time: mean = 131.367 us, total = 309.764 ms, Queueing time: mean = 880.094 us, max = 92.354 ms, min = 3.837 us, total = 2.075 s [state-dump] RaySyncer.BroadcastMessage - 1963 total (0 active), Execution time: mean = 177.938 us, total = 349.293 ms, Queueing time: mean = 1.106 us, max = 458.370 us, min = 235.000 ns, total = 2.171 ms [state-dump] - 1963 total (0 active), Execution time: mean = 717.277 ns, total = 1.408 ms, Queueing time: mean = 184.777 us, max = 48.461 ms, min = 2.896 us, total = 362.717 ms [state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 1750 total (1 active), Execution time: mean = 14.921 us, total = 26.113 ms, Queueing time: mean = 5.770 ms, max = 286.796 ms, min = -0.000 s, total = 10.098 s [state-dump] CoreWorkerService.grpc_client.GetCoreWorkerStats - 1662 total (6 active), Execution time: mean = 176.887 ms, total = 293.985 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.GetCoreWorkerStats.OnReplyReceived - 1656 total (0 active), Execution time: mean = 
13.485 us, total = 22.331 ms, Queueing time: mean = 1.462 ms, max = 61.248 ms, min = 1.310 us, total = 2.422 s
[state-dump] CoreWorkerService.grpc_client.LocalGC - 732 total (2 active), Execution time: mean = 331.566 ms, total = 242.706 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] CoreWorkerService.grpc_client.LocalGC.OnReplyReceived - 730 total (0 active), Execution time: mean = 22.769 us, total = 16.622 ms, Queueing time: mean = 3.547 ms, max = 59.582 ms, min = 4.353 us, total = 2.589 s
[state-dump] NodeManager.deadline_timer.spill_objects_when_over_threshold - 357 total (1 active), Execution time: mean = 172.673 us, total = 61.644 ms, Queueing time: mean = 10.099 ms, max = 114.737 ms, min = -0.000 s, total = 3.605 s
[state-dump] NodeManagerService.grpc_server.GetResourceLoad - 357 total (0 active), Execution time: mean = 1.571 ms, total = 560.770 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.flush_free_objects - 357 total (1 active), Execution time: mean = 4.885 us, total = 1.744 ms, Queueing time: mean = 10.267 ms, max = 114.777 ms, min = -0.000 s, total = 3.665 s
[state-dump] NodeManagerService.grpc_server.GetResourceLoad.HandleRequestImpl - 357 total (0 active), Execution time: mean = 78.399 us, total = 27.989 ms, Queueing time: mean = 1.300 ms, max = 328.481 ms, min = 4.832 us, total = 464.181 ms
[state-dump] NodeManager.ScheduleAndDispatchTasks - 357 total (1 active), Execution time: mean = 145.524 us, total = 51.952 ms, Queueing time: mean = 10.142 ms, max = 114.775 ms, min = -0.000 s, total = 3.621 s
[state-dump] ClientConnection.async_read.ProcessMessageHeader - 152 total (30 active), Execution time: mean = 9.004 us, total = 1.369 ms, Queueing time: mean = 8.929 s, max = 210.608 s, min = 9.363 us, total = 1357.187 s
[state-dump] ClientConnection.async_read.ProcessMessage - 122 total (0 active), Execution time: mean = 3.524 ms, total = 429.985 ms, Queueing time: mean = 283.925 us, max = 16.113 ms, min = 3.866 us, total = 34.639 ms
[state-dump] ClusterResourceManager.ResetRemoteNodeView - 120 total (1 active), Execution time: mean = 7.903 us, total = 948.407 us, Queueing time: mean = 8.048 ms, max = 28.640 ms, min = 311.940 us, total = 965.783 ms
[state-dump] CoreWorkerService.grpc_client.SpillObjects - 84 total (1 active), Execution time: mean = 1.793 s, total = 150.607 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] CoreWorkerService.grpc_client.SpillObjects.OnReplyReceived - 83 total (0 active), Execution time: mean = 3.195 ms, total = 265.217 ms, Queueing time: mean = 2.841 ms, max = 232.712 ms, min = 4.917 us, total = 235.819 ms
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.CheckAlive.OnReplyReceived - 72 total (0 active), Execution time: mean = 30.322 us, total = 2.183 ms, Queueing time: mean = 562.975 us, max = 37.758 ms, min = 6.613 us, total = 40.534 ms
[state-dump] NodeManager.GcsCheckAlive - 72 total (1 active), Execution time: mean = 229.791 us, total = 16.545 ms, Queueing time: mean = 10.147 ms, max = 74.457 ms, min = -0.000 s, total = 730.574 ms
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.CheckAlive - 72 total (0 active), Execution time: mean = 3.372 ms, total = 242.801 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.record_metrics - 72 total (1 active), Execution time: mean = 371.240 us, total = 26.729 ms, Queueing time: mean = 10.020 ms, max = 74.247 ms, min = -0.000 s, total = 721.432 ms
[state-dump] CoreWorkerService.grpc_client.PubsubLongPolling - 68 total (1 active), Execution time: mean = 4.973 s, total = 338.132 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] CoreWorkerService.grpc_client.PubsubLongPolling.OnReplyReceived - 67 total (0 active), Execution time: mean = 408.515 us, total = 27.371 ms, Queueing time: mean = 4.113 ms, max = 131.855 ms, min = 5.629 us, total = 275.559 ms
[state-dump] NodeManagerService.grpc_server.GetNodeStats - 63 total (3 active), Execution time: mean = 3.868 s, total = 243.673 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManagerService.grpc_server.GetNodeStats.HandleRequestImpl - 63 total (0 active), Execution time: mean = 3.383 ms, total = 213.116 ms, Queueing time: mean = 1.373 ms, max = 30.136 ms, min = 6.855 us, total = 86.477 ms
[state-dump] NodeManager.deadline_timer.debug_state_dump - 36 total (1 active), Execution time: mean = 3.358 ms, total = 120.896 ms, Queueing time: mean = 13.532 ms, max = 55.986 ms, min = 1.270 us, total = 487.137 ms
[state-dump] ClientConnection.async_write.DoAsyncWrites - 32 total (0 active), Execution time: mean = 1.644 us, total = 52.623 us, Queueing time: mean = 96.355 us, max = 185.153 us, min = 39.571 us, total = 3.083 ms
[state-dump] NodeManagerService.grpc_server.GetSystemConfig.HandleRequestImpl - 30 total (0 active), Execution time: mean = 59.800 us, total = 1.794 ms, Queueing time: mean = 32.822 us, max = 320.264 us, min = 7.032 us, total = 984.654 us
[state-dump] NodeManagerService.grpc_server.GetSystemConfig - 30 total (0 active), Execution time: mean = 301.116 us, total = 9.033 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManagerService.grpc_server.RequestWorkerLease.HandleRequestImpl - 21 total (0 active), Execution time: mean = 1.966 ms, total = 41.282 ms, Queueing time: mean = 94.438 ms, max = 130.977 ms, min = 11.763 us, total = 1.983 s
[state-dump] NodeManagerService.grpc_server.RequestWorkerLease - 21 total (1 active), Execution time: mean = 65.849 s, total = 1382.836 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] WorkerPool.PopWorkerCallback - 20 total (0 active), Execution time: mean = 25.028 ms, total = 500.556 ms, Queueing time: mean = 3.421 ms, max = 37.652 ms, min = 13.397 us, total = 68.422 ms
[state-dump] NodeManagerService.grpc_server.ReturnWorker.HandleRequestImpl - 16 total (0 active), Execution time: mean = 16.158 ms, total = 258.532 ms, Queueing time: mean = 2.931 ms, max = 24.837 ms, min = 9.385 us, total = 46.896 ms
[state-dump] NodeManagerService.grpc_server.ReturnWorker - 16 total (0 active), Execution time: mean = 19.295 ms, total = 308.720 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] PeriodicalRunner.RunFnPeriodically - 12 total (0 active), Execution time: mean = 421.867 us, total = 5.062 ms, Queueing time: mean = 11.223 ms, max = 45.753 ms, min = 48.400 us, total = 134.671 ms
[state-dump] NodeManager.deadline_timer.print_event_loop_stats - 6 total (1 active, 1 running), Execution time: mean = 2.176 ms, total = 13.055 ms, Queueing time: mean = 7.311 ms, max = 13.249 ms, min = 3.219 ms, total = 43.864 ms
[state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 3 total (1 active), Execution time: mean = 5.184 s, total = 15.553 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll.OnReplyReceived - 2 total (0 active), Execution time: mean = 168.188 us, total = 336.375 us, Queueing time: mean = 12.877 us, max = 15.365 us, min = 10.388 us, total = 25.753 us
[state-dump] ray::rpc::JobInfoGcsService.grpc_client.AddJob - 2 total (0 active), Execution time: mean = 788.298 us, total = 1.577 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), Execution time: mean = 1.472 ms, total = 2.943 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 2 total (0 active), Execution time: mean = 276.050 us, total = 552.100 us, Queueing time: mean = 1.588 ms, max = 2.726 ms, min = 449.900 us, total = 3.176 ms
[state-dump] ray::rpc::JobInfoGcsService.grpc_client.AddJob.OnReplyReceived - 2 total (0 active), Execution time: mean = 78.681 us, total = 157.362 us, Queueing time: mean = 171.833 us, max = 197.168 us, min = 146.499 us, total = 343.667 us
[state-dump] Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 2 total (0 active), Execution time: mean = 50.278 us, total = 100.556 us, Queueing time: mean = 188.075 us, max = 253.285 us, min = 122.864 us, total = 376.149 us
[state-dump] RaySyncerRegister - 2 total (0 active), Execution time: mean = 5.250 us, total = 10.500 us, Queueing time: mean = 1.900 us, max = 3.500 us, min = 300.000 ns, total = 3.800 us
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 834.100 us, total = 834.100 us, Queueing time: mean = 135.700 us, max = 135.700 us, min = 135.700 us, total = 135.700 us
[state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 10.700 us, total = 10.700 us, Queueing time: mean = 283.100 us, max = 283.100 us, min = 283.100 us, total = 283.100 us
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 1.176 ms, total = 1.176 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), Execution time: mean = 677.000 us, total = 677.000 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 952.700 us, total = 952.700 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 267.500 us, total = 267.500 us, Queueing time: mean = 313.800 us, max = 313.800 us, min = 313.800 us, total = 313.800 us
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), Execution time: mean = 1.305 ms, total = 1.305 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 65.803 ms, total = 65.803 ms, Queueing time: mean = 113.400 us, max = 113.400 us, min = 113.400 us, total = 113.400 us
[state-dump] DebugString() time ms: 1
[state-dump]
[state-dump]
[2024-08-06 20:26:09,759 I 3460 15984] (raylet.exe) node_manager.cc:656: Sending Python GC request to 30 local workers to clean up Python cyclic references.
[2024-08-06 20:26:14,896 I 3460 15984] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 83696 MiB, 2497 objects, write throughput 596 MiB/s.
[2024-08-06 20:26:14,896 I 3460 27664] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024B41710000, 3221225480)
[2024-08-06 20:26:14,896 I 3460 15984] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-06 20:26:15,169 I 3460 27664] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024E01740000, 8589934600)
[2024-08-06 20:26:15,454 I 3460 27664] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024C01720000, 4294967304)
[2024-08-06 20:26:15,783 I 3460 27664] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024D01730000, 4294967304)
[2024-08-06 20:26:18,108 I 3460 27664] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2728001119
[2024-08-06 20:26:18,109 I 3460 27664] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2728001119
[2024-08-06 20:26:18,109 I 3460 27664] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2728001119
[2024-08-06 20:26:18,117 C 3460 27664] (raylet.exe) dlmalloc.cc:129: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455
*** StackTrace Information ***
unknown
unknown
unknown
unknown
unknown

What can I do to make this work? Ingest the data in smaller parts? Use more partitions?

In the end this code will run on a machine with 500 GB of RAM, but it will also be processing datasets that are larger than 200 GB.
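For completeness, this is roughly how I could cap the object store and point spilling at a drive with free space (a minimal sketch; the sizes and the D:/ray_spill path are placeholders, and I do not know whether this actually avoids the Windows fallback allocation path):

import json

import ray

# Placeholder sizes: keep the plasma store well below physical RAM and
# spill to a drive with plenty of free space, instead of relying on the
# filesystem-backed fallback allocation that crashes here.
ray.init(
    object_store_memory=16 * 1024 * 1024 * 1024,  # 16 GiB object store
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "D:/ray_spill"}}
        )
    },
)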

Versions / Dependencies

modin : 0.31.0
ray : 2.34.0

python : 3.11.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_Austria.1252

pandas : 2.2.2
numpy : 1.26.4

Reproduction script

I cannot provide a surefire reproduction script because this is highly dependent on the hardware.

Here's my best attempt:

import math
import os

# Modin must be told to use the Ray engine before modin.pandas is imported.
os.environ["MODIN_ENGINE"] = "ray"

import ray

# Allow Ray to use up to 250 GiB of logical memory.
ray.init(_memory=250 * 1024 * 1024 * 1024)

import modin.pandas as pd
import numpy as np

def generate_dataframes(num_dataframes, num_rows, num_columns):
    """Create a list of random dataframes; together they far exceed RAM."""
    dataframes = []
    for _ in range(num_dataframes):
        df = pd.DataFrame(np.random.rand(num_rows, num_columns))
        dataframes.append(df)
    return dataframes

# Parameters: 1.2 billion rows of 12 float64 columns, roughly 115 GB of data.
num_dataframes = 1200
num_rows = 1000000
num_columns = 12

dataframes = generate_dataframes(num_dataframes, num_rows, num_columns)

print('concatenating dataframes')
big_df = pd.concat(dataframes, ignore_index=True)

print('adding new column')
big_df['new_col'] = big_df[0].apply(lambda x: int(math.floor(x)))

print('saving to parquet')
big_df.to_parquet('big_df.parquet', partition_cols='new_col')
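As a variation, the same volume of data could probably be produced without ever concatenating it, writing chunk by chunk with plain pandas and pyarrow (a sketch that sidesteps Modin entirely; 'big_df_chunks.parquet' is a placeholder name):

import numpy as np
import pandas as pd  # plain pandas here, not modin
import pyarrow as pa
import pyarrow.parquet as pq

writer = None
for _ in range(1200):
    chunk = pd.DataFrame(np.random.rand(1_000_000, 12))
    table = pa.Table.from_pandas(chunk)
    if writer is None:
        # create the output file lazily so the schema comes from the first chunk
        writer = pq.ParquetWriter("big_df_chunks.parquet", table.schema)
    writer.write_table(table)
writer.close()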

Issue Severity

High: It blocks me from completing my task.

mattip commented 2 months ago

Thanks for opening a new issue. Is this issue Windows-specific? Can you try the same data/code on a similar machine running Linux?

mattip commented 2 months ago

The dashboard resource reporting problems should be solved by #45578, which is not (yet) part of a release. Does the dashboard work properly if you manually apply the changes there?

Liquidmasl commented 2 months ago

Thanks for opening a new issue. Is this issue Windows-specific? Can you try the same data/code on a similar machine running Linux?

I can try setting this all up in a Docker container, but I assume that's not straightforward at all. Other than that, I just have a Linux machine with A LOT more RAM at hand.

In https://github.com/modin-project/modin/issues/7360#issuecomment-2273836170 I got a response suggesting the issue is that I am simply running out of RAM because I used the default number of partitions (== logical processors), so during parallel processing everything is loaded into RAM at once. Could this be the issue here? I will try a higher number of partitions; a sketch of how I set that is below.
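For reference, this is how I understand the partition count is raised (modin.config.NPartitions is Modin's config API; 64 is just an arbitrary value to try):

import os
os.environ["MODIN_ENGINE"] = "ray"

import modin.config as cfg

# Must run before any dataframes are created; the default equals the
# number of logical CPUs (20 on this machine).
cfg.NPartitions.put(64)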

The dashboard resource reporting problems should be solved by #45578, which is not (yet) part of a release. Does the dashboard work properly if you manually apply the changes there?

I might not have the capacity to try, but it's good to know it's in the works. For now it's not a dealbreaker, thank you for pointing the way!

Liquidmasl commented 2 months ago

The proposed idea (increasing partitions) did not help. If anything, I feel like it made things worse.

Liquidmasl commented 2 months ago

As another example

I load the same large dataset as I did before, then try this apply:

pcd['z_partition'] = pcd['z'].apply(lambda x: int(math.floor(x)))

As I understand it, and I am not sure I do, this should not need to load everything into memory, since it applies to rows only and should be fine to run in parallel... right? I really tried to understand this from your documentation, but I remain confused.
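If the parallel apply itself is the problem, I could presumably stream the column outside of Modin with pyarrow instead; a rough sketch (the file name, column name, and batch size are placeholders, untested on my data):

import math

import pyarrow.parquet as pq

pf = pq.ParquetFile("big_df.parquet")  # placeholder file name
# Stream record batches instead of materializing the whole dataset at once.
for batch in pf.iter_batches(batch_size=1_000_000, columns=["z"]):
    z = batch.to_pandas()["z"]
    z_partition = z.map(lambda v: int(math.floor(v)))
    # ...append the derived column to an output file chunk by chunk...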

Anyway, I do run into the same issue again while my RAM hovers around 60% (screenshot: Screenshot 2024-08-08 131623).

C: has enough free storage, _storage is set to 250 GB, and virtual memory in Windows is set to a fixed 120 GB ... I wish I understood the logs. raylet.out tail:

========== Plasma store: =================
Current usage: 4.32794 / 16.3893 GB
- num bytes created total: 70525122415
1 pending objects of total size 863MB
- objects spillable: 2
- bytes spillable: 1809869980
- objects unsealed: 0
- bytes unsealed: 0
- objects in use: 4192
- bytes in use: 4327937765
- objects evictable: 0
- bytes evictable: 0
- objects created by worker: 3
- bytes created by worker: 1809876592
- objects restored: 4189
- bytes restored: 2518061173
- objects received: 0
- bytes received: 0
- objects errored: 0
- bytes errored: 0
[2024-08-08 13:15:51,607 I 28540 1428] (raylet.exe) node_manager.cc:656: Sending Python GC request to 29 local workers to clean up Python cyclic references.
[2024-08-08 13:15:53,767 I 28540 1428] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 57069 MiB, 74058 objects, write throughput 1249 MiB/s.
[2024-08-08 13:15:53,771 I 28540 3680] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024120E10000, 8589934600)
[2024-08-08 13:15:53,802 I 28540 1428] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-08 13:15:53,851 I 28540 3680] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024320E20000, 17179869192)
[2024-08-08 13:15:55,938 I 28540 3680] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 904934126
[2024-08-08 13:15:55,969 I 28540 1428] (raylet.exe) local_object_manager.cc:490: Restored 10188 MiB, 17773 objects, read throughput 134 MiB/s
[2024-08-08 13:15:56,171 I 28540 1428] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-08 13:15:57,744 I 28540 1428] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 57932 MiB, 74059 objects, write throughput 1225 MiB/s.
[2024-08-08 13:15:57,745 I 28540 3680] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024120E10000, 34359738376)
[2024-08-08 13:15:57,760 I 28540 1428] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-08 13:15:59,829 I 28540 3680] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 904934990
[2024-08-08 13:15:59,851 I 28540 1428] (raylet.exe) local_object_manager.cc:490: Restored 10190 MiB, 17777 objects, read throughput 127 MiB/s
[2024-08-08 13:16:00,053 I 28540 1428] (raylet.exe) client.cc:266: Erasing re-used mmap entry for fd 0000000000000868
[2024-08-08 13:16:00,079 I 28540 1428] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-08 13:16:00,853 I 28540 1428] (raylet.exe) local_object_manager.cc:490: Restored 10547 MiB, 18400 objects, read throughput 130 MiB/s
[2024-08-08 13:16:01,633 I 28540 1428] (raylet.exe) node_manager.cc:656: Sending Python GC request to 29 local workers to clean up Python cyclic references.
[2024-08-08 13:16:01,854 I 28540 1428] (raylet.exe) local_object_manager.cc:490: Restored 11118 MiB, 19396 objects, read throughput 135 MiB/s
[2024-08-08 13:16:02,795 I 28540 3680] (raylet.exe) store.cc:513: Plasma store at capacity
========== Plasma store: =================
Current usage: 3.56607 / 16.3893 GB
- num bytes created total: 73961601986
1 pending objects of total size 863MB
- objects spillable: 1
- bytes spillable: 904934990
- objects unsealed: 0
- bytes unsealed: 0
- objects in use: 4429
- bytes in use: 3566066865
- objects evictable: 0
- bytes evictable: 0
- objects created by worker: 2
- bytes created by worker: 904941602
- objects restored: 4427
- bytes restored: 2661125263
- objects received: 0
- bytes received: 0
- objects errored: 0
- bytes errored: 0
[2024-08-08 13:16:04,525 I 28540 1428] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 58795 MiB, 74060 objects, write throughput 1199 MiB/s.
[2024-08-08 13:16:04,540 I 28540 3680] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024120E10000, 68719476744)
[2024-08-08 13:16:04,587 I 28540 1428] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-08 13:16:06,529 I 28540 3680] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 904934126
[2024-08-08 13:16:06,531 C 28540 3680] (raylet.exe) dlmalloc.cc:129: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455
*** StackTrace Information ***
unknown
unknown
unknown
jjyao commented 2 months ago

I think you hit the fallback allocation code path, which is not tested on Windows. Are you able to use Linux or a larger Windows machine with more memory?

Liquidmasl commented 2 months ago

I think you hit the fallback allocation code path, which is not tested on Windows. Are you able to use Linux or a larger Windows machine with more memory?

This code will run in Docker in production, and I have a Linux dev machine available (most of the time). Since I will be using Linux, all of this should be more or less a walk in the park.

Maybe a big fat warning sign somewhere saying, if possible, not to use Windows would be great, to save someone's sanity haha

Thank you.