
Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

[RAY] to_parquet() fails when spilled objects reach 64gig... Also my data is just 40gig #7360

Open Liquidmasl opened 1 month ago

Liquidmasl commented 1 month ago

I am at a loss here.

I load multiple batches of data into Modin DataFrames, concat them together into one large Modin DataFrame, and then try to save it with to_parquet().

I do that after importing Ray and calling init() on it manually, before importing Modin (because I noticed it works better that way, for some reason; see: https://github.com/modin-project/modin/issues/7359).
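Roughly, the ordering looks like this (a minimal sketch; the object store size and the input paths are placeholders, not the values from this report):

import ray

# Initialize Ray by hand *before* importing Modin, as described above.
ray.init(object_store_memory=16 * 1024**3)  # placeholder size

import modin.pandas as pd  # only imported after ray.init()

batch_paths = ["batch_0.parquet", "batch_1.parquet"]  # placeholder paths
batches = [pd.read_parquet(p) for p in batch_paths]
df = pd.concat(batches, ignore_index=True)
df.to_parquet("output.parquet")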

This (suddenly) fails with:

(raylet.exe) dlmalloc.cc:129:  Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1450

Raylet.out tail is:

[2024-08-05 17:47:19,270 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 20387 MiB, 11869 objects, read throughput 193 MiB/s [2024-08-05 17:47:23,156 I 9128 18496] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 71986 MiB, 24698 objects, write throughput 786 MiB/s. [2024-08-05 17:47:23,165 I 9128 32964] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000022AFDE70000, 4294967304) [2024-08-05 17:47:23,181 I 9128 18496] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-05 17:47:23,689 I 9128 18496] (raylet.exe) node_manager.cc:656: Sending Python GC request to 29 local workers to clean up Python cyclic references. [2024-08-05 17:47:25,362 I 9128 32964] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2221201143 [2024-08-05 17:47:25,369 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 20393 MiB, 11873 objects, read throughput 183 MiB/s [2024-08-05 17:47:26,306 I 9128 18496] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-05 17:47:26,370 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 21410 MiB, 12465 objects, read throughput 190 MiB/s [2024-08-05 17:47:26,525 I 9128 18496] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-05 17:47:27,030 I 9128 32964] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000022F45EC0000, 8589934600) [2024-08-05 17:47:27,461 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 21565 MiB, 12555 objects, read throughput 190 MiB/s [2024-08-05 17:47:28,464 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 22714 MiB, 13224 objects, read throughput 198 MiB/s [2024-08-05 17:47:29,468 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 23631 MiB, 13758 objects, read throughput 204 MiB/s [2024-08-05 17:47:30,471 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 24823 MiB, 14452 objects, read throughput 213 MiB/s [2024-08-05 17:47:31,476 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 25959 MiB, 15113 objects, read throughput 220 MiB/s [2024-08-05 17:47:32,298 I 9128 32964] (raylet.exe) store.cc:513: Plasma store at capacity ========== Plasma store: ================= Current usage: 12.5164 / 16.3946 GB - num bytes created total: 110237068996 1 pending objects of total size 2118MB - objects spillable: 1 - bytes spillable: 2221201143 - objects unsealed: 2 - bytes unsealed: 3602226 - objects in use: 5719 - bytes in use: 12516370955 - objects evictable: 0 - bytes evictable: 0 - objects created by worker: 3 - bytes created by worker: 2221211049 - objects restored: 5716 - bytes restored: 10295159906 - objects received: 0 - bytes received: 0 - objects errored: 0 - bytes errored: 0 [2024-08-05 17:47:33,749 I 9128 18496] (raylet.exe) node_manager.cc:656: Sending Python GC request to 29 local workers to clean up Python cyclic references. 
[2024-08-05 17:47:38,373 I 9128 32964] (raylet.exe) store.cc:564: ========== Plasma store: ================= Current usage: 12.5164 / 16.3946 GB - num bytes created total: 110237068996 6 pending objects of total size 4243MB - objects spillable: 1 - bytes spillable: 2221201143 - objects unsealed: 0 - bytes unsealed: 0 - objects in use: 5719 - bytes in use: 12516370955 - objects evictable: 0 - bytes evictable: 0 - objects created by worker: 3 - bytes created by worker: 2221211049 - objects restored: 5716 - bytes restored: 10295159906 - objects received: 0 - bytes received: 0 - objects errored: 0 - bytes errored: 0 [2024-08-05 17:47:38,533 I 9128 18496] (raylet.exe) node_manager.cc:525: [state-dump] NodeManager: [state-dump] Node ID: 769d8d10c33d0aa65d673cc59ae36aa2a415bc98bedc9abf9c87740d [state-dump] Node name: 127.0.0.1 [state-dump] InitialConfigResources: {node:127.0.0.1: 10000, GPU: 10000, CPU: 200000, accelerator_type:G: 10000, node:__internal_head__: 10000, object_store_memory: 163945525240000, memory: 327891050510000} [state-dump] ClusterTaskManager: [state-dump] ========== Node: 769d8d10c33d0aa65d673cc59ae36aa2a415bc98bedc9abf9c87740d ================= [state-dump] Infeasible queue length: 0 [state-dump] Schedule queue length: 0 [state-dump] Dispatch queue length: 0 [state-dump] num_waiting_for_resource: 0 [state-dump] num_waiting_for_plasma_memory: 0 [state-dump] num_waiting_for_remote_node_resources: 0 [state-dump] num_worker_not_started_by_job_config_not_exist: 0 [state-dump] num_worker_not_started_by_registration_timeout: 0 [state-dump] num_tasks_waiting_for_workers: 0 [state-dump] num_cancelled_tasks: 0 [state-dump] cluster_resource_scheduler state: [state-dump] Local id: 3413610774594749257 Local resources: {"total":{GPU: [10000], memory: [327891050510000], node:__internal_head__: [10000], node:127.0.0.1: [10000], accelerator_type:G: [10000], object_store_memory: [163945525240000], CPU: [200000]}}, "available": {memory: [327891050510000], GPU: [10000], node:__internal_head__: [10000], node:127.0.0.1: [10000], accelerator_type:G: [10000], object_store_memory: [141733447690000], CPU: [170000]}}, "labels":{"ray.io/node_id":"769d8d10c33d0aa65d673cc59ae36aa2a415bc98bedc9abf9c87740d",} is_draining: 0 is_idle: 0 Cluster resources: node id: 3413610774594749257{"total":{accelerator_type:G: 10000, object_store_memory: 163945525240000, memory: 327891050510000, GPU: 10000, node:127.0.0.1: 10000, node:__internal_head__: 10000, CPU: 200000}}, "available": {accelerator_type:G: 10000, object_store_memory: 141733447690000, node:127.0.0.1: 10000, node:__internal_head__: 10000, CPU: 170000, memory: 327891050510000, GPU: 10000}}, "labels":{"ray.io/node_id":"769d8d10c33d0aa65d673cc59ae36aa2a415bc98bedc9abf9c87740d",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1} { "placment group locations": [], "node to bundles": []} [state-dump] Waiting tasks size: 17 [state-dump] Number of executing tasks: 3 [state-dump] Number of pinned task arguments: 3704 [state-dump] Number of total spilled tasks: 0 [state-dump] Number of spilled waiting tasks: 0 [state-dump] Number of spilled unschedulable tasks: 0 [state-dump] Resource usage { [state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func pid=32748 worker_id=6ddb3c2d71dd94e0f0088bf37e3fcf1a670358cc7b455f4c1fee6b21): {CPU: 10000} [state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func pid=38700 worker_id=0ed61c2030588887b62701984884036d6394165acc00d7284cb41d07): {CPU: 10000} [state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func 
pid=26356 worker_id=10571ed2b2a722f937d05a922ea592292840df519e1ae3aa25732572): {CPU: 10000} [state-dump] } [state-dump] Running tasks by scheduling class: [state-dump] - {depth=1 function_descriptor={type=PythonFunctionDescriptor, module_name=modin.core.execution.ray.implementations.pandas_on_ray.partitioning.virtual_partition, class_name=, function_name=_deploy_ray_func, function_hash=2f63268387a243db90a5dac06c8ea5d7} scheduling_strategy=default_scheduling_strategy { [state-dump] } [state-dump] resource_set={CPU : 1, }}: 3/20 [state-dump] ================================================== [state-dump] [state-dump] ClusterResources: [state-dump] LocalObjectManager: [state-dump] - num pinned objects: 1 [state-dump] - pinned objects size: 6612 [state-dump] - num objects pending restore: 1688 [state-dump] - num objects pending spill: 1 [state-dump] - num bytes pending spill: 2221201143 [state-dump] - num bytes currently spilled: 62177759756 [state-dump] - cumulative spill requests: 24698 [state-dump] - cumulative restore requests: 15596 [state-dump] - spilled objects pending delete: 0 [state-dump] [state-dump] ObjectManager: [state-dump] - num local objects: 5719 [state-dump] - num unfulfilled push requests: 0 [state-dump] - num object pull requests: 9885 [state-dump] - num chunks received total: 0 [state-dump] - num chunks received failed (all): 0 [state-dump] - num chunks received failed / cancelled: 0 [state-dump] - num chunks received failed / plasma error: 0 [state-dump] Event stats: [state-dump] Global stats: 8 total (0 active) [state-dump] Queueing time: mean = 32.360 us, max = 80.258 us, min = 10.471 us, total = 258.883 us [state-dump] Execution time: mean = 4.537 us, total = 36.296 us [state-dump] Event stats: [state-dump] ObjectManager.FreeObjects - 8 total (0 active), Execution time: mean = 4.537 us, total = 36.296 us, Queueing time: mean = 32.360 us, max = 80.258 us, min = 10.471 us, total = 258.883 us [state-dump] PushManager: [state-dump] - num pushes in flight: 0 [state-dump] - num chunks in flight: 0 [state-dump] - num chunks remaining: 0 [state-dump] - max chunks allowed: 409 [state-dump] OwnershipBasedObjectDirectory: [state-dump] - num listeners: 9885 [state-dump] - cumulative location updates: 31436 [state-dump] - num location updates per second: 0.000 [state-dump] - num location lookups per second: 0.000 [state-dump] - num locations added per second: 0.000 [state-dump] - num locations removed per second: 0.000 [state-dump] BufferPool: [state-dump] - create buffer state map size: 0 [state-dump] PullManager: [state-dump] - num bytes available for pulled objects: 3878181569 [state-dump] - num bytes being pulled (all): 6667728944 [state-dump] - num bytes being pulled / pinned: 3627450788 [state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable} [state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable} [state-dump] - task request bundles: BundlePullRequestQueue{17 total, 3 active, 14 inactive, 0 unpullable} [state-dump] - first get request bundle: N/A [state-dump] - first wait request bundle: N/A [state-dump] - first task request bundle: 1236 objects, 2222582914 bytes (inactive, waiting for capacity) [state-dump] - num objects queued: 9885 [state-dump] - num objects actively pulled (all): 3704 [state-dump] - num objects actively pulled / pinned: 2016 [state-dump] - num bundles being pulled: 3 [state-dump] - num pull retries: 15509 [state-dump] - max timeout seconds: 40 [state-dump] - 
max timeout request is already processed. No entry. [state-dump] - example obj id pending pull: 00ffffffffffffffffffffffffffffffffffffff010000000b23f605 [state-dump] [state-dump] WorkerPool: [state-dump] - registered jobs: 1 [state-dump] - process_failed_job_config_missing: 0 [state-dump] - process_failed_rate_limited: 0 [state-dump] - process_failed_pending_registration: 0 [state-dump] - process_failed_runtime_env_setup_failed: 0 [state-dump] - num PYTHON workers: 28 [state-dump] - num PYTHON drivers: 1 [state-dump] - num object spill callbacks queued: 0 [state-dump] - num object restore queued: 1684 [state-dump] - num util functions queued: 0 [state-dump] - num idle workers: 17 [state-dump] TaskDependencyManager: [state-dump] - task deps map size: 17 [state-dump] - get req map size: 0 [state-dump] - wait req map size: 0 [state-dump] - local objects map size: 5719 [state-dump] WaitManager: [state-dump] - num active wait requests: 0 [state-dump] Subscriber: [state-dump] Channel WORKER_OBJECT_EVICTION [state-dump] - cumulative subscribe requests: 24702 [state-dump] - cumulative unsubscribe requests: 8 [state-dump] - active subscribed publishers: 1 [state-dump] - cumulative published messages: 8 [state-dump] - cumulative processed messages: 8 [state-dump] Channel WORKER_REF_REMOVED_CHANNEL [state-dump] - cumulative subscribe requests: 0 [state-dump] - cumulative unsubscribe requests: 0 [state-dump] - active subscribed publishers: 0 [state-dump] - cumulative published messages: 0 [state-dump] - cumulative processed messages: 0 [state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL [state-dump] - cumulative subscribe requests: 39517 [state-dump] - cumulative unsubscribe requests: 29632 [state-dump] - active subscribed publishers: 1 [state-dump] - cumulative published messages: 47483 [state-dump] - cumulative processed messages: 32677 [state-dump] num async plasma notifications: 0 [state-dump] Remote node managers: [state-dump] Event stats: [state-dump] Global stats: 353895 total (79 active) [state-dump] Queueing time: mean = 19.490 ms, max = 387.349 s, min = -0.000 s, total = 6897.369 s [state-dump] Execution time: mean = 9.490 ms, total = 3358.373 s [state-dump] Event stats: [state-dump] ObjectManager.ObjectAdded - 40319 total (0 active), Execution time: mean = 350.596 us, total = 14.136 s, Queueing time: mean = 3.771 ms, max = 927.645 ms, min = 1.599 us, total = 152.043 s [state-dump] ObjectManager.ObjectDeleted - 34600 total (0 active), Execution time: mean = 3.454 us, total = 119.514 ms, Queueing time: mean = 30.240 ms, max = 1.025 s, min = 8.823 us, total = 1046.301 s [state-dump] Subscriber.HandlePublishedMessage_WORKER_OBJECT_LOCATIONS_CHANNEL - 32677 total (0 active), Execution time: mean = 147.025 us, total = 4.804 s, Queueing time: mean = 75.026 ms, max = 712.568 ms, min = 63.484 us, total = 2451.636 s [state-dump] NodeManagerService.grpc_server.PinObjectIDs - 24702 total (0 active), Execution time: mean = 6.426 ms, total = 158.744 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.PinObjectIDs.HandleRequestImpl - 24702 total (0 active), Execution time: mean = 723.704 us, total = 17.877 s, Queueing time: mean = 5.482 ms, max = 927.276 ms, min = 2.264 us, total = 135.418 s [state-dump] CoreWorkerService.grpc_client.UpdateObjectLocationBatch.OnReplyReceived - 24315 total (0 active), Execution time: mean = 22.131 us, total = 538.125 ms, Queueing time: mean = 982.294 us, max = 1.025 s, min = 2.151 us, total 
= 23.884 s [state-dump] CoreWorkerService.grpc_client.UpdateObjectLocationBatch - 24315 total (0 active), Execution time: mean = 684.681 us, total = 16.648 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.PubsubCommandBatch - 24306 total (0 active), Execution time: mean = 638.580 us, total = 15.521 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.PubsubCommandBatch.OnReplyReceived - 24306 total (0 active), Execution time: mean = 26.304 us, total = 639.346 ms, Queueing time: mean = 432.775 us, max = 1.183 s, min = 1.579 us, total = 10.519 s [state-dump] CoreWorkerService.grpc_client.RestoreSpilledObjects - 15600 total (4 active), Execution time: mean = 28.853 ms, total = 450.106 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.RestoreSpilledObjects.OnReplyReceived - 15596 total (0 active), Execution time: mean = 170.114 us, total = 2.653 s, Queueing time: mean = 1.451 ms, max = 501.009 ms, min = 1.459 us, total = 22.631 s [state-dump] NodeManagerService.grpc_server.ReportWorkerBacklog - 11542 total (0 active), Execution time: mean = 16.816 ms, total = 194.095 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.ReportWorkerBacklog.HandleRequestImpl - 11542 total (0 active), Execution time: mean = 36.374 us, total = 419.833 ms, Queueing time: mean = 16.492 ms, max = 1.059 s, min = 1.762 us, total = 190.353 s [state-dump] NodeManager.SpillObjects - 6676 total (1 active), Execution time: mean = 188.136 us, total = 1.256 s, Queueing time: mean = 25.744 ms, max = 1.025 s, min = 2.328 us, total = 171.867 s [state-dump] NodeManager.GlobalGC - 6676 total (1 active), Execution time: mean = 697.991 ns, total = 4.660 ms, Queueing time: mean = 25.742 ms, max = 1.025 s, min = 4.080 us, total = 171.856 s [state-dump] NodeManager.CheckGC - 4216 total (1 active), Execution time: mean = 83.181 us, total = 350.690 ms, Queueing time: mean = 13.818 ms, max = 1.117 s, min = -0.000 s, total = 58.256 s [state-dump] ObjectManager.UpdateAvailableMemory - 4215 total (0 active), Execution time: mean = 2.301 ms, total = 9.700 s, Queueing time: mean = 2.061 ms, max = 911.750 ms, min = 2.216 us, total = 8.688 s [state-dump] RaySyncer.OnDemandBroadcasting - 4215 total (1 active), Execution time: mean = 428.190 us, total = 1.805 s, Queueing time: mean = 13.501 ms, max = 1.117 s, min = 82.643 us, total = 56.906 s [state-dump] RaySyncer.BroadcastMessage - 3531 total (0 active), Execution time: mean = 219.941 us, total = 776.612 ms, Queueing time: mean = 1.353 us, max = 511.349 us, min = 156.000 ns, total = 4.779 ms [state-dump] - 3531 total (1 active), Execution time: mean = 1.624 us, total = 5.735 ms, Queueing time: mean = 2.937 ms, max = 1.027 s, min = 2.589 us, total = 10.369 s [state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 2258 total (1 active), Execution time: mean = 16.172 us, total = 36.517 ms, Queueing time: mean = 12.684 ms, max = 1.053 s, min = -0.000 s, total = 28.642 s [state-dump] CoreWorkerService.grpc_client.GetCoreWorkerStats - 2220 total (5 active), Execution time: mean = 207.490 ms, total = 460.628 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] 
CoreWorkerService.grpc_client.GetCoreWorkerStats.OnReplyReceived - 2215 total (0 active), Execution time: mean = 17.258 us, total = 38.227 ms, Queueing time: mean = 10.347 ms, max = 967.538 ms, min = 1.467 us, total = 22.919 s [state-dump] CoreWorkerService.grpc_client.LocalGC - 823 total (5 active), Execution time: mean = 440.409 ms, total = 362.457 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.LocalGC.OnReplyReceived - 818 total (0 active), Execution time: mean = 30.099 us, total = 24.621 ms, Queueing time: mean = 5.146 ms, max = 308.246 ms, min = 7.799 us, total = 4.209 s [state-dump] NodeManagerService.grpc_server.GetResourceLoad.HandleRequestImpl - 475 total (0 active), Execution time: mean = 91.509 us, total = 43.467 ms, Queueing time: mean = 15.303 ms, max = 731.127 ms, min = 3.877 us, total = 7.269 s [state-dump] NodeManagerService.grpc_server.GetResourceLoad - 475 total (0 active), Execution time: mean = 15.682 ms, total = 7.449 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManager.ScheduleAndDispatchTasks - 469 total (1 active), Execution time: mean = 348.364 us, total = 163.383 ms, Queueing time: mean = 24.594 ms, max = 763.148 ms, min = -0.000 s, total = 11.535 s [state-dump] NodeManager.deadline_timer.spill_objects_when_over_threshold - 469 total (1 active), Execution time: mean = 391.263 us, total = 183.502 ms, Queueing time: mean = 24.543 ms, max = 762.978 ms, min = -0.000 s, total = 11.511 s [state-dump] NodeManager.deadline_timer.flush_free_objects - 468 total (1 active), Execution time: mean = 1.755 ms, total = 821.451 ms, Queueing time: mean = 24.152 ms, max = 762.977 ms, min = 110.743 us, total = 11.303 s [state-dump] ClusterResourceManager.ResetRemoteNodeView - 159 total (1 active), Execution time: mean = 8.141 us, total = 1.294 ms, Queueing time: mean = 25.493 ms, max = 1.070 s, min = 89.436 us, total = 4.053 s [state-dump] ClientConnection.async_read.ProcessMessageHeader - 148 total (29 active), Execution time: mean = 13.389 us, total = 1.982 ms, Queueing time: mean = 15.137 s, max = 387.349 s, min = 8.087 us, total = 2240.230 s [state-dump] ClientConnection.async_read.ProcessMessage - 119 total (0 active), Execution time: mean = 6.902 ms, total = 821.373 ms, Queueing time: mean = 16.078 ms, max = 871.154 ms, min = 2.864 us, total = 1.913 s [state-dump] CoreWorkerService.grpc_client.PubsubLongPolling - 114 total (1 active), Execution time: mean = 4.011 s, total = 457.201 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.PubsubLongPolling.OnReplyReceived - 113 total (0 active), Execution time: mean = 1.176 ms, total = 132.833 ms, Queueing time: mean = 89.132 ms, max = 1.178 s, min = 8.429 us, total = 10.072 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.CheckAlive - 96 total (0 active), Execution time: mean = 1.029 ms, total = 98.798 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.CheckAlive.OnReplyReceived - 96 total (0 active), Execution time: mean = 41.960 us, total = 4.028 ms, Queueing time: mean = 7.564 ms, max = 703.496 ms, min = 3.812 us, total = 726.177 ms [state-dump] NodeManager.GcsCheckAlive - 96 total (1 active), Execution time: mean = 261.187 us, total = 25.074 ms, Queueing time: mean = 41.860 ms, max = 722.902 ms, min = 
184.407 us, total = 4.019 s [state-dump] NodeManager.deadline_timer.record_metrics - 96 total (1 active), Execution time: mean = 354.789 us, total = 34.060 ms, Queueing time: mean = 41.782 ms, max = 722.885 ms, min = 242.923 us, total = 4.011 s [state-dump] NodeManagerService.grpc_server.GetNodeStats - 88 total (1 active), Execution time: mean = 1.710 s, total = 150.510 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.GetNodeStats.HandleRequestImpl - 88 total (0 active), Execution time: mean = 6.723 ms, total = 591.644 ms, Queueing time: mean = 11.016 ms, max = 281.735 ms, min = 7.991 us, total = 969.405 ms [state-dump] NodeManager.deadline_timer.debug_state_dump - 48 total (1 active), Execution time: mean = 3.090 ms, total = 148.338 ms, Queueing time: mean = 58.900 ms, max = 723.917 ms, min = -0.000 s, total = 2.827 s [state-dump] CoreWorkerService.grpc_client.SpillObjects - 43 total (1 active), Execution time: mean = 3.862 s, total = 166.045 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.SpillObjects.OnReplyReceived - 42 total (0 active), Execution time: mean = 111.446 ms, total = 4.681 s, Queueing time: mean = 62.842 ms, max = 387.449 ms, min = 8.059 us, total = 2.639 s [state-dump] NodeManagerService.grpc_server.RequestWorkerLease.HandleRequestImpl - 37 total (0 active), Execution time: mean = 11.567 ms, total = 427.961 ms, Queueing time: mean = 305.619 ms, max = 578.807 ms, min = 10.398 us, total = 11.308 s [state-dump] NodeManagerService.grpc_server.RequestWorkerLease - 37 total (17 active), Execution time: mean = 22.799 s, total = 843.559 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ClientConnection.async_write.DoAsyncWrites - 30 total (0 active), Execution time: mean = 1.728 us, total = 51.845 us, Queueing time: mean = 8.521 ms, max = 123.820 ms, min = 67.728 us, total = 255.616 ms [state-dump] NodeManagerService.grpc_server.GetSystemConfig - 29 total (0 active), Execution time: mean = 69.950 ms, total = 2.029 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.GetSystemConfig.HandleRequestImpl - 29 total (0 active), Execution time: mean = 65.678 us, total = 1.905 ms, Queueing time: mean = 69.595 ms, max = 545.721 ms, min = 12.926 us, total = 2.018 s [state-dump] WorkerPool.PopWorkerCallback - 20 total (0 active), Execution time: mean = 24.273 ms, total = 485.460 ms, Queueing time: mean = 87.041 ms, max = 824.407 ms, min = 50.721 us, total = 1.741 s [state-dump] NodeManagerService.grpc_server.ReturnWorker - 17 total (0 active), Execution time: mean = 161.319 ms, total = 2.742 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.ReturnWorker.HandleRequestImpl - 17 total (0 active), Execution time: mean = 60.500 ms, total = 1.029 s, Queueing time: mean = 100.602 ms, max = 466.690 ms, min = 10.003 us, total = 1.710 s [state-dump] PeriodicalRunner.RunFnPeriodically - 12 total (0 active), Execution time: mean = 314.958 us, total = 3.780 ms, Queueing time: mean = 21.744 ms, max = 113.592 ms, min = 40.100 us, total = 260.934 ms [state-dump] Subscriber.HandlePublishedMessage_WORKER_OBJECT_EVICTION - 8 total (0 active), Execution time: mean = 126.471 us, total = 1.012 ms, Queueing time: mean = 200.971 us, 
max = 570.158 us, min = 82.400 us, total = 1.608 ms [state-dump] NodeManager.deadline_timer.print_event_loop_stats - 8 total (1 active, 1 running), Execution time: mean = 2.299 ms, total = 18.390 ms, Queueing time: mean = 8.998 ms, max = 18.059 ms, min = 2.549 ms, total = 71.981 ms [state-dump] CoreWorkerService.grpc_client.DeleteSpilledObjects - 6 total (0 active), Execution time: mean = 243.350 ms, total = 1.460 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.DeleteSpilledObjects.OnReplyReceived - 6 total (0 active), Execution time: mean = 31.829 us, total = 190.973 us, Queueing time: mean = 69.500 ms, max = 416.933 ms, min = 11.997 us, total = 417.002 ms [state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 2 total (1 active), Execution time: mean = 2.070 s, total = 4.141 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] RaySyncerRegister - 2 total (0 active), Execution time: mean = 2.900 us, total = 5.800 us, Queueing time: mean = 750.000 ns, max = 1.200 us, min = 300.000 ns, total = 1.500 us [state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 2 total (0 active), Execution time: mean = 137.350 us, total = 274.700 us, Queueing time: mean = 1.840 ms, max = 3.445 ms, min = 235.500 us, total = 3.681 ms [state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), Execution time: mean = 538.150 us, total = 1.076 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), Execution time: mean = 558.200 us, total = 558.200 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), Execution time: mean = 646.700 us, total = 646.700 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 904.800 us, total = 904.800 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 127.421 ms, total = 127.421 ms, Queueing time: mean = 12.800 us, max = 12.800 us, min = 12.800 us, total = 12.800 us [state-dump] ray::rpc::JobInfoGcsService.grpc_client.AddJob - 1 total (0 active), Execution time: mean = 946.834 us, total = 946.834 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 975.800 us, total = 975.800 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll.OnReplyReceived - 1 total (0 active), Execution time: mean = 192.163 us, total = 192.163 us, Queueing time: mean = 11.301 us, max = 11.301 us, min = 11.301 us, total = 11.301 us [state-dump] ray::rpc::JobInfoGcsService.grpc_client.AddJob.OnReplyReceived - 1 total (0 active), Execution time: mean = 73.135 us, total = 73.135 us, Queueing time: mean = 185.466 us, max = 185.466 us, min 
= 185.466 us, total = 185.466 us [state-dump] NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 21.800 us, total = 21.800 us, Queueing time: mean = 18.000 us, max = 18.000 us, min = 18.000 us, total = 18.000 us [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 376.600 us, total = 376.600 us, Queueing time: mean = 79.500 us, max = 79.500 us, min = 79.500 us, total = 79.500 us [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 227.800 us, total = 227.800 us, Queueing time: mean = 20.000 us, max = 20.000 us, min = 20.000 us, total = 20.000 us [state-dump] Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 1 total (0 active), Execution time: mean = 50.844 us, total = 50.844 us, Queueing time: mean = 174.838 us, max = 174.838 us, min = 174.838 us, total = 174.838 us [state-dump] DebugString() time ms: 3 [state-dump] [state-dump] [2024-08-05 17:47:38,742 I 9128 18496] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 74104 MiB, 24699 objects, write throughput 756 MiB/s. [2024-08-05 17:47:38,746 I 9128 32964] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000022AFDE70000, 17179869192) [2024-08-05 17:47:38,746 I 9128 18496] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-05 17:47:40,993 I 9128 32964] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2221201141 [2024-08-05 17:47:41,002 I 9128 18496] (raylet.exe) local_object_manager.cc:490: Restored 26790 MiB, 15597 objects, read throughput 210 MiB/s [2024-08-05 17:47:41,437 I 9128 18496] (raylet.exe) client.cc:266: Erasing re-used mmap entry for fd 000000000000082C [2024-08-05 17:47:41,526 I 9128 18496] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-05 17:47:43,806 I 9128 18496] (raylet.exe) node_manager.cc:656: Sending Python GC request to 29 local workers to clean up Python cyclic references. [2024-08-05 17:47:46,401 I 9128 18496] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 76223 MiB, 24700 objects, write throughput 740 MiB/s. [2024-08-05 17:47:46,508 I 9128 18496] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-05 17:47:48,416 I 9128 32964] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2221201141 [2024-08-05 17:47:48,473 C 9128 32964] (raylet.exe) dlmalloc.cc:129: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1450 *** StackTrace Information *** unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown unknown

Even more annoying is that this did not happen a few hours ago; I could successfully save the parquet file then. When I checked the tmp folder, I realised that the spilled objects folder is exactly 64 GB in size, which seemed strange to me, especially because the log message sounds like it might have something to do with missing space.

Can I somehow set the max size of spilled objects? Why has it spilled more than the size of the data? Are some of the spills from prior operations and not cleaned up?

What's going on?
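For what it's worth, I have not found a Modin-level knob for the spill size; the spill directory and the disk-usage threshold appear to be Ray-level settings passed at ray.init() time. A rough sketch based on Ray's documented object-spilling options (the path and threshold are placeholders):

import json
import ray

ray.init(
    _system_config={
        # Spill to a specific directory instead of the default temp location.
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "D:/ray_spill"}}  # placeholder path
        ),
        # Fraction of the spill disk Ray may fill before spilling fails.
        "local_fs_capacity_threshold": 0.95,
    }
)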

Liquidmasl commented 1 month ago

Also: I don't get log messages for spills, which made me think that I don't spill, but the existence of the Ray spill folder suggests that I do.

Also, I upped the Modin memory to 100 GB using the env var MODIN_MEMORY. Now it did not crash at 64 GB, but it spilled 88 GB, which is around 4x the size of the completed .parquet folder (in which the parts are compressed with snappy).

Why is it spilling so much?

Liquidmasl commented 1 month ago

Never mind; today everything that worked yesterday does not work anymore.

I can't seem to be able to save with to_parquet(). I always get:

(raylet.exe) dlmalloc.cc:129:  Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455
[2024-08-06 15:07:29,442 I 42036 28692] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2221201143
[2024-08-06 15:07:29,498 I 42036 19508] (raylet.exe) local_object_manager.cc:490: Restored 19095 MiB, 11117 objects, read throughput 233 MiB/s
[2024-08-06 15:07:30,297 I 42036 19508] (raylet.exe) client.cc:266: Erasing re-used mmap entry for fd 0000000000000834
[2024-08-06 15:07:30,342 I 42036 19508] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-06 15:07:32,413 I 42036 19508] (raylet.exe) node_manager.cc:656: Sending Python GC request to 29 local workers to clean up Python cyclic references.
[2024-08-06 15:07:36,110 I 42036 28692] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000017C2F150000, 2281701384)
[2024-08-06 15:07:36,355 I 42036 19508] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 69868 MiB, 24697 objects, write throughput 851 MiB/s.
[2024-08-06 15:07:36,356 I 42036 19508] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-06 15:07:36,360 I 42036 28692] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000017D3F170000, 4294967304)
[2024-08-06 15:07:38,633 I 42036 28692] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2221201143
[2024-08-06 15:07:38,634 I 42036 28692] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2221201141
[2024-08-06 15:07:38,664 I 42036 28692] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2221201141
[2024-08-06 15:07:39,689 C 42036 28692] (raylet.exe) dlmalloc.cc:129:  Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455

I expect the shared memory to get full and Ray to spill. Basically that's why I use Modin: to be able to work out of core and spill.

But it seems something isn't working out. I have been sweating for 2 days now just to save and load parquets and haven't even arrived at processing the data. Please save me.

Liquidmasl commented 1 month ago

It turns out MODIN_MEMORY does absolutely nothing if Ray is initialised manually. It seems that the few times it worked were just because, at the moment Ray was initialised, my virtual memory happened to be larger than at other times (it is dynamic by default on Windows, if I am not mistaken), so pure coincidence.

BUT:

If Ray is initialised by Modin, the env var is used, but it is set as both _memory AND object_store_memory, which leads to my PC freezing and dying.

Also, _memory can be larger than the shared memory, while object_store_memory cannot! So it actually is very debilitating that Modin initializes Ray like that!

I feel like those two values should not be set to the same value, not by default and honestly also not as a fallback! I will open ANOTHER issue for this, as I think this is not intended, or should be changed.
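For comparison, when Ray is initialised manually the two limits can be set independently; a minimal sketch with placeholder sizes:

import ray

ray.init(
    object_store_memory=16 * 1024**3,  # placeholder; must fit in shared memory / the page file
    _memory=100 * 1024**3,             # placeholder; overall task memory budget, can be larger
)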

Liquidmasl commented 1 month ago

This seems to be the issue here: https://github.com/modin-project/modin/issues/7361

Liquidmasl commented 1 month ago

Well... it's not; I still run into this issue.

Retribution98 commented 1 month ago

Hi @Liquidmasl

Thanks for your attention to Modin.

As I see from your reproducer in issue #7359, you are trying to work with a large DataFrame that is bigger than the available memory. You are getting this error because a Ray worker tried to get part of your DataFrame, but there is not enough memory to do so.

Introduction

First of all, please read the docs article "How Modin splits a DataFrame" to have a clearer understanding of the next steps.

Let's deal with your pipeline:

  1. You create Modin DataFrames with 1,000,000 rows and 12 columns in a loop. This means that each one has a partition shape of (NPartitions, 1) and all partitions are placed in the Ray Plasma store. If there is not enough memory to store them all in the Plasma store, they are automatically spilled to Ray's spill storage on disk.
  2. You merge the sub-DataFrames into one large DataFrame. Here Modin combines the partitions from the sub-DataFrames into virtual partitions, so you again end up with a partition shape of (NPartitions, 1). A virtual partition runs as a single partition but wraps multiple internal partitions before the function is applied. In your example, each of the virtual partitions has 1200 internal partitions!
  3. You apply the to_parquet function. Here, each of the row partitions is sent to a Ray worker and executed in parallel. This means that the entire large DataFrame is loaded into RAM at the same time (see the sketch after this list).
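A condensed sketch of the pipeline being described (the loader, batch count, and output path are placeholders; the per-batch shape follows the description above):

import modin.pandas as pd
import numpy as np

# Placeholder loader standing in for the real batch reader:
# each sub-DataFrame is 1,000,000 rows x 12 columns, partitioned (NPartitions, 1).
def load_batch(i):
    return pd.DataFrame(np.random.rand(1_000_000, 12))

batches = [load_batch(i) for i in range(10)]  # batch count is a placeholder

# Concatenation stacks the row partitions into virtual partitions: the result
# is still (NPartitions, 1), but each virtual partition wraps many internal ones.
df = pd.concat(batches, ignore_index=True)

# to_parquet() dispatches one task per row partition, so all of them end up
# materialized in worker memory at roughly the same time.
df.to_parquet("out.parquet")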

Suggested solution

To avoid this, you can increase the value of NPartitions. For example, if you set NPartitions to CPU count * 2, only half of your large DataFrame will be processed simultaneously.

NOTE: A large number of NPartitions can lead to slower performance, but sometimes it is the only way.

You can modify NPartitions globally:

import modin.config as cfg

cfg.NPartitions.put(100)

or locally:

import modin.config as cfg

with cfg.context(NPartitions=100):
    .... some code here
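Putting the tip together (a sketch; CpuCount is Modin's config for the detected CPU count, and the data here is a placeholder):

import modin.config as cfg
import modin.pandas as pd

# Follow the "CPU count * 2" suggestion above before building the large frame.
cfg.NPartitions.put(cfg.CpuCount.get() * 2)

batches = [pd.DataFrame({"x": range(1_000)}) for _ in range(4)]  # placeholder data
df = pd.concat(batches, ignore_index=True)
df.to_parquet("out.parquet")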

Important note

However, this tip does not always help, because some DataFrame operations require loading all partitions along the full axis. Unfortunately, we don't have enough resources right now to fix this quickly, but it's a good task for the future.

Liquidmasl commented 1 month ago

That does make sense, but with respect to the log output it also kind of does not. This error fairly clearly appears when Ray tries to spill to the hard drive, not when it writes into memory.

Anyway, I am currently trying it with more partitions!

Result:

It does not help. I tried with up to 120 partitions; everything worked the same (maybe a bit slower), but on saving the same error occurred. Meanwhile, my RAM did not even reach 60%.

raylet.out tail:

[2024-08-08 13:43:36,028 I 42024 41500] (raylet.exe) node_manager.cc:525: [state-dump] NodeManager: [state-dump] Node ID: 0cac365450a301fc1a0c0c508d70de625072ea9737813cc546fa6631 [state-dump] Node name: 127.0.0.1 [state-dump] InitialConfigResources: {accelerator_type:G: 10000, object_store_memory: 154834710520000, GPU: 10000, memory: 2684354560000000, node:__internal_head__: 10000, CPU: 200000, node:127.0.0.1: 10000} [state-dump] ClusterTaskManager: [state-dump] ========== Node: 0cac365450a301fc1a0c0c508d70de625072ea9737813cc546fa6631 ================= [state-dump] Infeasible queue length: 0 [state-dump] Schedule queue length: 0 [state-dump] Dispatch queue length: 19 [state-dump] num_waiting_for_resource: 0 [state-dump] num_waiting_for_plasma_memory: 0 [state-dump] num_waiting_for_remote_node_resources: 0 [state-dump] num_worker_not_started_by_job_config_not_exist: 0 [state-dump] num_worker_not_started_by_registration_timeout: 0 [state-dump] num_tasks_waiting_for_workers: 19 [state-dump] num_cancelled_tasks: 0 [state-dump] cluster_resource_scheduler state: [state-dump] Local id: -5160272000863189213 Local resources: {"total":{CPU: [200000], GPU: [10000], memory: [2684354560000000], object_store_memory: [154834710520000], node:__internal_head__: [10000], node:127.0.0.1: [10000], accelerator_type:G: [10000]}}, "available": {CPU: [10000], GPU: [10000], memory: [2684354560000000], object_store_memory: [83027139280000], node:__internal_head__: [10000], node:127.0.0.1: [10000], accelerator_type:G: [10000]}}, "labels":{"ray.io/node_id":"0cac365450a301fc1a0c0c508d70de625072ea9737813cc546fa6631",} is_draining: 0 is_idle: 0 Cluster resources: node id: -5160272000863189213{"total":{GPU: 10000, node:__internal_head__: 10000, memory: 2684354560000000, node:127.0.0.1: 10000, accelerator_type:G: 10000, object_store_memory: 154834710520000, CPU: 200000}}, "available": {node:__internal_head__: 10000, CPU: 10000, node:127.0.0.1: 10000, accelerator_type:G: 10000, object_store_memory: 83027139280000, memory: 2684354560000000, GPU: 10000}}, "labels":{"ray.io/node_id":"0cac365450a301fc1a0c0c508d70de625072ea9737813cc546fa6631",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1} { "placment group locations": [], "node to bundles": []} [state-dump] Waiting tasks size: 101 [state-dump] Number of executing tasks: 19 [state-dump] Number of pinned task arguments: 23448 [state-dump] Number of total spilled tasks: 0 [state-dump] Number of spilled waiting tasks: 0 [state-dump] Number of spilled unschedulable tasks: 0 [state-dump] Resource usage { [state-dump] } [state-dump] Running tasks by scheduling class: [state-dump] - {depth=1 function_descriptor={type=PythonFunctionDescriptor, module_name=modin.core.execution.ray.implementations.pandas_on_ray.partitioning.virtual_partition, class_name=, function_name=_deploy_ray_func, function_hash=73d18cd9fd8147f3a86c6c6a91295ef2} scheduling_strategy=default_scheduling_strategy { [state-dump] } [state-dump] resource_set={CPU : 1, }}: 19/20 [state-dump] ================================================== [state-dump] [state-dump] ClusterResources: [state-dump] LocalObjectManager: [state-dump] - num pinned objects: 23924 [state-dump] - pinned objects size: 7180757124 [state-dump] - num objects pending restore: 4936 [state-dump] - num objects pending spill: 0 [state-dump] - num bytes pending spill: 0 [state-dump] - num bytes currently spilled: 37385042614 [state-dump] - cumulative spill requests: 124161 [state-dump] - cumulative restore requests: 0 [state-dump] - spilled 
objects pending delete: 0 [state-dump] [state-dump] ObjectManager: [state-dump] - num local objects: 45234 [state-dump] - num unfulfilled push requests: 0 [state-dump] - num object pull requests: 148082 [state-dump] - num chunks received total: 0 [state-dump] - num chunks received failed (all): 0 [state-dump] - num chunks received failed / cancelled: 0 [state-dump] - num chunks received failed / plasma error: 0 [state-dump] Event stats: [state-dump] Global stats: 0 total (0 active) [state-dump] Queueing time: mean = -nan(ind) s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] Execution time: mean = -nan(ind) s, total = 0.000 s [state-dump] Event stats: [state-dump] PushManager: [state-dump] - num pushes in flight: 0 [state-dump] - num chunks in flight: 0 [state-dump] - num chunks remaining: 0 [state-dump] - max chunks allowed: 409 [state-dump] OwnershipBasedObjectDirectory: [state-dump] - num listeners: 148082 [state-dump] - cumulative location updates: 5719 [state-dump] - num location updates per second: 407.860 [state-dump] - num location lookups per second: 0.000 [state-dump] - num locations added per second: 0.661 [state-dump] - num locations removed per second: 0.000 [state-dump] BufferPool: [state-dump] - create buffer state map size: 0 [state-dump] PullManager: [state-dump] - num bytes available for pulled objects: 8302713928 [state-dump] - num bytes being pulled (all): 1486258332 [state-dump] - num bytes being pulled / pinned: 9906 [state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable} [state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable} [state-dump] - task request bundles: BundlePullRequestQueue{120 total, 4 active, 0 inactive, 116 unpullable} [state-dump] - first get request bundle: N/A [state-dump] - first wait request bundle: N/A [state-dump] - first task request bundle: 1236 objects (inactive, waiting for object sizes or locations) [state-dump] - num objects queued: 148082 [state-dump] - num objects actively pulled (all): 4938 [state-dump] - num objects actively pulled / pinned: 2 [state-dump] - num bundles being pulled: 4 [state-dump] - num pull retries: 0 [state-dump] - max timeout seconds: 10 [state-dump] - max timeout object id: 00ffffffffffffffffffffffffffffffffffffff0100000001e1f505 [state-dump] - max timeout object: ObjectPullRequest{locations: [], spilled url: C:\Users\MARCEL~1\AppData\Local\Temp\ray\session_2024-08-08_13-36-28_633854_22036\ray_spilled_objects_0cac365450a301fc1a0c0c508d70de625072ea9737813cc546fa6631\e79cdd7b42bd4f45a7d18561e9c8ba4b-multi-1999?offset=575007314&size=301222, spilled node id: 0cac365450a301fc1a0c0c508d70de625072ea9737813cc546fa6631, pending creation: 0, object size set: 1, object size: 301123, num of retries: 1} [state-dump] - example obj id pending pull: 00ffffffffffffffffffffffffffffffffffffff01000000feeff505 [state-dump] [state-dump] WorkerPool: [state-dump] - registered jobs: 1 [state-dump] - process_failed_job_config_missing: 0 [state-dump] - process_failed_rate_limited: 0 [state-dump] - process_failed_pending_registration: 0 [state-dump] - process_failed_runtime_env_setup_failed: 0 [state-dump] - num PYTHON workers: 24 [state-dump] - num PYTHON drivers: 1 [state-dump] - num object spill callbacks queued: 0 [state-dump] - num object restore queued: 4936 [state-dump] - num util functions queued: 0 [state-dump] - num idle workers: 1 [state-dump] TaskDependencyManager: [state-dump] - task deps map size: 120 
[state-dump] - get req map size: 0 [state-dump] - wait req map size: 0 [state-dump] - local objects map size: 45234 [state-dump] WaitManager: [state-dump] - num active wait requests: 0 [state-dump] Subscriber: [state-dump] Channel WORKER_REF_REMOVED_CHANNEL [state-dump] - cumulative subscribe requests: 0 [state-dump] - cumulative unsubscribe requests: 0 [state-dump] - active subscribed publishers: 0 [state-dump] - cumulative published messages: 0 [state-dump] - cumulative processed messages: 0 [state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL [state-dump] - cumulative subscribe requests: 148082 [state-dump] - cumulative unsubscribe requests: 0 [state-dump] - active subscribed publishers: 1 [state-dump] - cumulative published messages: 7704 [state-dump] - cumulative processed messages: 7704 [state-dump] Channel WORKER_OBJECT_EVICTION [state-dump] - cumulative subscribe requests: 148085 [state-dump] - cumulative unsubscribe requests: 0 [state-dump] - active subscribed publishers: 1 [state-dump] - cumulative published messages: 0 [state-dump] - cumulative processed messages: 0 [state-dump] num async plasma notifications: 0 [state-dump] Remote node managers: [state-dump] Event stats: [state-dump] Global stats: 1153937 total (2228 active) [state-dump] Queueing time: mean = 12.470 ms, max = 3.510 s, min = -0.001 s, total = 14389.233 s [state-dump] Execution time: mean = 3.506 ms, total = 4045.684 s [state-dump] Event stats: [state-dump] ObjectManager.ObjectAdded - 148106 total (0 active), Execution time: mean = 127.767 us, total = 18.923 s, Queueing time: mean = 34.620 ms, max = 1.551 s, min = 1.645 us, total = 5127.458 s [state-dump] NodeManagerService.grpc_server.PinObjectIDs - 148085 total (0 active), Execution time: mean = 18.597 ms, total = 2753.970 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.PinObjectIDs.HandleRequestImpl - 148085 total (0 active), Execution time: mean = 520.464 us, total = 77.073 s, Queueing time: mean = 17.922 ms, max = 1.551 s, min = 1.861 us, total = 2653.945 s [state-dump] CoreWorkerService.grpc_client.UpdateObjectLocationBatch.OnReplyReceived - 137936 total (0 active), Execution time: mean = 16.989 us, total = 2.343 s, Queueing time: mean = 569.503 us, max = 1.416 s, min = 1.996 us, total = 78.555 s [state-dump] CoreWorkerService.grpc_client.UpdateObjectLocationBatch - 137936 total (0 active), Execution time: mean = 435.116 us, total = 60.018 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.PubsubCommandBatch - 137852 total (1 active), Execution time: mean = 436.470 us, total = 60.168 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.PubsubCommandBatch.OnReplyReceived - 137851 total (0 active), Execution time: mean = 19.882 us, total = 2.741 s, Queueing time: mean = 203.670 us, max = 1.540 s, min = 1.912 us, total = 28.076 s [state-dump] ObjectManager.ObjectDeleted - 102872 total (0 active), Execution time: mean = 4.118 us, total = 423.650 ms, Queueing time: mean = 14.680 ms, max = 38.958 ms, min = 11.829 us, total = 1510.195 s [state-dump] NodeManagerService.grpc_server.ReportWorkerBacklog - 9870 total (10 active), Execution time: mean = 49.507 ms, total = 488.632 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] 
NodeManagerService.grpc_server.ReportWorkerBacklog.HandleRequestImpl - 9870 total (10 active), Execution time: mean = 17.844 us, total = 176.125 ms, Queueing time: mean = 49.300 ms, max = 1.547 s, min = 1.520 us, total = 486.591 s [state-dump] Subscriber.HandlePublishedMessage_WORKER_OBJECT_LOCATIONS_CHANNEL - 7704 total (2000 active), Execution time: mean = 226.027 us, total = 1.741 s, Queueing time: mean = 515.599 ms, max = 1.224 s, min = 19.347 ms, total = 3972.173 s [state-dump] ObjectManager.UpdateAvailableMemory - 3591 total (1 active), Execution time: mean = 3.176 us, total = 11.405 ms, Queueing time: mean = 4.429 ms, max = 1.320 s, min = 2.441 us, total = 15.904 s [state-dump] NodeManager.CheckGC - 3591 total (1 active), Execution time: mean = 17.914 us, total = 64.331 ms, Queueing time: mean = 17.390 ms, max = 1.624 s, min = -0.000 s, total = 62.448 s [state-dump] RaySyncer.OnDemandBroadcasting - 3591 total (1 active), Execution time: mean = 244.897 us, total = 879.425 ms, Queueing time: mean = 17.164 ms, max = 1.624 s, min = -0.000 s, total = 61.636 s [state-dump] RaySyncer.BroadcastMessage - 3313 total (0 active), Execution time: mean = 158.429 us, total = 524.875 ms, Queueing time: mean = 849.278 ns, max = 26.226 us, min = 283.000 ns, total = 2.814 ms [state-dump] - 3313 total (1 active), Execution time: mean = 480.202 ns, total = 1.591 ms, Queueing time: mean = 3.954 ms, max = 1.213 s, min = 2.454 us, total = 13.100 s [state-dump] CoreWorkerService.grpc_client.GetCoreWorkerStats - 1962 total (0 active), Execution time: mean = 3.186 ms, total = 6.250 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.GetCoreWorkerStats.OnReplyReceived - 1962 total (25 active), Execution time: mean = 11.250 us, total = 22.073 ms, Queueing time: mean = 33.133 ms, max = 972.978 ms, min = 1.723 us, total = 65.007 s [state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 1942 total (1 active), Execution time: mean = 12.060 us, total = 23.420 ms, Queueing time: mean = 17.144 ms, max = 1.511 s, min = -0.001 s, total = 33.294 s [state-dump] CoreWorkerService.grpc_client.LocalGC - 542 total (0 active), Execution time: mean = 83.064 ms, total = 45.021 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.LocalGC.OnReplyReceived - 542 total (0 active), Execution time: mean = 67.006 us, total = 36.317 ms, Queueing time: mean = 16.712 ms, max = 1.547 s, min = 4.060 us, total = 9.058 s [state-dump] NodeManagerService.grpc_server.GetResourceLoad.HandleRequestImpl - 416 total (0 active), Execution time: mean = 81.545 us, total = 33.923 ms, Queueing time: mean = 47.093 ms, max = 1.372 s, min = 2.222 us, total = 19.591 s [state-dump] NodeManagerService.grpc_server.GetResourceLoad - 416 total (0 active), Execution time: mean = 47.362 ms, total = 19.702 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManager.deadline_timer.spill_objects_when_over_threshold - 404 total (1 active), Execution time: mean = 5.291 us, total = 2.138 ms, Queueing time: mean = 45.807 ms, max = 1.516 s, min = -0.000 s, total = 18.506 s [state-dump] NodeManager.ScheduleAndDispatchTasks - 404 total (1 active), Execution time: mean = 6.679 us, total = 2.698 ms, Queueing time: mean = 45.818 ms, max = 1.516 s, min = -0.000 s, total = 18.510 s [state-dump] NodeManager.deadline_timer.flush_free_objects - 404 total (1 
active), Execution time: mean = 4.052 us, total = 1.637 ms, Queueing time: mean = 45.808 ms, max = 1.516 s, min = -0.000 s, total = 18.506 s [state-dump] ClusterResourceManager.ResetRemoteNodeView - 139 total (1 active), Execution time: mean = 5.786 us, total = 804.186 us, Queueing time: mean = 54.044 ms, max = 1.278 s, min = 255.321 us, total = 7.512 s [state-dump] NodeManagerService.grpc_server.RequestWorkerLease.HandleRequestImpl - 120 total (0 active), Execution time: mean = 20.868 ms, total = 2.504 s, Queueing time: mean = 1.005 s, max = 1.526 s, min = 13.944 us, total = 120.543 s [state-dump] NodeManagerService.grpc_server.RequestWorkerLease - 120 total (120 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ClientConnection.async_read.ProcessMessageHeader - 96 total (25 active), Execution time: mean = 8.567 us, total = 822.419 us, Queueing time: mean = 43.132 ms, max = 3.510 s, min = 57.500 us, total = 4.141 s [state-dump] NodeManager.deadline_timer.record_metrics - 84 total (1 active), Execution time: mean = 374.128 us, total = 31.427 ms, Queueing time: mean = 58.742 ms, max = 1.097 s, min = -0.001 s, total = 4.934 s [state-dump] NodeManager.GcsCheckAlive - 84 total (1 active), Execution time: mean = 200.007 us, total = 16.801 ms, Queueing time: mean = 58.897 ms, max = 1.097 s, min = -0.000 s, total = 4.947 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.CheckAlive.OnReplyReceived - 84 total (0 active), Execution time: mean = 27.872 us, total = 2.341 ms, Queueing time: mean = 45.771 ms, max = 1.319 s, min = 3.903 us, total = 3.845 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.CheckAlive - 84 total (0 active), Execution time: mean = 969.758 us, total = 81.460 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.GetNodeStats - 83 total (1 active), Execution time: mean = 80.732 ms, total = 6.701 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.GetNodeStats.HandleRequestImpl - 83 total (0 active), Execution time: mean = 2.334 ms, total = 193.692 ms, Queueing time: mean = 38.283 ms, max = 1.104 s, min = 4.055 us, total = 3.177 s [state-dump] ClientConnection.async_read.ProcessMessage - 71 total (0 active), Execution time: mean = 5.734 ms, total = 407.092 ms, Queueing time: mean = 52.471 us, max = 696.357 us, min = 3.095 us, total = 3.725 ms [state-dump] CoreWorkerService.grpc_client.SpillObjects - 65 total (0 active), Execution time: mean = 900.436 ms, total = 58.528 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] CoreWorkerService.grpc_client.SpillObjects.OnReplyReceived - 65 total (0 active), Execution time: mean = 252.959 ms, total = 16.442 s, Queueing time: mean = 515.010 ms, max = 1.112 s, min = 16.481 us, total = 33.476 s [state-dump] NodeManager.deadline_timer.debug_state_dump - 42 total (1 active), Execution time: mean = 2.184 ms, total = 91.725 ms, Queueing time: mean = 85.780 ms, max = 1.067 s, min = 152.746 us, total = 3.603 s [state-dump] NodeManagerService.grpc_server.GetSystemConfig - 29 total (0 active), Execution time: mean = 166.859 ms, total = 4.839 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManagerService.grpc_server.GetSystemConfig.HandleRequestImpl - 29 total (0 
active), Execution time: mean = 62.147 us, total = 1.802 ms, Queueing time: mean = 166.557 ms, max = 1.265 s, min = 8.546 us, total = 4.830 s [state-dump] ClientConnection.async_write.DoAsyncWrites - 26 total (0 active), Execution time: mean = 1.814 us, total = 47.173 us, Queueing time: mean = 170.127 us, max = 656.537 us, min = 64.104 us, total = 4.423 ms [state-dump] WorkerPool.PopWorkerCallback - 19 total (19 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] PeriodicalRunner.RunFnPeriodically - 12 total (0 active), Execution time: mean = 371.158 us, total = 4.454 ms, Queueing time: mean = 24.624 ms, max = 122.793 ms, min = 54.300 us, total = 295.489 ms [state-dump] CoreWorkerService.grpc_client.PubsubLongPolling - 7 total (1 active), Execution time: mean = 58.987 s, total = 412.911 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] NodeManager.deadline_timer.print_event_loop_stats - 7 total (1 active, 1 running), Execution time: mean = 1.564 ms, total = 10.951 ms, Queueing time: mean = 160.022 ms, max = 1.084 s, min = 362.164 us, total = 1.120 s [state-dump] CoreWorkerService.grpc_client.PubsubLongPolling.OnReplyReceived - 7 total (1 active), Execution time: mean = 3.316 ms, total = 23.213 ms, Queueing time: mean = 605.603 ms, max = 1.542 s, min = 16.063 ms, total = 4.239 s [state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), Execution time: mean = 598.900 us, total = 1.198 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 2 total (1 active), Execution time: mean = 1.975 s, total = 3.949 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 2 total (0 active), Execution time: mean = 224.300 us, total = 448.600 us, Queueing time: mean = 2.151 ms, max = 4.108 ms, min = 194.400 us, total = 4.303 ms [state-dump] RaySyncerRegister - 2 total (0 active), Execution time: mean = 32.250 us, total = 64.500 us, Queueing time: mean = 2.600 us, max = 3.100 us, min = 2.100 us, total = 5.200 us [state-dump] NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), Execution time: mean = 693.000 us, total = 693.000 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll.OnReplyReceived - 1 total (0 active), Execution time: mean = 290.013 us, total = 290.013 us, Queueing time: mean = 16.229 us, max = 16.229 us, min = 16.229 us, total = 16.229 us [state-dump] ray::rpc::JobInfoGcsService.grpc_client.AddJob.OnReplyReceived - 1 total (0 active), Execution time: mean = 95.243 us, total = 95.243 us, Queueing time: mean = 275.898 us, max = 275.898 us, min = 275.898 us, total = 275.898 us [state-dump] ray::rpc::JobInfoGcsService.grpc_client.AddJob - 1 total (0 active), Execution time: mean = 1.129 ms, total = 1.129 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] 
ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 1.877 ms, total = 1.877 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 908.800 us, total = 908.800 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 1 total (0 active), Execution time: mean = 71.369 us, total = 71.369 us, Queueing time: mean = 356.800 us, max = 356.800 us, min = 356.800 us, total = 356.800 us [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), Execution time: mean = 412.500 us, total = 412.500 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 395.400 us, total = 395.400 us, Queueing time: mean = 13.100 us, max = 13.100 us, min = 13.100 us, total = 13.100 us [state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 28.700 us, total = 28.700 us, Queueing time: mean = 11.800 us, max = 11.800 us, min = 11.800 us, total = 11.800 us [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 145.486 ms, total = 145.486 ms, Queueing time: mean = 17.200 us, max = 17.200 us, min = 17.200 us, total = 17.200 us [state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 492.000 us, total = 492.000 us, Queueing time: mean = 9.300 us, max = 9.300 us, min = 9.300 us, total = 9.300 us [state-dump] DebugString() time ms: 3 [state-dump] [state-dump] [2024-08-08 13:43:41,340 I 42024 41500] (raylet.exe) local_object_manager.cc:490: Restored 0 MiB, 1 objects, read throughput 0 MiB/s [2024-08-08 13:43:42,290 I 42024 43672] (raylet.exe) store.cc:513: Plasma store at capacity ========== Plasma store: ================= Current usage: 7.18136 / 15.4835 GB - num bytes created total: 44567606649 1 pending objects of total size 353MB - objects spillable: 476 - bytes spillable: 142731399 - objects unsealed: 0 - bytes unsealed: 0 - objects in use: 23926 - bytes in use: 7181359371 - objects evictable: 0 - bytes evictable: 0 - objects created by worker: 23924 - bytes created by worker: 7180757124 - objects restored: 2 - bytes restored: 602247 - objects received: 0 - bytes received: 0 - objects errored: 0 - bytes errored: 0 [2024-08-08 13:43:42,518 I 42024 41500] (raylet.exe) local_object_manager.cc:490: Restored 0 MiB, 3 objects, read throughput 0 MiB/s [2024-08-08 13:43:43,551 I 42024 41500] (raylet.exe) node_manager.cc:656: Sending Python GC request to 29 local workers to clean up Python cyclic references. [2024-08-08 13:43:43,569 I 42024 41500] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 35663 MiB, 124196 objects, write throughput 1279 MiB/s. [2024-08-08 13:43:44,792 I 42024 41500] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. 
[2024-08-08 13:43:45,595 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370201957 [2024-08-08 13:43:45,595 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 348538381 [2024-08-08 13:43:45,596 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370201957 [2024-08-08 13:43:45,596 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370199077 [2024-08-08 13:43:45,596 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370201957 [2024-08-08 13:43:45,596 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370201957 [2024-08-08 13:43:45,596 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370201957 [2024-08-08 13:43:45,596 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370201957 [2024-08-08 13:43:45,596 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370201957 [2024-08-08 13:43:45,596 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370199077 [2024-08-08 13:43:45,596 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370199077 [2024-08-08 13:43:46,012 I 42024 41500] (raylet.exe) local_object_manager.cc:490: Restored 2 MiB, 7 objects, read throughput 0 MiB/s [2024-08-08 13:43:46,852 I 42024 41500] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 35670 MiB, 124220 objects, write throughput 1241 MiB/s. [2024-08-08 13:43:47,202 I 42024 41500] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:43:47,678 I 42024 41500] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:43:47,830 I 42024 41500] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:43:47,960 I 42024 41500] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 35675 MiB, 124239 objects, write throughput 1195 MiB/s. [2024-08-08 13:43:47,970 I 42024 41500] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:43:48,172 I 42024 41500] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:43:48,535 I 42024 41500] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:43:49,711 I 42024 41500] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:43:49,721 I 42024 41500] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 36447 MiB, 126944 objects, write throughput 1152 MiB/s. [2024-08-08 13:43:49,871 I 42024 41500] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. 
[2024-08-08 13:43:51,888 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370201957 [2024-08-08 13:43:51,970 I 42024 41500] (raylet.exe) local_object_manager.cc:490: Restored 2 MiB, 9 objects, read throughput 0 MiB/s [2024-08-08 13:43:52,121 I 42024 41500] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:43:52,257 I 42024 41500] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 36815 MiB, 126997 objects, write throughput 1153 MiB/s. [2024-08-08 13:43:52,285 I 42024 41500] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:43:52,427 I 42024 41500] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:43:52,652 I 42024 41500] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:43:52,818 I 42024 41500] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:43:53,406 I 42024 41500] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 36982 MiB, 127581 objects, write throughput 1118 MiB/s. [2024-08-08 13:43:53,409 I 42024 41500] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:43:53,547 I 42024 41500] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:43:53,681 I 42024 41500] (raylet.exe) node_manager.cc:656: Sending Python GC request to 29 local workers to clean up Python cyclic references. [2024-08-08 13:43:55,557 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370201957 [2024-08-08 13:43:55,558 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370199077 [2024-08-08 13:43:55,558 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370201957 [2024-08-08 13:43:55,558 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370201957 [2024-08-08 13:43:55,558 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370199077 [2024-08-08 13:43:55,558 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370201957 [2024-08-08 13:43:55,559 I 42024 43672] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 370201957 [2024-08-08 13:43:55,562 C 42024 43672] (raylet.exe) dlmalloc.cc:129: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455 *** StackTrace Information *** unknown unknown unknown unknown unknown

(As a test I tried a smaller dataset, and it works fine; even with 120 partitions it still works and saves into 120 parquet files, proving that the partitioning is actually being done.)
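For reference, Modin's partition count can also be pinned explicitly via its config; a minimal sketch (the file paths are placeholders), consistent with the one-file-per-partition behaviour described above:

```python
import modin.config as cfg
import modin.pandas as pd

# Modin splits a DataFrame row-wise into NPartitions partitions; to_parquet()
# then writes one file per partition (hence the 120 parquet files above).
cfg.NPartitions.put(120)

df = pd.read_parquet("small_test.parquet")  # placeholder input
df.to_parquet("out_dir")                    # placeholder output directory
```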

Liquidmasl commented 1 month ago

As another example:

I load the same large dataset as I did before, then try this apply:

pcd['z_partition'] = pcd['z'].apply(lambda x: int(math.floor(x)))

As I understand it, and I am not sure I do, this should not need to load everything into memory, since the function is applied row by row; it should be fine to run in parallel... right? I really tried to understand this from your documentation, but I remain confused.
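For comparison, the same floor can be computed column-wise without a Python call per row; a minimal, self-contained sketch (the tiny frame here just stands in for the real point cloud):

```python
import modin.pandas as pd

# Stand-in for the real point-cloud frame.
pcd = pd.DataFrame({'z': [0.2, 1.7, -2.5]})

# Floor-divide by 1 to floor each value, then cast to int; this stays
# vectorized per partition instead of calling a lambda once per row.
pcd['z_partition'] = (pcd['z'] // 1).astype(int)
print(pcd)
```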

Anyway, I run into the same issue again, while my RAM hovers around 60% (screenshot: Screenshot 2024-08-08 131623).

Drive C has enough storage, `_storage` is set to 250 GB, and Windows virtual memory is set to a fixed 120 GB ... I wish I understood the logs. raylet.out tail:

========== Plasma store: ================= Current usage: 4.32794 / 16.3893 GB - num bytes created total: 70525122415 1 pending objects of total size 863MB - objects spillable: 2 - bytes spillable: 1809869980 - objects unsealed: 0 - bytes unsealed: 0 - objects in use: 4192 - bytes in use: 4327937765 - objects evictable: 0 - bytes evictable: 0 - objects created by worker: 3 - bytes created by worker: 1809876592 - objects restored: 4189 - bytes restored: 2518061173 - objects received: 0 - bytes received: 0 - objects errored: 0 - bytes errored: 0 [2024-08-08 13:15:51,607 I 28540 1428] (raylet.exe) node_manager.cc:656: Sending Python GC request to 29 local workers to clean up Python cyclic references. [2024-08-08 13:15:53,767 I 28540 1428] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 57069 MiB, 74058 objects, write throughput 1249 MiB/s. [2024-08-08 13:15:53,771 I 28540 3680] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024120E10000, 8589934600) [2024-08-08 13:15:53,802 I 28540 1428] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:15:53,851 I 28540 3680] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024320E20000, 17179869192) [2024-08-08 13:15:55,938 I 28540 3680] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 904934126 [2024-08-08 13:15:55,969 I 28540 1428] (raylet.exe) local_object_manager.cc:490: Restored 10188 MiB, 17773 objects, read throughput 134 MiB/s [2024-08-08 13:15:56,171 I 28540 1428] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:15:57,744 I 28540 1428] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 57932 MiB, 74059 objects, write throughput 1225 MiB/s. [2024-08-08 13:15:57,745 I 28540 3680] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024120E10000, 34359738376) [2024-08-08 13:15:57,760 I 28540 1428] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:15:59,829 I 28540 3680] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 904934990 [2024-08-08 13:15:59,851 I 28540 1428] (raylet.exe) local_object_manager.cc:490: Restored 10190 MiB, 17777 objects, read throughput 127 MiB/s [2024-08-08 13:16:00,053 I 28540 1428] (raylet.exe) client.cc:266: Erasing re-used mmap entry for fd 0000000000000868 [2024-08-08 13:16:00,079 I 28540 1428] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:16:00,853 I 28540 1428] (raylet.exe) local_object_manager.cc:490: Restored 10547 MiB, 18400 objects, read throughput 130 MiB/s [2024-08-08 13:16:01,633 I 28540 1428] (raylet.exe) node_manager.cc:656: Sending Python GC request to 29 local workers to clean up Python cyclic references. 
[2024-08-08 13:16:01,854 I 28540 1428] (raylet.exe) local_object_manager.cc:490: Restored 11118 MiB, 19396 objects, read throughput 135 MiB/s [2024-08-08 13:16:02,795 I 28540 3680] (raylet.exe) store.cc:513: Plasma store at capacity ========== Plasma store: ================= Current usage: 3.56607 / 16.3893 GB - num bytes created total: 73961601986 1 pending objects of total size 863MB - objects spillable: 1 - bytes spillable: 904934990 - objects unsealed: 0 - bytes unsealed: 0 - objects in use: 4429 - bytes in use: 3566066865 - objects evictable: 0 - bytes evictable: 0 - objects created by worker: 2 - bytes created by worker: 904941602 - objects restored: 4427 - bytes restored: 2661125263 - objects received: 0 - bytes received: 0 - objects errored: 0 - bytes errored: 0 [2024-08-08 13:16:04,525 I 28540 1428] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 58795 MiB, 74060 objects, write throughput 1199 MiB/s. [2024-08-08 13:16:04,540 I 28540 3680] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000024120E10000, 68719476744) [2024-08-08 13:16:04,587 I 28540 1428] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle. [2024-08-08 13:16:06,529 I 28540 3680] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 904934126 [2024-08-08 13:16:06,531 C 28540 3680] (raylet.exe) dlmalloc.cc:129: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455 *** StackTrace Information *** unknown unknown unknown
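For context, the knobs that size Ray's shared-memory store and decide where its temp and spill files go look roughly like this; the values below are illustrative, not my actual configuration:

```python
import ray

ray.init(
    object_store_memory=16 * 1024**3,  # bytes; roughly the ~16 GB store shown in the logs
    _temp_dir=r"D:\ray_tmp",           # hypothetical path on a drive with free space;
                                       # spilled objects land under the session dir here by default
)

import modin.pandas as pd  # import modin only after Ray is initialized
```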
Liquidmasl commented 1 month ago

But there's more!

If I chunk the dataframe manually and save the chunks, it works fine (it just leads to more issues later when I try to load the parquet files again):

        # `result` is the concatenated modin DataFrame and `path` the output directory
        # (both defined in the surrounding code); needs os, numpy (np) and tqdm imported.
        os.makedirs(path, exist_ok=True)
        chunk_size = 1_000_000
        num_chunks = int(np.ceil(len(result) / chunk_size))

        for i in tqdm(range(num_chunks), desc='saving chunks to parquets'):
            start_idx = i * chunk_size
            end_idx = min((i + 1) * chunk_size, len(result))
            chunk_df = result.iloc[start_idx:end_idx]

            chunk_file = os.path.join(path, f"chunk_{i}.parquet")
            chunk_df.to_parquet(chunk_file)
            del chunk_df

But if I increase the number of partitions to, say, 120, I get the error from before again.
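One variation I have not verified is materializing each chunk to plain pandas before the write, so the parquet writing itself bypasses the distributed writer; a sketch, reusing `result`, `path`, `chunk_size` and `num_chunks` from the loop above (`modin.utils.to_pandas` is the helper assumed here):

```python
import os
from modin.utils import to_pandas

# Untested idea, not a confirmed fix: write each chunk through plain pandas.
for i in range(num_chunks):
    chunk = result.iloc[i * chunk_size:(i + 1) * chunk_size]
    to_pandas(chunk).to_parquet(os.path.join(path, f"chunk_{i}.parquet"))
```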

Liquidmasl commented 1 month ago

I cannot reproduce this issue on Linux at all. It just seems to behave like it should there.