Open Liquidmasl opened 1 month ago
Also: I don't get log messages for spills, which made me think that I don't spill, but the existence of the Ray spill folder suggests that I do.
Also, I upped the Modin memory using the env var MODIN_MEMORY to 100 GB. Now it did not crash at 64 GB, but it spilled 88 GB, which is around 4x the size of the completed .parquet folder (in which the parts are compressed with Snappy).
Why is it spilling so much?
Nevermind, today everything that worked yesterday does not work anymore.
I can't seem to save with to_parquet(). I always get:
(raylet.exe) dlmalloc.cc:129: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455
[2024-08-06 15:07:29,442 I 42036 28692] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2221201143
[2024-08-06 15:07:29,498 I 42036 19508] (raylet.exe) local_object_manager.cc:490: Restored 19095 MiB, 11117 objects, read throughput 233 MiB/s
[2024-08-06 15:07:30,297 I 42036 19508] (raylet.exe) client.cc:266: Erasing re-used mmap entry for fd 0000000000000834
[2024-08-06 15:07:30,342 I 42036 19508] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-06 15:07:32,413 I 42036 19508] (raylet.exe) node_manager.cc:656: Sending Python GC request to 29 local workers to clean up Python cyclic references.
[2024-08-06 15:07:36,110 I 42036 28692] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000017C2F150000, 2281701384)
[2024-08-06 15:07:36,355 I 42036 19508] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 69868 MiB, 24697 objects, write throughput 851 MiB/s.
[2024-08-06 15:07:36,356 I 42036 19508] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-06 15:07:36,360 I 42036 28692] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000017D3F170000, 4294967304)
[2024-08-06 15:07:38,633 I 42036 28692] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2221201143
[2024-08-06 15:07:38,634 I 42036 28692] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2221201141
[2024-08-06 15:07:38,664 I 42036 28692] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2221201141
[2024-08-06 15:07:39,689 C 42036 28692] (raylet.exe) dlmalloc.cc:129: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455
I expect the shared memory to get full and objects to spill; basically that's why I use Modin, to be able to work out of core and spill.
But it seems something isn't working out. I have been sweating for two days now just to save and load Parquet files. I haven't even arrived at processing the data. Please save me.
It turns out MODIN_MEMORY does absolutely nothing if Ray is initialised manually. It seems it only worked a few times because at the moment Ray was initialised my virtual memory happened to be larger than at other times (it is dynamic by default on Windows, if I am not mistaken), so pure coincidence.
If Ray is initialised by Modin, the env var is used, but it is set as both _memory AND object_store_memory, which leads to my PC freezing and dying.
Also, _memory can be larger than the shared memory, while object_store_memory cannot! So it is actually very debilitating that Modin initializes Ray like that!
I feel like those two values should not be set to the same value, not by default and honestly also not as a fallback! I will open ANOTHER issue for this, as I think this is not intended, or should be changed.
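If Ray is initialised manually, the two limits can be set independently. A minimal sketch, assuming Ray's documented `ray.init` parameters `_memory` and `object_store_memory`; the 100/32 GiB figures are purely illustrative, not recommendations:

```python
def gib(n):
    """Convert GiB to bytes; ray.init expects raw byte counts."""
    return n * 1024 ** 3


def init_ray_with_separate_limits():
    # Keep the shared-memory object store well below the overall
    # memory limit instead of letting both be set to the same value.
    import ray  # imported lazily so the sketch stands alone

    ray.init(
        _memory=gib(100),             # overall Ray memory limit
        object_store_memory=gib(32),  # plasma shared-memory store only
    )
```

Calling `init_ray_with_separate_limits()` before importing Modin would then make Modin attach to the already-initialised Ray instance instead of starting its own.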
This seems to be the issue here: https://github.com/modin-project/modin/issues/7361
Well... it's not; I still run into this issue.
Hi @Liquidmasl
Thanks for your attention to Modin.
As I see from your reproducer in issue #7359, you are trying to work with a large DataFrame that is bigger than the available memory. You are getting this error because a Ray worker tried to get part of your DataFrame, but there is not enough memory to do so.
First of all, please read the docs article "How Modin splits a DataFrame" to have a clearer understanding of the next steps.
Let's deal with your pipeline:
To avoid this, you can increase the value of NPartitions.
For example, if you set NPartitions to CPU count * 2, only half of your large DataFrame will be processed simultaneously.
NOTE: A large number of NPartitions can lead to slower performance, but sometimes it is the only way.
You can modify NPartitions globally:
import modin.config as cfg
cfg.NPartitions.put(100)
or locally:
import modin.config as cfg
with cfg.context(NPartitions=100):
    ...  # some code here
However, this tip does not always help, because some DataFrame operations require loading all partitions along the full axis. Unfortunately, we don't have enough resources right now to fix this quickly, but it's a good task for the future.
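Putting the CPU count * 2 suggestion into code; a sketch assuming `modin.config` exposes `CpuCount` (Modin's detected number of CPUs):

```python
import modin.config as cfg

# Set the global partition count to twice the detected CPU count,
# per the suggestion above.
cfg.NPartitions.put(cfg.CpuCount.get() * 2)
```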
That does make sense, but with respect to the log output it also kind of does not. The error fairly clearly appears when Ray tries to spill to the hard drive, not when it writes into memory.
Anyway, I am currently attempting it with more partitions!
It does not help. I tried with up to 120 partitions; everything worked the same (maybe a bit slower), but on saving the same error occurred. Meanwhile my RAM did not even reach 60%.
raylet.out tail:
(As a test I tried a smaller dataset that works fine; with 120 partitions it still works fine and saves into 120 Parquet files, proving that the partitioning is actually being done.)
As another example
I load the same large dataset as I did before, then try this apply:
pcd['z_partition'] = pcd['z'].apply(lambda x: int(math.floor(x)))
As I understand it (and I am not sure I do), this should not need to load everything into memory, since it applies row-wise and should be fine to run in parallel... right? I really tried to understand by reading your documentation, but I stay confused.
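For reference, the same computation can be written vectorized, avoiding a Python lambda call per row. A sketch of the equivalent semantics, demonstrated on a plain NumPy array (the DataFrame form is the commented line):

```python
import numpy as np

# On the (modin or pandas) DataFrame from above, the vectorized form would be:
#   pcd['z_partition'] = np.floor(pcd['z']).astype('int64')
# Shown here on a plain array to illustrate the semantics:
z = np.array([-1.5, -0.2, 0.0, 0.7, 2.3])
z_partition = np.floor(z).astype('int64')
print(z_partition.tolist())  # floor rounds toward -inf: [-2, -1, 0, 0, 2]
```

Note that `np.floor(...).astype('int64')` matches `int(math.floor(x))` for negative values too, whereas a bare `astype('int64')` would truncate toward zero.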
Anyway, I run into the same issue again while my RAM hovers around 60%.
C: has enough storage; _storage is set to 250 GB. Virtual memory in Windows is set to a fixed 120 GB.
... I wish I understood the logs.
raylet.out tail:
But there's more!
If I chunk the DataFrame manually and save the chunks, it works fine (it just leads to more issues later when I try to load the Parquet files again):
import os

import numpy as np
from tqdm import tqdm

os.makedirs(path, exist_ok=True)
chunk_size = 1_000_000
num_chunks = int(np.ceil(len(result) / chunk_size))
for i in tqdm(range(num_chunks), desc='saving chunks to parquets'):
    start_idx = i * chunk_size
    end_idx = min((i + 1) * chunk_size, len(result))
    chunk_df = result.iloc[start_idx:end_idx]
    chunk_file = os.path.join(path, f"chunk_{i}.parquet")
    chunk_df.to_parquet(chunk_file)
    del chunk_df
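Loading the chunks back needs some care: a plain `sorted(glob.glob(...))` orders `chunk_10.parquet` before `chunk_2.parquet` lexicographically. A sketch that sorts by the numeric index instead (the read-back lines are hypothetical and require pandas with a Parquet engine):

```python
import glob
import os
import re


def chunk_index(filename):
    """Numeric index from a chunk file name like 'chunk_12.parquet'."""
    return int(re.search(r"chunk_(\d+)\.parquet$", filename).group(1))


def sorted_chunk_files(path):
    # Sort numerically; lexicographic order would put chunk_10 before chunk_2.
    return sorted(glob.glob(os.path.join(path, "chunk_*.parquet")),
                  key=chunk_index)

# Hypothetical read-back (requires pandas with a Parquet engine installed):
#   import pandas as pd
#   result = pd.concat(map(pd.read_parquet, sorted_chunk_files(path)),
#                      ignore_index=True)
```

Alternatively, zero-padding the index at write time (`f"chunk_{i:06d}.parquet"`) makes plain lexicographic sorting correct.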
But if I increase the partitions to, say, 120, I get the error from before again.
I cannot reproduce this issue on Linux at all; there it just behaves like it should.
I am at a loss here.
I load multiple batches of data into Modin DataFrames, concat them together into one large Modin DataFrame, and then try to save it with to_parquet().
I do that after importing Ray and calling init() on it manually, before importing Modin (because I noticed it works better like that, for some reason; see https://github.com/modin-project/modin/issues/7359).
This (suddenly) fails with:
raylet.out tail is:
Even more annoying is that this did not happen a few hours ago; I could successfully save the Parquet file. When I checked the tmp folder I realised that the spilled-objects folder is exactly 64 GB in size, which seemed strange to me, especially because the log message somehow sounds like it might have something to do with missing space.
Can I somehow set the max size of spilled objects? Why has it spilled more than the size of the data? Are some of the spills from prior operations and not cleaned up?
What's going on?
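On the spill question: I have not found a simple cap on total spill size, but Ray's object-spilling documentation does describe a `_system_config` knob that at least pins where spills go and the write buffer size. A sketch assuming that documented interface; the directory and buffer size are illustrative values, not recommendations:

```python
import json


def spill_system_config(directory, buffer_size=1024 * 1024):
    # Shape taken from Ray's object-spilling docs: the value is a
    # JSON-encoded string, not a nested dict.
    return {
        "object_spilling_config": json.dumps({
            "type": "filesystem",
            "params": {
                "directory_path": directory,
                "buffer_size": buffer_size,
            },
        })
    }

# Hypothetical usage (requires ray; directory is made up):
#   import ray
#   ray.init(_system_config=spill_system_config("D:/ray_spill"))
```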