Closed Liquidmasl closed 2 months ago
Thanks for opening a new issue. Is this issue Windows-specific? Can you try out the same data/code on a similar machine running Linux?
The dashboard resource reporting problems should be solved by #45578, which is not (yet) part of a release. Does the dashboard work properly if you manually apply the changes there?
> Thanks for opening a new issue. Is this issue Windows-specific? Can you try out the same data/code on a similar machine running Linux?
I can try setting this all up in a Docker container, but I assume that's not very straightforward. Other than that, I just have a Linux machine with A LOT more RAM at hand.

In https://github.com/modin-project/modin/issues/7360#issuecomment-2273836170 I got a response that the issue is that I am simply running out of RAM, because I used the default number of partitions (== logical processors), so parallel processing loads everything into RAM at once. Could this be the issue here? I will try with a higher number of partitions.
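For reference, the partition count mentioned above can be raised through Modin's config module before any dataframe work starts; a minimal sketch (the value 64 is an arbitrary example, not a recommendation):

```python
import modin.config as cfg

# The default equals the number of CPUs; a higher count makes each partition
# smaller, which can lower the peak memory needed per worker in parallel ops.
cfg.NPartitions.put(64)  # arbitrary example value

import modin.pandas as pd  # import after configuring so the setting takes effect
```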
> The dashboard resource reporting problems should be solved by #45578, which is not (yet) part of a release. Does the dashboard work properly if you manually apply the changes there?
I might not have the capacity to try, but it's good to know it's in the works. For now it's not a dealbreaker, thank you for pointing the way!
The proposed idea (increasing partitions) did not help. If anything, I feel like it made things worse.
As another example: I load the same large dataset as before, then try this apply:

```python
pcd['z_partition'] = pcd['z'].apply(lambda x: int(math.floor(x)))
```
As I understand it (and I am not sure that I do), this should not need to load everything into memory, since it applies to rows independently, so it should be fine to run in parallel... right? I really tried to understand by reading your documentation, but I remain confused.
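For what it's worth, the lambda above is a plain per-element floor; a stdlib sketch of what it computes on a single value (`z_partition` is a name I made up for illustration):

```python
import math

def z_partition(z: float) -> int:
    """Integer slab index for a z coordinate: the floor of z."""
    return int(math.floor(z))

# Each row is handled independently of all others, which is why the
# operation is, in principle, embarrassingly parallel.
assert z_partition(3.7) == 3
assert z_partition(-0.2) == -1  # floor, not truncation toward zero
```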
Anyway, I run into the same issue again, while my RAM hovers around 60%. Drive C has enough storage; `_storage` is set to 250 GB, and virtual memory in Windows is set to a fixed 120 GB.

... I wish I understood the logs.
raylet.out tail:
I think you hit the fallback allocation code path, which is not tested on Windows. Are you able to use Linux or a larger Windows machine with more memory?
> I think you hit the fallback allocation code path, which is not tested on Windows. Are you able to use Linux or a larger Windows machine with more memory?
This code will run in Docker in production, and I have a Linux dev machine available (most of the time). Once I use Linux, all of this should be more or less a walk in the park.

Maybe a big fat warning sign somewhere saying, if possible, not to use Windows would be great, to save someone's sanity haha.

Thank you.
What happened + What you expected to happen
I am desperate, confused, and in need of help!
If I try to load a big dataset (using Modin) with `from_parquet`, apply a function to a column, save with `to_parquet`, etc., a raylet dies with this error:
Smaller datasets work fine.
I work on a single PC with 64 GB of RAM and 20 logical processors, on Windows. A 43 GB file (with 12 columns and around 12 billion rows) leads to issues.
Setting `_memory` higher does not seem to change anything (anymore; it was necessary to even get this far). Currently the logs suggest that it is spilling data just fine as soon as the object store is full, but at some point it just stops. I am not positive about that, though, as the logs are quite complicated. The dashboard also seems to say 0 B are used from storage, and it still shows 0/200 GB for Memory (I have set `_memory` to 200 GB in `ray.init()`).

I had this issue earlier when my hard drive was full (fair enough); now I have around 600 GB free. I have set my virtual memory to 120 GB (twice my RAM), but that did not help either. I am still unsure whether this is a user/hardware error or a bug. I also wrote... a bunch... of posts on Modin's issue board; here is the one most relevant to this question: https://github.com/modin-project/modin/issues/7360
raylet.out tail:
What can I do to make it work? Ingest in smaller parts? Use more partitions?

In the end this code will run on a machine with 500 GB of RAM, but it will also be processing datasets that are larger, 200 GB+.
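On the "smaller parts" idea: one way to bound peak memory is to process a fixed number of input files at a time; a minimal stdlib sketch of the batching driver (the file names are hypothetical placeholders, and the actual load/process/append step is left out):

```python
from typing import Iterator, List

def batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield fixed-size batches so only one batch of inputs is resident at a time."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical parquet part files; in practice these would come from a glob.
files = [f"part-{i:04d}.parquet" for i in range(10)]
assert [len(b) for b in batches(files, 4)] == [4, 4, 2]
```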
Versions / Dependencies
```
modin       : 0.31.0
ray         : 2.34.0
python      : 3.11.8.final.0
python-bits : 64
OS          : Windows
OS-release  : 10
Version     : 10.0.22631
machine     : AMD64
processor   : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder   : little
LC_ALL      : None
LANG        : None
LOCALE      : English_Austria.1252
pandas      : 2.2.2
numpy       : 1.26.4
```
Reproduction script
I cannot provide a surefire reproduction script, because this will be highly dependent on the hardware.

Here's my best attempt:
Issue Severity
High: It blocks me from completing my task.