Closed Liquidmasl closed 2 months ago
This method, in partition_manager,
seems to be the issue:
@classmethod
def map_partitions_joined_by_column(
    cls,
    partitions,
    column_splits,
    map_func,
    map_func_args=None,
    map_func_kwargs=None,
):
    """
    Combine several blocks by column into one virtual partition and apply "map_func" to them.

    Parameters
    ----------
    partitions : NumPy 2D array
        Partitions of Modin Frame.
    column_splits : int
        The number of splits by column.
    map_func : callable
        Function to apply.
    map_func_args : iterable, optional
        Positional arguments for the 'map_func'.
    map_func_kwargs : dict, optional
        Keyword arguments for the 'map_func'.

    Returns
    -------
    NumPy array
        An array of new partitions for Modin Frame.
    """
    if column_splits < 1:
        raise ValueError(
            "The value of columns_splits must be greater than or equal to 1."
        )
    # step cannot be less than 1
    step = max(partitions.shape[0] // column_splits, 1)
    preprocessed_map_func = cls.preprocess_func(map_func)
    kw = {
        "num_splits": step,
    }
    result = np.empty(partitions.shape, dtype=object)
    for i in range(
        0,
        partitions.shape[0],
        step,
    ):
        joined_column_partitions = cls.column_partitions(partitions[i : i + step])
        for j in range(partitions.shape[1]):
            result[i : i + step, j] = joined_column_partitions[j].apply(
                preprocessed_map_func,
                *map_func_args if map_func_args is not None else (),
                **kw,
                **map_func_kwargs if map_func_kwargs is not None else {},
            )
    return result
So when partitions.shape is (75, 1) and, in the outer loop, i = 74 and step = 2 (I don't know where step comes from), this returns a list that is 2 long:
joined_column_partitions[j].apply(
    preprocessed_map_func,
    *map_func_args if map_func_args is not None else (),
    **kw,
    **map_func_kwargs if map_func_kwargs is not None else {},
)
while this just wants 1 element:
result[i : i + step, j]
which means in this case...
result[74 : 75, 0] # where this just means ..
result[74,0] # because the shape[0] is just 75
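The mismatch described above can be reproduced with plain arithmetic, without Modin. The sketch below is only a model of the outer loop: `chunk_lengths` is a hypothetical helper, and it assumes that `apply` hands back exactly `num_splits` pieces, which is what the observed behaviour suggests.

```python
def chunk_lengths(n_rows, step):
    """Length of each slice partitions[i : i + step] for i in range(0, n_rows, step)."""
    return [min(step, n_rows - i) for i in range(0, n_rows, step)]


n_rows, step = 75, 2
lengths = chunk_lengths(n_rows, step)

# Slices whose length differs from num_splits=step, as (start index, slice length).
mismatched = [(k * step, length) for k, length in enumerate(lengths) if length != step]
print(mismatched)  # [(74, 1)]
```

Only the final slice, starting at i = 74, is shorter than step, so it is the only place where a 2-element result has to fit into a 1-element target.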
This all seems to happen because of this check in map_partitions:
if np.prod(partitions.shape) <= 1.5 * CpuCount.get():
This evaluates to false when I go above 64 partitions, so the behaviour changes.
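To see where that cut-over sits on a given machine, the condition can be evaluated on its own. This is a minimal sketch, not Modin code: `uses_simple_path` is a hypothetical name, and the CPU count of 8 is an arbitrary illustrative value rather than the one from this setup.

```python
import math


def uses_simple_path(shape, cpu_count):
    """Mirror of the quoted check: product of the grid shape vs. 1.5 * CPU count."""
    return math.prod(shape) <= 1.5 * cpu_count


cpu_count = 8  # assumed value for illustration only
print(uses_simple_path((12, 1), cpu_count))  # True:  12 <= 12.0
print(uses_simple_path((13, 1), cpu_count))  # False: 13 >  12.0
```

With a single column of partitions, the largest row count that still passes the check is `int(1.5 * cpu_count)`.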
Still, I fail to see how I can fix this. I neither need nor want column partitioning, as we have very few columns. This completely blocks my progress currently, and I am at a bit of a loss.
back to the partition_manager
joined_column_partitions = cls.column_partitions(partitions[i : i + step])
for j in range(partitions.shape[1]):
    result[i : i + step, j] = joined_column_partitions[j].apply(
        preprocessed_map_func,
        *map_func_args if map_func_args is not None else (),
        **kw,
        **map_func_kwargs if map_func_kwargs is not None else {},
    )
I found something I don't quite understand.
partitions[i : i + step].shape
= (1, 1)
This makes sense.
joined_column_partitions
is just 1 element, which makes sense.
joined_column_partitions[j].apply(...)
returns 2 elements, which doesn't make sense to me. And it also breaks.
It drives me nuts. I need to solve this, but I don't understand it.
So it seems the issue is that the apply function gets the number of splits through the **kw
parameter.
This is set right before with:
kw = {
    "num_splits": step,
}
This is an issue because with a step of 2 and 75 partitions, the last piece will be just 1 element, as noted in the previous comment.
After changing it to
kw = {
    "num_splits": len(partitions[i : i + step]),
}
the issue does not appear.
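The effect of that change can be checked with a small stand-alone model of the loop. This is a sketch under assumptions, not the actual Modin code path: `fake_apply` is a hypothetical stand-in that simply returns `num_splits` pieces, and `count_mismatches` is a hypothetical helper that counts slices where the number of returned pieces differs from the slice length.

```python
def fake_apply(num_splits):
    """Hypothetical stand-in for the column partition's apply: returns num_splits pieces."""
    return [object() for _ in range(num_splits)]


def count_mismatches(n_rows, step, fixed):
    """Count slices whose length differs from the number of pieces apply returns."""
    partitions = list(range(n_rows))
    bad = 0
    for i in range(0, n_rows, step):
        chunk = partitions[i : i + step]
        # Original code passes step; the proposed change passes the real slice length.
        num_splits = len(chunk) if fixed else step
        if len(fake_apply(num_splits)) != len(chunk):
            bad += 1
    return bad


print(count_mismatches(75, 2, fixed=False))  # 1
print(count_mismatches(75, 2, fixed=True))   # 0
```

Computing `num_splits` per slice keeps the two sides of the assignment the same length even when the partition count is not a multiple of step.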
But I can't see how I can apply that fix to my pipeline now without pulling and building from source, right? I would highly appreciate help from someone who knows what's up here, haha.
I am also affected by this issue. In my case I'm using awswrangler and a ray cluster and this has been very difficult to reproduce. It seems that it's sensitive to partitions and only happens sometimes.
I thought I could help myself out by just using an even number of partitions, but that only helps for a small number of partitions. Using 266 partitions I get:
File "/usr/local/lib/python3.11/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 873, in map_partitions_joined_by_column
2024-09-19T10:57:22.379431833Z result[i : i + step, j] = joined_column_partitions[j].apply(
2024-09-19T10:57:22.379433676Z ~~~~~~^^^^^^^^^^^^^^^^^
2024-09-19T10:57:22.379435480Z ValueError: could not broadcast input array from shape (8,) into shape (2,)
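The shapes in this error are consistent with the same leftover-slice problem. In the sketch below, the step of 8 is read off the `(8,)` in the message rather than taken from Modin's source, and `last_chunk_len` is a hypothetical helper.

```python
def last_chunk_len(n_rows, step):
    """Length of the final slice when n_rows rows are walked in strides of step."""
    remainder = n_rows % step
    return remainder if remainder else step


print(last_chunk_len(266, 8))  # 2, matching the target shape (2,)
print(last_chunk_len(75, 2))   # 1, the earlier 75-partition case
print(last_chunk_len(264, 8))  # 8, a multiple of step leaves no short slice
```

An even partition count only avoids a short final slice when step is 2. Since step itself is derived from the partition count in the quoted method, picking a safe count by hand is not straightforward.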
This is a major issue, as it completely undermines the whole purpose of the package. I need to be able to set partition sizes without it crashing...
Could a collaborator look into whether this solution would be viable for a hotfix? To me (someone who has too little insight into the inner workings) it looks like a simple bug with a simple solution...
Hi @Liquidmasl! Thanks for researching the problem. I opened https://github.com/modin-project/modin/pull/7399 to fix this problem. Will you be able to upgrade to the new version of Modin?
I cannot tell you how happy you are making me right now!
I will be able to upgrade to the new version immediately, yes!
Thank you very much for the super fast response time here, amazing!
I suppose it might be some time until the next release, so until then I will try to install directly from GitHub. Let's see how successful I am.
Thanks again!
Modin version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest released version of Modin.
[ ] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
I am really struggling to create a reproducer; none of the data I create leads to the error. I will keep trying to find artificial data that works.
I have a point cloud that I save in 75 partitions to Parquet files.
When I load it again (using a fresh process), I get issues.
leads to
works fine
Setting MinColumnsPerPartition to something larger than the number of columns I have (it's just 15 columns) changes nothing.
Issue Description
Using a number of partitions that is not a power of 2 can lead to issues. Details below.
Expected Behavior
I would like the operations to not fail with the given error
Error Logs
Installed Versions