Closed zmbc closed 10 months ago
@zmbc thank you for reporting this issue. I can reproduce it at version ea8088af4cadfb76294e458e5095f262ca85fea9. Indeed, you should assume that your callables will apply to pandas objects as opposed to modin.pandas objects, and you should avoid referencing modin.pandas objects within the callables. This is an important bug that we should document and possibly even warn users about the first time they use a method like apply
or map
.
Even if we did allow access to modin.pandas objects inside the callables, the performance would probably not be good because we'd have to pass the entire dataframe to every partition that's accessing the dataframe. For a distributed dataframe, you want to avoid ever materializing all the data in one worker. Better would be to rewrite the code so it's more amenable to distribution, e.g. for your example:
teams.set_index('name').merge(abbreviations.to_frame(), left_on='league_abbreviation', right_index=True, how='left')[0].rename('league')
alternatively, if the list of abbrevations is really small, use df.replace
with a dictionary of abbreviations:
teams.set_index('name').league_abbreviation.replace({'MLB': 'Major League Baseball', "NBA": 'National Basketball Association'})
Despite what I've written above, the error you get here isn't helpful at all. I'll investigate to see why Modin raises this error and whether we can either fix the bug or give a more informative error.
(by the way, a note that @zmbc might not find useful but that might be useful for me and future modin maintainers: this works fine in MODIN_ENGINE=PYTHON
. Seems like it's a ray/dask problem with serializing modin frames for partitions)
@mvashishtha Maybe changing __reduce__
function will help us here?
Something like:
@classmethod
def _inflate_full(cls, pandas_series, source_pid):
"""
Re-creates the object from previously-serialized disk-storable representation.
Parameters
----------
pandas_series : pandas.Series
Data to use for object re-creation.
Returns
-------
Series
New Series based on the `pandas_series`.
"""
import os
current_pid = os.getpid()
print(f"Current pid: {current_pid}")
if current_pid != source_pid:
# Pandas objects can be used in any processes.
return pandas_series
# Modin objects can only be created in the main process.
return cls(data=pandas_series)
def __reduce__(self):
import os
self._query_compiler.finalize()
pid = os.getpid()
print(f"Source pid: {pid}")
# there is can be the flag in Series itself:
# for example, self._make_full_reduce
if True: # PersistentPickle.get():
return self._inflate_full, (self._to_pandas(), pid)
return self._inflate_light, (self._query_compiler, self.name)
Even if we did allow access to modin.pandas objects inside the callables, the performance would probably not be good because we'd have to pass the entire dataframe to every partition that's accessing the dataframe. For a distributed dataframe, you want to avoid ever materializing all the data in one worker.
I'm not so sure that this is a fundamental limitation. If the callable could directly use modin.pandas objects, it could do some operation on them, such as a .loc
, that greatly reduced the data size before it needed to be materialized. Sure, it could do some operation using an entire Modin dataframe, and that would be slow, but I don't think anyone would be surprised that it was slow.
My example is obviously contrived and could have been done much better with join
or merge
. But I do think this functionality would be useful to have in general.
Even if we did allow access to modin.pandas objects inside the callables, the performance would probably not be good because we'd have to pass the entire dataframe to every partition that's accessing the dataframe. For a distributed dataframe, you want to avoid ever materializing all the data in one worker.
I'm not so sure that this is a fundamental limitation. If the callable could directly use modin.pandas objects, it could do some operation on them, such as a
.loc
, that greatly reduced the data size before it needed to be materialized. Sure, it could do some operation using an entire Modin dataframe, and that would be slow, but I don't think anyone would be surprised that it was slow.My example is obviously contrived and could have been done much better with
join
ormerge
. But I do think this functionality would be useful to have in general.
Agree. We can give performance warnings, but the code should work anyway.
Modin version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest released version of Modin.
[X] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
Issue Description
This code fails because it tries to use the Modin Series
abbreviations
within a.apply
function.Running on either Ray or Dask, the error you get is pretty cryptic -- it's not easy to tell that this is the problem.
Expected Behavior
Ideally, it would behave like Pandas, allowing the use of a Modin object within an
apply
. If this is not possible, at least a nicer error message would help.Error Logs
With Ray:
With dask:
Installed Versions