modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

Usability problem when using Modin on omnisci in a jupyter notebook #3425

Open gshimansky opened 3 years ago

gshimansky commented 3 years ago

System information

OS: Ubuntu 20.04.1 LTS

Modin version: 0.10.2+22.g119a8be2

Python version: 3.9.7

Start a Jupyter notebook like this:

MODIN_BACKEND=omnisci MODIN_EXPERIMENTAL=true jupyter notebook

and create a new notebook with the following cells:

# Cell 1
import modin.pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
# Cell 2
df = df.astype({"a": "timestamp"})
# Cell 3
df

Describe the problem

Execute the cells in order: Cell 1, Cell 2, Cell 3. Only when Cell 3 runs does it become apparent that Cell 2 contains an error, because that statement was merely recorded in the lazy execution queue. Going back to Cell 2 and fixing it has no effect, because the original statement from Cell 2 is already recorded inside df. This contradicts the experience people normally expect from a Jupyter notebook.


devin-petersohn commented 3 years ago

@gshimansky I am not sure I understand the issue here. Can you expand a bit? When you overwrite variables in python this is what is expected, in my experience.

gshimansky commented 3 years ago

> @gshimansky I am not sure I understand the issue here. Can you expand a bit? When you overwrite variables in python this is what is expected, in my experience.

With lazy execution, once the statement in Cell 2 (which causes an exception) has been executed, that operation is remembered inside df. Any later attempt to access df raises the exception again, because Modin tries to flush the execution queue on every access and fails every time. There is no way to get the problematic operation out of df.
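The behaviour described above can be imitated without Modin. The following is only a rough sketch under stated assumptions: `LazyFrame`, `bad_cast`, and `good_cast` are hypothetical names invented for illustration, and the queue handling is a simplification, not Modin's or OmniSci's actual implementation.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated frame (hypothetical, not Modin code)."""

    def __init__(self, data, pending=()):
        self._data = data               # the original, materialized values
        self._pending = list(pending)   # queued operations, not yet executed

    def astype(self, caster):
        # Nothing is executed here: the operation is only queued on a new object.
        return LazyFrame(self._data, self._pending + [caster])

    def materialize(self):
        # Every access replays the whole queue, so a bad step fails every time.
        result = self._data
        for op in self._pending:
            result = op(result)
        return result


def bad_cast(values):
    # Hypothetical stand-in for the failing astype({"a": "timestamp"}).
    raise TypeError("data type 'timestamp' not understood")


def good_cast(values):
    # Hypothetical stand-in for a corrected Cell 2.
    return [float(v) for v in values]


df = LazyFrame([1, 2])        # Cell 1
df = df.astype(bad_cast)      # Cell 2: nothing runs, but df is already rebound

failures = 0
for _ in range(2):            # Cell 3, executed twice
    try:
        df.materialize()
    except TypeError:
        failures += 1

df = df.astype(good_cast)     # "fixed" Cell 2, re-run on the poisoned df
try:
    df.materialize()          # still fails: bad_cast is ahead of it in the queue
except TypeError:
    failures += 1

print(failures)
```

In this toy model every access fails, including the one after the corrected Cell 2, because the corrected operation is queued behind the broken one.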

In Python you get the exception once, df is not modified by Cell 2, and you can happily use it however you like afterwards. You can fix the problematic statement, or skip ahead to Cell 3 and print df; no exception will be raised.
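For comparison, here is a minimal eager sketch of the same three cells. It uses a plain list in place of a DataFrame, and `eager_astype` and `bad_cast` are hypothetical helpers made up for this illustration. Because the cast runs immediately, the exception is raised before the assignment happens, so the name keeps pointing at the original data.

```python
def eager_astype(values, caster):
    # Eager evaluation: the cast runs right away and may raise right here.
    return [caster(v) for v in values]


def bad_cast(value):
    # Hypothetical stand-in for casting to the unknown "timestamp" type.
    raise TypeError("data type 'timestamp' not understood")


column = [1, 2]                                # Cell 1

try:
    column = eager_astype(column, bad_cast)    # Cell 2: raises before assignment
except TypeError:
    pass                                       # the notebook would show the traceback once

still_original = column == [1, 2]              # Cell 3 works: data is untouched
print(still_original)

column = eager_astype(column, float)           # corrected Cell 2 now succeeds
print(column)
```

The failed statement leaves no trace, so correcting Cell 2 and re-running it works as a notebook user would expect.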

devin-petersohn commented 3 years ago

Ah, yes, I see. That makes sense; you are right. If df is broken by an exception thrown in the partitions, it is broken forever. We need a way to roll back the in-place changes if an exception is thrown.
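One possible shape for such a rollback, sketched on a toy class rather than on Modin itself: `RollbackFrame` and `astype_inplace` are hypothetical names, and dropping the whole pending queue on failure is just one assumed policy among several (another would be to drop only the failing step).

```python
class RollbackFrame:
    """Hypothetical lazy frame that discards queued steps when one of them fails."""

    def __init__(self, data):
        self._data = data      # last successfully materialized values
        self._pending = []     # queued in-place operations

    def astype_inplace(self, caster):
        # Queue the operation on this object; nothing is executed yet.
        self._pending.append(caster)

    def materialize(self):
        result = self._data
        try:
            for op in self._pending:
                result = op(result)
        except Exception:
            # Roll back: forget the queued steps so later accesses work again.
            self._pending.clear()
            raise
        # Success: commit the result and empty the queue.
        self._data = result
        self._pending.clear()
        return result


def bad_cast(values):
    # Hypothetical stand-in for the failing cast from Cell 2.
    raise TypeError("data type 'timestamp' not understood")


df = RollbackFrame([1, 2])
df.astype_inplace(bad_cast)

try:
    df.materialize()               # the error surfaces once
except TypeError:
    pass

after_rollback = df.materialize()  # the frame is usable again
print(after_rollback)

df.astype_inplace(lambda values: [float(v) for v in values])
print(df.materialize())
```

With this policy the exception is reported a single time and the frame returns to its last good state, which matches the eager behaviour notebook users are used to.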