paddymul / buckaroo

Buckaroo - the data wrangling assistant for pandas. Quickly explore dataframes, and run pandas commands via a GUI. Works inside the jupyter notebook.
https://buckaroo-data.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Buckaroo crashes Python interpreter when DataFrames include ``pandas.Timestamp`` #267

Closed theOehrly closed 2 months ago

theOehrly commented 5 months ago

Checks

What type of jupyter notebook were you using (VSCode notebook, google colab, Jupyter Lab, Jupyter notebook). Select multiple if you can reproduce this in multiple environments. If other, please add to description.

Google Colab, Jupyter Notebook

Reproducible example

import buckaroo
import pandas as pd

example_df = pd.DataFrame({'A': [pd.Timestamp(day=1, month=1, year=2024), 
                              pd.Timestamp(day=2, month=1, year=2024), 
                              pd.Timestamp(day=3, month=1, year=2024)], 
                        'B': [4, 5, 6]})

example_df

Issue description

When a pandas.DataFrame includes values of type pandas.Timestamp, this results in a crash of the Python interpreter when the DataFrame is visualized in Jupyter using Buckaroo.

The problem is reproducible in at least:

For completeness: while the Python interpreter crashes in Google Colab (and apparently produces a segmentation fault in other environments), it simply hangs indefinitely on Windows, or at least for more than a few minutes. After that, I killed the process.

Here is a sample Google Colab notebook that reproduces the crash: https://colab.research.google.com/drive/1KgW5a_Ufw1np3RrueS8o11jYM_7rqC3Z?usp=sharing

This issue was originally reported by @paddymul in https://github.com/theOehrly/Fast-F1/issues/565

Expected behavior

The interpreter should not crash. The DataFrame should be visualized correctly.

Installed versions

```
Selected Jupyter core packages...
executing in google-colab
buckaroo         : 0.6.11
jupyterlab       : not installed
notebook         : 6.5.5
ipywidgets       : 7.7.1
traitlets        : 5.7.1
jupyter_core     : 5.7.2
pandas           : 2.0.3
numpy            : 1.25.2
IPython          : 7.34.0
ipykernel        : 5.5.6
jupyter_client   : 6.1.12
jupyter_server   : 1.24.0
nbclient         : 0.10.0
nbconvert        : 6.5.4
nbformat         : 5.10.3
qtconsole        : not installed

buckaroo         : /usr/local/lib/python3.10/dist-packages/buckaroo/__init__.py
jupyterlab       : not installed
notebook         : /usr/local/lib/python3.10/dist-packages/notebook/__init__.py
ipywidgets       : /usr/local/lib/python3.10/dist-packages/ipywidgets/__init__.py
traitlets        : /usr/local/lib/python3.10/dist-packages/traitlets/__init__.py
jupyter_core     : /usr/local/lib/python3.10/dist-packages/jupyter_core/__init__.py
pandas           : /usr/local/lib/python3.10/dist-packages/pandas/__init__.py
numpy            : /usr/local/lib/python3.10/dist-packages/numpy/__init__.py
IPython          : /usr/local/lib/python3.10/dist-packages/IPython/__init__.py
ipykernel        : /usr/local/lib/python3.10/dist-packages/ipykernel/__init__.py
jupyter_client   : /usr/local/lib/python3.10/dist-packages/jupyter_client/__init__.py
jupyter_server   : /usr/local/lib/python3.10/dist-packages/jupyter_server/__init__.py
nbclient         : /usr/local/lib/python3.10/dist-packages/nbclient/__init__.py
nbconvert        : /usr/local/lib/python3.10/dist-packages/nbconvert/__init__.py
nbformat         : /usr/local/lib/python3.10/dist-packages/nbformat/__init__.py
qtconsole        : not installed
```

Jupyter Log output

No response

nasrin1748 commented 5 months ago

> Python interpreter crashes in Google Colab (and apparently produces a segmentation fault in other environments), it simply hangs indefinitely on Windows. Or at least more than a few minutes. After that, I killed the process.

When something like this happens, disabling buckaroo is the best option. It's mentioned in the documentation. To run buckaroo in Google Colab, it needs special initiation code. Refer to the following: https://colab.research.google.com/github/paddymul/buckaroo/blob/main/example-notebooks/Full-tour.ipynb

theOehrly commented 5 months ago

> When something like this happens, disabling buckaroo is the best option. It's mentioned in the documentation.

I don't really care about Buckaroo, I don't use it. I'm just responding to an issue that was opened against my project by the maintainer of buckaroo. It turns out the problem is not caused by my project, though. Having already done the debugging work and having more or less isolated the problem, I decided to open this issue here.

> To run buckaroo in Google Colab, it needs special initiation code. Refer to the following: https://colab.research.google.com/github/paddymul/buckaroo/blob/main/example-notebooks/Full-tour.ipynb

Right, I missed that. It doesn't matter, but I updated the example accordingly. As expected, the problem persists. It happens during the internal serialization of the pandas object to JSON. The actual crash seems to be inside json.dumps.

paddymul commented 5 months ago

@theOehrly Thanks for the bug report! I hadn't narrowed it down to Timestamp. My strong suspicion is that this is an upstream bug in pandas. Nothing that Buckaroo does should cause a segfault.

I will close this bug once I have a workaround released for buckaroo. I will also file a bug with pandas, but expect that to take longer.
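For illustration only, one possible shape for such a workaround is to turn timestamp values into ISO 8601 strings before they reach the JSON encoder. This is a sketch under assumptions, not Buckaroo's actual fix: the helper name `to_json_safe` is hypothetical, and plain `datetime.datetime` objects stand in for `pandas.Timestamp` (which is a subclass of `datetime.datetime`), so the example needs only the standard library.

```python
import json
from datetime import datetime


def to_json_safe(value):
    """Fallback hook for json.dumps (hypothetical helper, not part of Buckaroo).

    json.dumps calls this only for objects it cannot encode on its own.
    """
    if isinstance(value, datetime):
        # pandas.Timestamp subclasses datetime, so it would take this branch too.
        return value.isoformat()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


# Rows shaped like the DataFrame from the reproducible example.
rows = [
    {"A": datetime(2024, 1, 1), "B": 4},
    {"A": datetime(2024, 1, 2), "B": 5},
    {"A": datetime(2024, 1, 3), "B": 6},
]

payload = json.dumps(rows, default=to_json_safe)
print(payload)
```

The point of the `default=` hook is that the encoder only ever sees strings and integers, whatever the original cell type was.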

theOehrly commented 5 months ago

@paddymul calling .to_json() directly on the original DataFrame works fine.

But the code in pandas.io receives an object that seems to have been modified by buckaroo (or something else in the chain here). At least, its repr contained additional info that looked related to buckaroo. The segfault then occurred in json.dumps when this object was passed to it.

I stopped investigating there, but I wouldn't rule out buckaroo completely yet.
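The direct `.to_json()` check mentioned above can be sketched roughly as follows. It assumes a fresh Python session in which buckaroo has not been imported, so only plain pandas serialization is exercised. The `date_format="iso"` argument is a choice made here for readable output, not something taken from the original debugging session.

```python
import json

import pandas as pd

# Same data as in the reproducible example, built without importing buckaroo.
example_df = pd.DataFrame({
    "A": [pd.Timestamp(year=2024, month=1, day=day) for day in (1, 2, 3)],
    "B": [4, 5, 6],
})

# Serialize with pandas directly; the default orient for a DataFrame is
# column -> {index label -> value}.
raw = example_df.to_json(date_format="iso")

# Round-trip through the standard library to confirm the output is valid JSON.
decoded = json.loads(raw)
print(decoded["A"]["0"])
print(decoded["B"])
```

If this runs cleanly while displaying the same DataFrame through Buckaroo crashes, the difference lies somewhere between the original DataFrame and the object that finally reaches the serializer.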

paddymul commented 5 months ago

I filed a bug against pandas here https://github.com/pandas-dev/pandas/issues/58160

This exists in pandas 2.0.3 and is fixed in pandas 2.1.0. I currently support pandas back to 1.3.5, so I will try to find an earlier version of pandas where this does work and adjust the project requirements accordingly.
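As a rough sketch of how those version facts could be encoded, the snippet below classifies a pandas version string. Only two data points come from this thread (2.0.3 is affected, 2.1.0 and later are fixed), so everything else is reported as unknown. The function names `parse_version` and `timestamp_bug_status` are hypothetical, and the parser is deliberately minimal, using only the standard library.

```python
def parse_version(text):
    """Parse the leading 'major.minor.patch' of a version string into ints."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for char in piece:
            if not char.isdigit():
                break  # stop at suffixes such as 'rc0' or '+local'
            digits += char
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # treat '2.1' as '2.1.0'
    return tuple(parts)


def timestamp_bug_status(pandas_version):
    """Classify a pandas version using only what this thread establishes."""
    version = parse_version(pandas_version)
    if version >= (2, 1, 0):
        return "fixed"
    if version == (2, 0, 3):
        return "affected"
    return "unknown"  # not tested in this thread


for candidate in ("1.3.5", "2.0.3", "2.1.0"):
    print(candidate, timestamp_bug_status(candidate))
```

Returning "unknown" for untested versions avoids guessing where the regression started, which is exactly the open question in the comment above.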

paddymul commented 2 months ago

Fixed in 0.6.12.