mito-ds / mito

The mitosheet package, trymito.io, and other public Mito code.
https://trymito.io

Merging large datasets can lead to multiple new dataframes #1008

Open aarondr77 opened 1 year ago

aarondr77 commented 1 year ago
  1. Download this large dataset from Kaggle
  2. Run the following code to make it even larger!
from mitosheet.public.v3 import *; # Analysis Name:id-xeotaadybr;
import pandas as pd
import mitosheet  # needed for the mitosheet.sheet(...) call at the end

# Imported Pakistan Largest Ecommerce Dataset.csv
Pakistan_Largest_Ecommerce_Dataset = pd.read_csv(r'/Users/aarondiamond-reivich/Downloads/Pakistan Largest Ecommerce Dataset.csv')

# Duplicated Pakistan_Largest_Ecommerce_Dataset
Pakistan_Largest_Ecommerce_Dataset_copy = Pakistan_Largest_Ecommerce_Dataset.copy(deep=True)

Pakistan_Largest_Ecommerce_Dataset_copy_2 = Pakistan_Largest_Ecommerce_Dataset.copy(deep=True)

# Concatenated 3 dataframes into df_concat
df_concat = pd.concat([Pakistan_Largest_Ecommerce_Dataset, Pakistan_Largest_Ecommerce_Dataset_copy, Pakistan_Largest_Ecommerce_Dataset_copy_2], join='inner', ignore_index=True)

# Duplicated df_concat
df_concat_copy = df_concat.copy(deep=True)

mitosheet.sheet(df_concat, df_concat_copy)
  3. Open the Merge taskpane. While the first merge is still loading, edit the merge configuration and notice that multiple new dataframes are created instead of one.

https://github.com/mito-ds/mito/assets/18709905/613759c3-020c-49d1-b1d2-cc12e15da6aa

Note that this is at least somewhat separate from Mito internal state management as this occurs on a fresh Mitosheet.

We should start by profiling this to see where the operation is running. Possible solutions are:

  1. Pass the new sheet index as a parameter to the step. The frontend would calculate it from the SheetDataArray when the taskpane opens.
  2. Ensure that every edit to the merge configuration uses the same stepID?
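
To make the two options above concrete, here is a minimal sketch of how they could combine. Everything in it is hypothetical and not Mito's actual code: `SheetState`, `apply_sheet_creating_step`, and the `step_to_sheet_index` mapping are invented for illustration, and sheets are represented as plain values rather than dataframes. The idea is that the caller supplies the destination sheet index up front, and any later edit carrying the same step ID overwrites that sheet instead of appending a new one.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class SheetState:
    """Hypothetical stand-in for backend state (not Mito's real class)."""
    sheets: List[Any] = field(default_factory=list)
    # Remembers which sheet index each step ID writes to.
    step_to_sheet_index: Dict[str, int] = field(default_factory=dict)


def apply_sheet_creating_step(
    state: SheetState,
    step_id: str,
    dest_sheet_index: int,
    compute: Callable[[], Any],
) -> int:
    """Run a sheet-creating step and write its result to a fixed index.

    dest_sheet_index is assumed to be computed by the caller (for example,
    the current number of sheets when the taskpane opens).
    """
    # A step ID that was seen before keeps its original destination.
    target = state.step_to_sheet_index.get(step_id, dest_sheet_index)
    if target > len(state.sheets):
        raise ValueError(f"sheet index {target} is out of range")

    result = compute()
    if target == len(state.sheets):
        state.sheets.append(result)    # first run: create the sheet
    else:
        state.sheets[target] = result  # later edits: overwrite in place
    state.step_to_sheet_index[step_id] = target
    return target
```

With two sheets already present, running this twice with the same step ID and destination index 2 leaves three sheets in total, with the third holding the result of the second run. The same shape would apply to pivot or concat, since `compute` is just a callable.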
aarondr77 commented 1 year ago

This was reported by M.B and then confirmed here and on an enterprise JupyterHub this week.

aarondr77 commented 1 year ago

We should investigate whether the same issue occurs with pivot tables, concat, and other sheet-creating events.

naterush commented 12 months ago

This is 95% just a race condition around when the step ID of a valid step gets saved. I think we can easily fix this by saving the step ID before the step completes, and then popping it if it generates an error.
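
A rough sketch of that fix, under loud assumptions: `StepTracker` and `run_step` are hypothetical names, not Mito's API, and the step body is passed in as a plain callable. The step ID is recorded before the step executes, so an edit that arrives while the first merge is still running is recognized as an overwrite of the same step. If the first attempt raises, the ID is popped again.

```python
from typing import Callable, List


class StepTracker:
    """Hypothetical tracker of known step IDs (illustration only)."""

    def __init__(self) -> None:
        self.step_ids: List[str] = []

    def run_step(self, step_id: str, execute: Callable[[], None]) -> bool:
        """Run a step; return True if it overwrote an already-known step."""
        is_overwrite = step_id in self.step_ids
        if not is_overwrite:
            # Save the ID *before* the step completes, so a concurrent
            # edit with the same ID is treated as an overwrite.
            self.step_ids.append(step_id)
        try:
            execute()
        except Exception:
            # The step turned out to be invalid: pop the ID we just saved.
            # IDs of steps that had succeeded earlier are left in place.
            if not is_overwrite:
                self.step_ids.remove(step_id)
            raise
        return is_overwrite
```

This only covers the bookkeeping. If steps really do run concurrently, access to `step_ids` would also need a lock or an equivalent ordering guarantee, which this sketch leaves out.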