glue-viz / glue

Linked Data Visualizations Across Multiple Files
http://glueviz.org

Pandas DataFrames with dtype == 'object' columns cannot be saved/restored #2330

Open jfoster17 opened 2 years ago

jfoster17 commented 2 years ago

Describe the bug Pandas DataFrames created within glue and added to the data_collection manager may have columns of dtype 'object', which means they cannot be saved/restored by glue (glue.core.state._load_numpy calls np.load() without allow_pickle=True). This is generally not a problem when reading files using the Pandas data_factory (which converts columns), but it does cause problems, for instance, for datasets retrieved from external sources within a glue session.

To Reproduce Steps to reproduce the behavior:

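  1. Create a Pandas DataFrame within glue and add it to the data_collection. For instance, one might use the process described in the documentation:
    from pandas import DataFrame
    # dc is the data collection exposed in glue's IPython terminal
    df1 = DataFrame()
    df1['a'] = [1.2, 3.4, 2.9]
    df1['g'] = ['r', 'q', 's']  # string column, stored with dtype 'object'
    dc['dataframe'] = df1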
  2. Save Session (this new Data object will be stored as a numpy array within the session file since it did not come from an external file)
  3. Restore Session
  4. Get the following error:

    ValueError: Object arrays cannot be loaded when allow_pickle=False
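The error in step 4 can be reproduced outside glue with NumPy alone. The sketch below is not glue's code: the `roundtrip` helper is a hypothetical stand-in for the save/restore path, and the object array stands in for what a DataFrame string column of dtype 'object' becomes when stored as a numpy array.

```python
import io

import numpy as np


def roundtrip(array):
    """Save an array to an in-memory .npy buffer and load it back.

    Hypothetical helper for illustration; np.load is called without
    allow_pickle=True, as the issue reports for glue's loader.
    """
    buffer = io.BytesIO()
    np.save(buffer, array)   # object arrays are pickled on save
    buffer.seek(0)
    return np.load(buffer)   # allow_pickle defaults to False


floats = np.array([1.2, 3.4, 2.9])
strings = np.array(["r", "q", "s"], dtype=object)

# A plain float column survives the round trip.
restored_floats = roundtrip(floats)

# An object column can be saved but not loaded again.
try:
    roundtrip(strings)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Saving succeeds in both cases; only loading the object array fails, which matches the session saving without complaint and breaking on restore.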

Expected behavior Pandas objects created within glue should not break session files.

We could simply pass allow_pickle=True to np.load(), but perhaps this has undesired side effects?
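One side effect worth weighing: the NumPy documentation warns that loading object arrays uses the pickle module, which is not secure against maliciously constructed data, so enabling it would matter for session files from untrusted sources. A possible alternative, sketched below under the assumption that the object column holds only strings, is to convert it to a fixed-width unicode array before saving. The helper name `to_pickle_free` is hypothetical and not part of glue.

```python
import io

import numpy as np


def to_pickle_free(array):
    """Return a unicode array for object arrays that contain only strings.

    Hypothetical helper for illustration. Any other array, including
    object arrays with mixed contents, is returned unchanged.
    """
    if array.dtype == object and all(isinstance(v, str) for v in array.ravel()):
        return array.astype(str)
    return array


strings = np.array(["r", "q", "s"], dtype=object)
converted = to_pickle_free(strings)

# The converted array round-trips with np.load's default allow_pickle=False.
buffer = io.BytesIO()
np.save(buffer, converted)
buffer.seek(0)
restored = np.load(buffer)

print(restored.dtype.kind, restored.tolist())
```

This only covers all-string columns. Object columns with mixed or non-string contents would still need pickling or some other serialization.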


Additional context Sample session file attached: pandas_dataframe_session.glu.gz