prabhant opened this issue 2 years ago
TODOs:
@mitar @PGijsbers @joaquinvanschoren Please list all issues related to sparse datasets here, as well as the IDs of mislabeled datasets.
Somebody should run a script to check whether all datasets with a missing Parquet file are actually marked as sparse.
Not all datasets without Parquet are necessarily sparse (there is also currently an issue with datasets containing datetime info). That said, it shouldn't take more than a small page of code to write a script that checks all datasets against their metadata. If I remember correctly, whether or not a dataset is sparse is stored directly on `OpenMLDataset` objects in openml-python (`.format`). A sketch of such a check is below.
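A minimal sketch of that script. The listing, `get_dataset`, and `.format` lookup are real openml-python API; the `has_parquet` helper is hypothetical and stands in for however we test the MinIO bucket:

```python
import openml

def has_parquet(did: int) -> bool:
    """Hypothetical helper: return True if a Parquet file for dataset
    `did` exists on the MinIO bucket (e.g. via an HTTP HEAD request)."""
    raise NotImplementedError

# Flag every dataset that lacks a Parquet file but is *not* marked sparse.
datasets = openml.datasets.list_datasets(output_format="dataframe")
for did in datasets["did"]:
    if has_parquet(did):
        continue
    # The format ("ARFF" vs "Sparse_ARFF") is stored on the OpenMLDataset object.
    ds = openml.datasets.get_dataset(did, download_data=False)
    if ds.format.lower() != "sparse_arff":
        print(f"Dataset {did}: no Parquet, but format is {ds.format!r}")
```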
Update regarding the issue. I have not been able to find any way to convert a sparse pandas dataframe to Parquet directly; I have asked the Parquet community forums for help. One way this can be done is to convert the dataframe to a dense dataframe (changing the dtype of the arrays) and then save it to a Parquet file. @PGijsbers That would require changes in openml-python to first read the dataframe as a dense frame and then convert the arrays to sparse arrays in pandas (we can use a sparse attribute in the metadata to identify whether the dataset is sparse or not). Do you think that is a reasonable workflow? A sketch of the round trip is below.
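A sketch of that round trip, assuming the metadata tells us the dataset is sparse and that 0 is the right fill value (file name and dtype are illustrative):

```python
import numpy as np
import pandas as pd

# A small sparse dataframe standing in for an OpenML sparse dataset.
df = pd.DataFrame(np.eye(4), columns=list("abcd")).astype(
    pd.SparseDtype("float", fill_value=0.0)
)

# Server side: densify, then write Parquet.
df.sparse.to_dense().to_parquet("dataset.parquet")

# Client side (openml-python): read the dense frame, then re-sparsify,
# using the sparse flag from the dataset metadata to decide.
dense = pd.read_parquet("dataset.parquet")
restored = dense.astype(pd.SparseDtype("float", fill_value=0.0))
assert (restored.sparse.to_dense() == df.sparse.to_dense()).all().all()
```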
Does parquet even support sparse data? Or have you decided on a format which does not?
Parquet supports sparse data.
It seems that since pandas 1.0.0, you need to make a pandas dataframe with columns of type `SparseArray`, and `to_parquet` should then 'just work'?
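A minimal sketch of that suggestion; whether `to_parquet` actually accepts the sparse extension dtype is exactly what needs testing:

```python
import pandas as pd
from scipy import sparse

# Build a dataframe whose columns are backed by SparseArray
# (pd.DataFrame.sparse.from_spmatrix constructs such a frame).
mat = sparse.random(100, 10, density=0.1, format="csr", random_state=0)
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=[f"f{i}" for i in range(10)])

# The open question: does this 'just work', or does the Parquet
# engine reject the SparseDtype columns?
df.to_parquet("sparse_test.parquet")
```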
Test code: https://gist.github.com/prabhant/dfd25b894afbf4d102f7abee23376c41. Please test it out on a few sparse datasets.
@mfeurer @mitar for reference
In the provided example we lose some data compared to the sparse ARFF file itself. The sparse ARFF contains this data:
```
@data
{1 83.683,3 4,4 4,5 0.47,6 5,10 12.8,12 -0.229,13 -0.348,15 1.226,16 63.504,26 -0.264,27 83.683,28 13.894,29 4,30 1.417,31 3.07,33 4.583,34 1,35 16.663,36 67.02,37 16.981,38 0.803,39 1.392,40 82.698,42 1.358,43 3.323,44 0.913,46 1.119,47 6,48 4.953,49 3.016,50 4.199,51 6.421,52 1.206,53 6.716,54 2,56 1.106,58 4.82,59 0.119,60 0.293,61 -0.208,62 0.621,63 68.376,64 -0.247,66 11.935,67 7.62}
{0 CHEMBL1077387,1 83.683,3 4,4 4,6 5,10 19.2,12 -0.221,13 -0.33,15 1.181,16 61.552,26 -0.309,27 83.683,28 6.264,29 2,30 1.256,31 2.918,33 2.153,35 16.663,36 67.02,37 16.586,38 1.093,39 1.062,40 74.095,42 1.107,43 2.741,44 0.906,46 1.095,47 4,48 2.03,49 2.854,50 3.757,51 3.141,52 1.154,53 2.773,54 1,56 0.821,58 4.809,59 -0.09,60 -0.054,61 -0.315,62 0.559,63 55.181,64 -0.165,66 3.975,67 6.886}
```
Neither the old nor the new dataframe contains the molecule id properly. Old:

```
  molecule_id  P_VSA_e_3  C.039  N.075  ...  MATS7i  nCbH  ATSC7m  pXC50
0         0.0  83.682999    0.0    4.0  ...  -0.247   0.0  11.935  7.620
1         1.0  83.682999    0.0    4.0  ...  -0.165   0.0   3.975  6.886
```
New:

```
  molecule_id  P_VSA_e_3  C.039  N.075  ...  MATS7i  nCbH  ATSC7m  pXC50
0         NaN  83.682999    NaN    4.0  ...  -0.247   NaN  11.935  7.620
1         1.0  83.682999    NaN    4.0  ...  -0.165   NaN   3.975  6.886
```
This seems to stem from `encode_nominal=True` when reading the ARFF file. From what I can tell, having a string gives the downstream error `(<class 'TypeError'>, TypeError("no supported conversion for types: (dtype('<U32'),)"), <traceback object at 0x0000017BFDCAD400>)` when calling `.tocsr`.
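For reference, this is presumably the loading path involved (liac-arff's COO return type); with `encode_nominal=False` the string ids land in the COO data array and `.tocsr()` raises the TypeError quoted above:

```python
import arff  # liac-arff
from scipy import sparse

with open("dataset.arff") as fp:
    # encode_nominal=True replaces nominal strings (such as the molecule id)
    # with integer codes -- which is why 'CHEMBL1077387' does not survive.
    decoded = arff.load(fp, encode_nominal=True, return_type=arff.COO)

data, rows, cols = decoded["data"]
mat = sparse.coo_matrix((data, (rows, cols))).tocsr()
```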
I think we should probably use the intermediate output to generate the sparse Parquet file. This will also avoid us accidentally encoding a `0` as `nan` (with the provided code, `0` values may be encoded as `nan`). A small illustration of this pitfall is below.
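An illustration of the pitfall, assuming a sparse-ARFF row `{2 1.5}` over three numeric attributes (absent entries mean 0, not missing):

```python
import numpy as np
import pandas as pd

# Correct dense interpretation of the sparse-ARFF row {2 1.5}: [0, 0, 1.5].
row = np.array([0.0, 0.0, 1.5])

# If the conversion first densifies with NaN as the "absent" marker,
# the zeros are silently turned into missing values:
wrong = pd.arrays.SparseArray([np.nan, np.nan, 1.5], fill_value=np.nan)

# Sparsifying with fill_value=0 keeps zeros and missing values distinct:
right = pd.arrays.SparseArray(row, fill_value=0.0)

print(wrong.to_dense())  # [nan nan 1.5] -- the 0s are lost
print(right.to_dense())  # [0.  0.  1.5]
```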
Here is a list of all datasets which I think are failing because of this issue and do not have parquet files:
This issue tracks the progress of sparse dataset support on the OpenML MinIO backend. Currently, MinIO does not hold the OpenML sparse datasets because pandas can't write sparse dataframes to Parquet by default. Example:
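A minimal reproduction of the failure, as far as I understand it (the exact error message depends on the pyarrow version):

```python
import pandas as pd
from scipy import sparse

df = pd.DataFrame.sparse.from_spmatrix(sparse.eye(3, format="csr"))
# pyarrow rejects pandas' sparse extension dtype, e.g.:
#   TypeError: Sparse pandas data (column 0) not supported.
df.to_parquet("sparse.parquet")
```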