prabhant opened this issue 2 years ago
TODOs:
@mitar @PGijsbers @joaquinvanschoren Please list all issues related to sparse datasets here, as well as the IDs of mislabeled datasets.
Somebody should run a script to check whether all datasets with a missing Parquet file are actually marked as sparse.
Not all datasets without Parquet are necessarily sparse (there is also currently an issue with datasets containing datetime info). That said, it shouldn't take more than a small page of code to write a script that checks all datasets against their metadata. If I remember correctly, whether or not a dataset is sparse is stored directly on `OpenMLDataset` objects in openml-python (`.format`). A sketch of such a check is below.
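A minimal sketch of that script. The listing, `get_dataset`, and `.format` lookup are real openml-python API; the `has_parquet` helper is hypothetical and stands in for however we test the MinIO bucket:

```python
import openml

def has_parquet(did: int) -> bool:
    """Hypothetical helper: return True if a Parquet file for dataset
    `did` exists on the MinIO bucket (e.g. via an HTTP HEAD request)."""
    raise NotImplementedError

# Flag every dataset that lacks a Parquet file but is *not* marked sparse.
datasets = openml.datasets.list_datasets(output_format="dataframe")
for did in datasets["did"]:
    if has_parquet(did):
        continue
    # The format ("ARFF" vs "Sparse_ARFF") is stored on the OpenMLDataset object.
    ds = openml.datasets.get_dataset(did, download_data=False)
    if ds.format.lower() != "sparse_arff":
        print(f"Dataset {did}: no Parquet, but format is {ds.format!r}")
```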
Update regarding the issue. I have not been able to find any way to convert a sparse pandas dataframe to Parquet directly; I have asked the Parquet community forums for help. One way this can be done is to convert the dataframe to a dense dataframe (changing the dtype of the arrays) and then save it to a Parquet file. @PGijsbers That would require changes in openml-python to first read the dataframe as a dense frame and then convert the arrays to sparse arrays in pandas (we can use a sparse attribute in the metadata to identify whether the dataset is sparse or not). Do you think that is a reasonable workflow? A sketch of the round trip is below.
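A sketch of that round trip, assuming the metadata tells us the dataset is sparse and that 0 is the right fill value (file name and dtype are illustrative):

```python
import numpy as np
import pandas as pd

# A small sparse dataframe standing in for an OpenML sparse dataset.
df = pd.DataFrame(np.eye(4), columns=list("abcd")).astype(
    pd.SparseDtype("float", fill_value=0.0)
)

# Server side: densify, then write Parquet.
df.sparse.to_dense().to_parquet("dataset.parquet")

# Client side (openml-python): read the dense frame, then re-sparsify,
# using the sparse flag from the dataset metadata to decide.
dense = pd.read_parquet("dataset.parquet")
restored = dense.astype(pd.SparseDtype("float", fill_value=0.0))
assert (restored.sparse.to_dense() == df.sparse.to_dense()).all().all()
```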
Does parquet even support sparse data? Or have you decided on a format which does not?
Parquet supports sparse data.
It seems that since pandas 1.0.0, you need to make a pandas dataframe with columns of type `SparseArray`, and `to_parquet` should then 'just work'?
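A minimal sketch of that suggestion; whether `to_parquet` actually accepts the sparse extension dtype is exactly what needs testing:

```python
import pandas as pd
from scipy import sparse

# Build a dataframe whose columns are backed by SparseArray
# (pd.DataFrame.sparse.from_spmatrix constructs such a frame).
mat = sparse.random(100, 10, density=0.1, format="csr", random_state=0)
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=[f"f{i}" for i in range(10)])

# The open question: does this 'just work', or does the Parquet
# engine reject the SparseDtype columns?
df.to_parquet("sparse_test.parquet")
```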
Test code: https://gist.github.com/prabhant/dfd25b894afbf4d102f7abee23376c41. Please test it out on a few sparse datasets.
@mfeurer @mitar for reference
In the provided example we lose some data compared to the sparse ARFF file itself. The sparse ARFF contains this data:
```
@data
{1 83.683,3 4,4 4,5 0.47,6 5,10 12.8,12 -0.229,13 -0.348,15 1.226,16 63.504,26 -0.264,27 83.683,28 13.894,29 4,30 1.417,31 3.07,33 4.583,34 1,35 16.663,36 67.02,37 16.981,38 0.803,39 1.392,40 82.698,42 1.358,43 3.323,44 0.913,46 1.119,47 6,48 4.953,49 3.016,50 4.199,51 6.421,52 1.206,53 6.716,54 2,56 1.106,58 4.82,59 0.119,60 0.293,61 -0.208,62 0.621,63 68.376,64 -0.247,66 11.935,67 7.62}
{0 CHEMBL1077387,1 83.683,3 4,4 4,6 5,10 19.2,12 -0.221,13 -0.33,15 1.181,16 61.552,26 -0.309,27 83.683,28 6.264,29 2,30 1.256,31 2.918,33 2.153,35 16.663,36 67.02,37 16.586,38 1.093,39 1.062,40 74.095,42 1.107,43 2.741,44 0.906,46 1.095,47 4,48 2.03,49 2.854,50 3.757,51 3.141,52 1.154,53 2.773,54 1,56 0.821,58 4.809,59 -0.09,60 -0.054,61 -0.315,62 0.559,63 55.181,64 -0.165,66 3.975,67 6.886}
```
Neither the old nor the new dataframe contains the molecule id properly. Old:

```
  molecule_id  P_VSA_e_3  C.039  N.075  ...  MATS7i  nCbH  ATSC7m  pXC50
0         0.0  83.682999    0.0    4.0  ...  -0.247   0.0  11.935  7.620
1         1.0  83.682999    0.0    4.0  ...  -0.165   0.0   3.975  6.886
```
New:

```
  molecule_id  P_VSA_e_3  C.039  N.075  ...  MATS7i  nCbH  ATSC7m  pXC50
0         NaN  83.682999    NaN    4.0  ...  -0.247   NaN  11.935  7.620
1         1.0  83.682999    NaN    4.0  ...  -0.165   NaN   3.975  6.886
```
This seems to stem from `encode_nominal=True` when reading the ARFF file. From what I can tell, having a string gives the downstream error `(<class 'TypeError'>, TypeError("no supported conversion for types: (dtype('<U32'),)"), <traceback object at 0x0000017BFDCAD400>)` when calling `.tocsr`.
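For reference, this is presumably the loading path involved (liac-arff's COO return type); with `encode_nominal=False` the string ids land in the COO data array and `.tocsr()` raises the TypeError quoted above:

```python
import arff  # liac-arff
from scipy import sparse

with open("dataset.arff") as fp:
    # encode_nominal=True replaces nominal strings (such as the molecule id)
    # with integer codes -- which is why 'CHEMBL1077387' does not survive.
    decoded = arff.load(fp, encode_nominal=True, return_type=arff.COO)

data, rows, cols = decoded["data"]
mat = sparse.coo_matrix((data, (rows, cols))).tocsr()
```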
I think we should probably use the intermediate output to generate the sparse Parquet file. This will also avoid us accidentally encoding a `0` as `nan` (with the provided code, `0` values may be encoded as `nan`). A small illustration of this pitfall is below.
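An illustration of the pitfall, assuming a sparse-ARFF row `{2 1.5}` over three numeric attributes (absent entries mean 0, not missing):

```python
import numpy as np
import pandas as pd

# Correct dense interpretation of the sparse-ARFF row {2 1.5}: [0, 0, 1.5].
row = np.array([0.0, 0.0, 1.5])

# If the conversion first densifies with NaN as the "absent" marker,
# the zeros are silently turned into missing values:
wrong = pd.arrays.SparseArray([np.nan, np.nan, 1.5], fill_value=np.nan)

# Sparsifying with fill_value=0 keeps zeros and missing values distinct:
right = pd.arrays.SparseArray(row, fill_value=0.0)

print(wrong.to_dense())  # [nan nan 1.5] -- the 0s are lost
print(right.to_dense())  # [0.  0.  1.5]
```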
Here is a list of all datasets which I think are failing because of this issue and do not have parquet files:
This issue tracks the progress of sparse dataset support on the OpenML MinIO backend. Currently, MinIO does not hold the OpenML sparse datasets because pandas can't write sparse dataframes to Parquet by default. Example:
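A minimal reproduction of the failure, as far as I understand it (the exact error message depends on the pyarrow version):

```python
import pandas as pd
from scipy import sparse

df = pd.DataFrame.sparse.from_spmatrix(sparse.eye(3, format="csr"))
# pyarrow rejects pandas' sparse extension dtype, e.g.:
#   TypeError: Sparse pandas data (column 0) not supported.
df.to_parquet("sparse.parquet")
```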