Open mitar opened 3 years ago
Oh, this get_csv returns a very simple transformation of the ARFF file. This is why the resulting file is often not a valid CSV file. That also explains #1084.
This makes get_csv more or less useless. One would really hope you could just use API calls to interact with OpenML, but it seems bindings like the Python package are really required to make use of this mess (and even then it does not support datasets with dates). I really think all of that logic should be put into one place, the API (which can cache heavily), because then it would work across all the languages that use the data. Right now every language implements its own post-processing after download, and that post-processing becomes part of the whole reproducibility pipeline.
Indeed, this get_csv call was meant as a convenience function rather than a production-ready feature.
We're currently moving to a new data infrastructure, based on S3 buckets, in which datasets would be available in parquet and (where possible) CSV. ARFF will then be phased out. It will then be possible to use more complex transformations in the backend and have automated quality checks.
If you look at the CSV file of this dataset, you can see that it has an inconsistent CSV dialect: it uses both " and ' as quote characters, and ' is escaped as \'.
Quote:
There is no way to configure pandas.read_csv to read this file correctly. You can use … but then you get " in column names. Also, it is unclear how " is escaped in column names (as "" or as \").
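To make the dialect problem concrete, here is a minimal sketch that uses Python's standard-library csv module (swapped in for pandas to avoid dependencies). The sample text is hypothetical and only mimics the dialect described above; it is not taken from the actual dataset. The helper read_mixed_dialect is also made up for illustration. It shows that parsing with ' as the quote character leaves " in the column names, and one possible post-processing step, which assumes column names contain no escaped " characters.

```python
import csv
import io

# Hypothetical sample mimicking the described dialect (not the real dataset):
# header fields are quoted with ", data fields with ', and ' is escaped as \'.
SAMPLE = '"name","note"\n' "'O\\'Brien','a, b'\n"


def read_mixed_dialect(text):
    """Parse text whose data rows use single quotes, then clean the header."""
    reader = csv.reader(io.StringIO(text), quotechar="'", escapechar="\\")
    rows = list(reader)
    raw_header, data = rows[0], rows[1:]
    # Assumption: header names contain no escaped " characters,
    # so stripping one surrounding pair of " is sufficient.
    header = [
        h[1:-1] if len(h) >= 2 and h[0] == h[-1] == '"' else h
        for h in raw_header
    ]
    return raw_header, header, data


raw_header, header, data = read_mixed_dialect(SAMPLE)
print(raw_header)  # column names still carry the " characters
print(header)      # column names after stripping the surrounding "
print(data)        # data rows with \' unescaped to '
```

This kind of cleanup only works as long as the header escaping question raised above does not matter. If a column name itself contains ", the correct result depends on whether the file uses "" or \" for escaping, which the sketch does not try to guess.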