Inconsistent CSV dialect

openml / OpenML

Open Machine Learning

https://openml.org

BSD 3-Clause "New" or "Revised" License


Inconsistent CSV dialect #1129

Open mitar opened 3 years ago

mitar commented 3 years ago

If you look at the CSV file of this dataset, you can see that it uses an inconsistent CSV dialect:

column names are quoted with "
attribute values are quoted with ' and ' is escaped as \'

Quote:

"history","session","user"
0000,'whoami pwd ls dir vi source <1> source <1> exit',USER0
0001,'whereis <1> mkdir <1> vi <1> vi <1> ls source <1>',USER0

There is no way to configure pandas.read_csv to read this file correctly. You can use

            quotechar='\'',
            doublequote=False,
            escapechar='\\',

but then the column names keep their literal " characters. It is also unclear how " is escaped inside column names (as "" or as \").
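One possible client-side workaround, given as a minimal sketch rather than a fix for get_csv itself: parse the header line separately with the " dialect, then hand the remaining rows to pandas with the ' dialect. The helper name read_mixed_dialect_csv is hypothetical, and the sketch assumes that the header fits on the first line and that " inside a column name is doubled as "" (which, as noted above, is not actually known).

```python
import csv
import io

import pandas as pd

# The sample rows quoted in this issue.
SAMPLE = (
    '"history","session","user"\n'
    "0000,'whoami pwd ls dir vi source <1> source <1> exit',USER0\n"
    "0001,'whereis <1> mkdir <1> vi <1> vi <1> ls source <1>',USER0\n"
)


def read_mixed_dialect_csv(text):
    """Hypothetical helper: header quoted with ", values quoted with '."""
    # Assumption: the header is exactly the first line (no embedded newlines).
    header_line, _, body = text.partition("\n")

    # Assumption: column names follow the default csv dialect ("" escapes ").
    names = next(csv.reader([header_line], quotechar='"'))

    # The data rows use ' as the quote character and \' as the escape.
    return pd.read_csv(
        io.StringIO(body),
        header=None,
        names=names,
        quotechar="'",
        doublequote=False,
        escapechar="\\",
        dtype=str,  # keep values such as 0000 as text instead of integers
    )


df = read_mixed_dialect_csv(SAMPLE)
print(df.columns.tolist())
```

Reading everything as text with dtype=str is a deliberate choice here: without it, pandas would infer the history column as integers and drop the leading zeros.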

mitar commented 3 years ago

Oh, this get_csv returns a very simple transformation of the ARFF file. This is why the resulting file is often not a valid CSV file. That also explains #1084.

mitar commented 3 years ago

This makes get_csv more or less useless. One would really hope you could just use API calls to interact with OpenML, but it seems bindings like the Python package are really required to make use of this mess (and even then, it does not support datasets with dates). I really think all of that logic should be put into one place, the API (which can cache heavily), because then it would work the same across all the languages using the data. As it stands, every language implements its own post-processing after download, and that post-processing becomes part of the whole reproducibility pipeline.

joaquinvanschoren commented 3 years ago

Indeed, this get_csv call was meant as a convenience function rather than a production-ready feature.

We're currently moving to a new data infrastructure, based on S3 buckets, in which datasets would be available in parquet and (where possible) CSV. ARFF will then be phased out. It will then be possible to use more complex transformations in the backend and have automated quality checks.