openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

ValueError: Categorical data needs to be numeric when using sparse ARFF #956

Closed pseudotensor closed 5 years ago

pseudotensor commented 5 years ago
    dataset = oml.datasets.get_dataset(dataset_id=dataset_id)
  File "/opt/h2oai/dai/python/lib/python3.6/site-packages/openml/datasets/functions.py", line 394, in get_dataset
    description, features, qualities, arff_file
  File "/opt/h2oai/dai/python/lib/python3.6/site-packages/openml/datasets/functions.py", line 882, in _create_dataset_from_description
    qualities=qualities,
  File "/opt/h2oai/dai/python/lib/python3.6/site-packages/openml/datasets/dataset.py", line 172, in __init__
    self.data_pickle_file = self._data_arff_to_pickle(data_file)
  File "/opt/h2oai/dai/python/lib/python3.6/site-packages/openml/datasets/dataset.py", line 217, in _data_arff_to_pickle
    "Categorical data needs to be numeric when "
ValueError: Categorical data needs to be numeric when using sparse ARFF.

Only getting this inside centos7 docker environment

E.g. problem with dataset_id = 3873

pseudotensor commented 5 years ago

Here is pip freeze:

absl-py==0.7.1
alabaster==0.7.11
aniso8601==6.0.0
apipkg==1.4
appdirs==1.4.2
asn1crypto==0.24.0
astor==0.7.1
astroid==1.6.5
atomicwrites==1.3.0
attrs==19.1.0
autocorrect==0.3.0
awscli==1.15.49
azure-common==1.1.18
azure-nspkg==2.0.0
azure-storage==0.34.2
Babel==2.6.0
backcall==0.1.0
bcrypt==3.1.1
beautifulsoup4==4.7.1
benford-py==0.1.0.3
bleach==2.1.3
blessed==1.15.0
boto==2.49.0
boto3==1.5.9
botocore==1.8.23
bs4==0.0.1
bz2file==0.98
cachetools==2.1.0
certifi==2019.3.9
cffi==1.11.5
chardet==3.0.4
Click==7.0
cloudpickle==0.8.1
colorama==0.3.9
coverage==4.5.1
cryptography==2.2.2
cycler==0.10.0
Cython==0.28.3
datatable==0.8.0.dev14
debtcollector==1.21.0
decorator==4.3.0
dill==0.2.8.2
docutils==0.14
docxtpl==0.5.1
EasyProcess==0.2.5
entrypoints==0.2.3
enum34==1.1.6
execnet==1.5.0
fasteners==0.14.1
feather-format==0.4.0
filelock==3.0.4
Flask==1.0.2
Flask-RESTful==0.3.6
future==0.16.0
gast==0.2.2
gensim==3.3.0
google-api-core==1.3.0
google-auth==1.5.1
google-cloud-bigquery==1.5.0
google-cloud-core==0.28.0
google-cloud-storage==1.10.0
google-resumable-media==0.3.1
googleapis-common-protos==1.5.9
grpcio==1.19.0
h2o==3.22.0.2
h2o4gpu==0.3.1.10000
h2oai==1.6.2
h2oai-client==1.6.2
h2oaicore==1.6.2
h5py==2.7.1
hanging-threads==2.0.3
hashids==1.2.0
holidays==0.9.5
html5lib==1.0.1
idna==2.7
ijson==2.3
imagesize==1.0.0
ipykernel==4.8.2
ipython==6.5.0
ipython-genutils==0.2.0
ipywidgets==7.4.2
iso8601==0.1.12
isort==4.3.4
itsdangerous==1.1.0
javabridge==1.0.18
jedi==0.12.1
Jinja2==2.10
jmespath==0.9.3
jsonpickle==1.0
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.3
jupyter-console==6.0.0
jupyter-core==4.4.0
Keras-Applications==1.0.7
Keras-Preprocessing==1.0.9
kiwisolver==1.0.1
lazy-object-proxy==1.3.1
ldap3==2.4.1
liac-arff==2.4.0
llvmlite==0.23.0
loky==2.4.0
lxml==4.3.3
Markdown==3.1
MarkupSafe==1.0
matplotlib==2.2.2
mccabe==0.6.1
memory-profiler==0.47
minio==4.0.5
mistune==0.8.3
mli==0.1.0+master.136
model2capnp==0.13.14+master.331
model2proto==2.18.5+master.331
monotonic==1.5
more-itertools==7.0.0
msgpack==0.5.6
multiprocess==0.70.5
mypy==0.501
natsort==5.1.1
nbconvert==5.3.1
nbformat==4.4.0
netaddr==0.7.19
netifaces==0.10.9
nltk==3.2.5
notebook==5.5.0
numba==0.35.0
numpy==1.15.1
openml==0.8.0
oslo.concurrency==3.29.1
oslo.config==6.8.1
oslo.i18n==3.23.1
oslo.utils==3.40.3
packaging==16.8
pandas==0.24.1
pandas-ml==0.7.0.dev0
pandocfilters==1.4.2
parso==0.3.1
patsy==0.5.0
pbr==5.1.3
pexpect==4.6.0
pickleshare==0.7.4
Pillow==5.0.0
pluggy==0.9.0
prompt-toolkit==1.0.15
protobuf==3.6.1
psutil==5.4.5
ptyprocess==0.6.0
py==1.5.2
pyarrow==0.9.0
pyasn1==0.4.4
pyasn1-modules==0.2.4
pycparser==2.18
pycrypto==2.6.1
pycryptodome==3.6.6
Pygments==2.2.0
PyJWT==1.7.1
pylint==1.8.4
pyOpenSSL==17.5.0
pyparsing==2.1.10
pyramid-arima==0.6.5
pytest==3.10.1
pytest-cov==2.5.1
pytest-forked==0.2
pytest-instafail==0.4.0
pytest-repeat==0.7.0
pytest-timeout==1.2.1
pytest-tldr==0.1.5
pytest-xdist==1.22.2
python-dateutil==2.7.2
python-docx==0.8.7
python-magic==0.4.15
python-pam==1.8.2
python-terraform==0.10.0
python-weka-wrapper3==0.1.5
pytz==2018.4
PyVirtualDisplay==0.2.1
PyYAML==3.12
pyzmq==17.1.0
qtconsole==4.4.3
redis==2.10.6
requests==2.20.0
rfc3986==1.2.0
rfpimp==1.3
rsa==3.4.2
s3transfer==0.1.13
scikit-learn==0.19.1
scipy==1.1.0
scoring-h2oai-experiment-ditigalu==1.0.0
seaborn==0.7.1
selenium==3.9.0
Send2Trash==1.5.0
setproctitle==1.1.10
sharedmem==0.3.5
simplegeneric==0.8.1
six==1.11.0
sklearn==0.0
smart-open==1.6.0
snowballstemmer==1.2.1
snowflake-connector-python==1.5.6
soupsieve==1.9.1
Sphinx==1.7.5
sphinx-rtd-theme==0.4.0
sphinxcontrib-websupport==1.1.0
statsmodels==0.9.0
stevedore==1.30.1
tabulate==0.8.2
tensorboard==1.11.0
tensorflow==1.11.0
tensorflow-gpu==1.11.0
termcolor==1.1.0
terminado==0.8.1
testpath==0.3.1
thrift==0.11.0
toml==0.9.4
tornado==5.1.1
traitlets==4.3.2
treelite==0.32
typed-ast==1.0.4
typesentry==0.2.6
urllib3==1.23
virtualenv==15.1.0
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.15.2
widgetsnbextension==3.4.2
wrapt==1.10.11
xlrd==1.0.0
xmltodict==0.12.0
yapf==0.17.0

Anything odd in there that would cause this?
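One way to spot the culprit is to diff the container's `pip freeze` against the working host's. A minimal sketch (the sample freeze strings and the package versions in them are illustrative, not taken from a real comparison):

```python
def freeze_to_pins(text):
    """Parse `pip freeze` output into a {package: version} dict."""
    pins = {}
    for line in text.splitlines():
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins

def diff_freezes(a, b):
    """Return packages whose pinned versions differ between two freezes.

    Missing packages show up as None on the side that lacks them.
    """
    da, db = freeze_to_pins(a), freeze_to_pins(b)
    return {name: (da.get(name), db.get(name))
            for name in sorted(set(da) | set(db))
            if da.get(name) != db.get(name)}

docker_freeze = "openml==0.8.0\nnumpy==1.15.1\nliac-arff==2.4.0"
host_freeze = "openml==0.8.0\nnumpy==1.16.0"
print(diff_freezes(docker_freeze, host_freeze))
# -> {'liac-arff': ('2.4.0', None), 'numpy': ('1.15.1', '1.16.0')}
```

If both freezes are identical, the difference is likely outside pip (e.g. a locally installed development checkout shadowing the pinned package).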

pseudotensor commented 5 years ago

Googling doesn't help. The error might come from sklearn, but it's unclear why it happens. I understand that if the ARFF were somehow forced to be sparse numeric, then the data would have to be numeric; but why does openml say it's a sparse ARFF when there is string/text data?

pseudotensor commented 5 years ago

Here's what I'm running:

>>> import openml as oml
>>> apikey = <my key>
>>> oml.config.apikey = apikey
>>> openml_datadir = "./openml_data"
>>> import os
>>> os.makedirs(openml_datadir, exist_ok=True)
>>> openml_datadir_cache = openml_datadir + "/cache"
>>> os.makedirs(openml_datadir_cache, exist_ok=True)
>>> oml.config.set_cache_directory(openml_datadir_cache)
>>> dataset_id = 3873
>>> dataset = oml.datasets.get_dataset(dataset_id=dataset_id)
Traceback (most recent call last):
  File "/opt/h2oai/dai/python/lib/python3.6/site-packages/openml/datasets/dataset.py", line 214, in _data_arff_to_pickle
    np.array(type_, dtype=np.float32)
ValueError: could not convert string to float: 'CHEMBL1079898'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/h2oai/dai/python/lib/python3.6/site-packages/openml/datasets/functions.py", line 394, in get_dataset
    description, features, qualities, arff_file
  File "/opt/h2oai/dai/python/lib/python3.6/site-packages/openml/datasets/functions.py", line 882, in _create_dataset_from_description
    qualities=qualities,
  File "/opt/h2oai/dai/python/lib/python3.6/site-packages/openml/datasets/dataset.py", line 172, in __init__
    self.data_pickle_file = self._data_arff_to_pickle(data_file)
  File "/opt/h2oai/dai/python/lib/python3.6/site-packages/openml/datasets/dataset.py", line 217, in _data_arff_to_pickle
    "Categorical data needs to be numeric when "
ValueError: Categorical data needs to be numeric when using sparse ARFF.
>>> 
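For reference, the inner exception is just NumPy refusing the string-to-float cast that the sparse-ARFF code path forces (the `np.array(type_, dtype=np.float32)` call in `dataset.py` line 214 of this build); it can be reproduced in isolation:

```python
import numpy as np

# Casting a non-numeric string to float32 raises the same ValueError
# seen in the inner traceback above.
try:
    np.array(['CHEMBL1079898'], dtype=np.float32)
except ValueError as e:
    print(e)  # -> could not convert string to float: 'CHEMBL1079898'
```

So the real question is why this code path (which assumes all-numeric data) is taken for this dataset inside the container but not outside it.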
pseudotensor commented 5 years ago

A fresh environment with the same Python version (3.6.3) and those exact requirements on Ubuntu 16.04 doesn't have this issue.

pseudotensor commented 5 years ago

Getting only the metadata does work:

>>> oml.datasets.get_dataset(dataset_id=dataset_id, download_data=False)
<openml.datasets.dataset.OpenMLDataset object at 0x7f6ad97e3470>
>>> 
pseudotensor commented 5 years ago

But it's odd that the API shows this inside the docker:

Help on function get_dataset in module openml.datasets.functions:

get_dataset(dataset_id:Union[int, str], download_data:bool=True) -> openml.datasets.dataset.OpenMLDataset
    Download the OpenML dataset representation, optionally also download actual data file.

    This function is thread/multiprocessing safe.
    This function uses caching. A check will be performed to determine if the information has
    previously been downloaded, and if so be loaded from disk instead of retrieved from the server.

    Parameters
    ----------
    dataset_id : int or str
        Dataset ID of the dataset to download
    download_data : bool, optional (default=True)
        If True, also download the data file. Beware that some datasets are large and it might
        make the operation noticeably slower. Metadata is also still retrieved.
        If False, create the OpenMLDataset and only populate it with the metadata.
        The data may later be retrieved through the `OpenMLDataset.get_data` method.

    Returns
    -------
    dataset : :class:`openml.OpenMLDataset`
        The downloaded dataset.

and this outside:

Help on function get_dataset in module openml.datasets.functions:

get_dataset(dataset_id)
    Download a dataset.

    TODO: explain caching!

    This function is thread/multiprocessing safe.

    Parameters
    ----------
    dataset_id : int
        Dataset ID of the dataset to download

    Returns
    -------
    dataset : :class:`openml.OpenMLDataset`
        The downloaded dataset.

Both show an `oml.__version__` of 0.8.0, and both show a `dataset.format` of 'Sparse_ARFF'.

pseudotensor commented 5 years ago

Uninstalling the development version and using the PyPI package worked.