Closed by pseudotensor 5 years ago.
Here is the output of pip freeze:
absl-py==0.7.1
alabaster==0.7.11
aniso8601==6.0.0
apipkg==1.4
appdirs==1.4.2
asn1crypto==0.24.0
astor==0.7.1
astroid==1.6.5
atomicwrites==1.3.0
attrs==19.1.0
autocorrect==0.3.0
awscli==1.15.49
azure-common==1.1.18
azure-nspkg==2.0.0
azure-storage==0.34.2
Babel==2.6.0
backcall==0.1.0
bcrypt==3.1.1
beautifulsoup4==4.7.1
benford-py==0.1.0.3
bleach==2.1.3
blessed==1.15.0
boto==2.49.0
boto3==1.5.9
botocore==1.8.23
bs4==0.0.1
bz2file==0.98
cachetools==2.1.0
certifi==2019.3.9
cffi==1.11.5
chardet==3.0.4
Click==7.0
cloudpickle==0.8.1
colorama==0.3.9
coverage==4.5.1
cryptography==2.2.2
cycler==0.10.0
Cython==0.28.3
datatable==0.8.0.dev14
debtcollector==1.21.0
decorator==4.3.0
dill==0.2.8.2
docutils==0.14
docxtpl==0.5.1
EasyProcess==0.2.5
entrypoints==0.2.3
enum34==1.1.6
execnet==1.5.0
fasteners==0.14.1
feather-format==0.4.0
filelock==3.0.4
Flask==1.0.2
Flask-RESTful==0.3.6
future==0.16.0
gast==0.2.2
gensim==3.3.0
google-api-core==1.3.0
google-auth==1.5.1
google-cloud-bigquery==1.5.0
google-cloud-core==0.28.0
google-cloud-storage==1.10.0
google-resumable-media==0.3.1
googleapis-common-protos==1.5.9
grpcio==1.19.0
h2o==3.22.0.2
h2o4gpu==0.3.1.10000
h2oai==1.6.2
h2oai-client==1.6.2
h2oaicore==1.6.2
h5py==2.7.1
hanging-threads==2.0.3
hashids==1.2.0
holidays==0.9.5
html5lib==1.0.1
idna==2.7
ijson==2.3
imagesize==1.0.0
ipykernel==4.8.2
ipython==6.5.0
ipython-genutils==0.2.0
ipywidgets==7.4.2
iso8601==0.1.12
isort==4.3.4
itsdangerous==1.1.0
javabridge==1.0.18
jedi==0.12.1
Jinja2==2.10
jmespath==0.9.3
jsonpickle==1.0
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.3
jupyter-console==6.0.0
jupyter-core==4.4.0
Keras-Applications==1.0.7
Keras-Preprocessing==1.0.9
kiwisolver==1.0.1
lazy-object-proxy==1.3.1
ldap3==2.4.1
liac-arff==2.4.0
llvmlite==0.23.0
loky==2.4.0
lxml==4.3.3
Markdown==3.1
MarkupSafe==1.0
matplotlib==2.2.2
mccabe==0.6.1
memory-profiler==0.47
minio==4.0.5
mistune==0.8.3
mli==0.1.0+master.136
model2capnp==0.13.14+master.331
model2proto==2.18.5+master.331
monotonic==1.5
more-itertools==7.0.0
msgpack==0.5.6
multiprocess==0.70.5
mypy==0.501
natsort==5.1.1
nbconvert==5.3.1
nbformat==4.4.0
netaddr==0.7.19
netifaces==0.10.9
nltk==3.2.5
notebook==5.5.0
numba==0.35.0
numpy==1.15.1
openml==0.8.0
oslo.concurrency==3.29.1
oslo.config==6.8.1
oslo.i18n==3.23.1
oslo.utils==3.40.3
packaging==16.8
pandas==0.24.1
pandas-ml==0.7.0.dev0
pandocfilters==1.4.2
parso==0.3.1
patsy==0.5.0
pbr==5.1.3
pexpect==4.6.0
pickleshare==0.7.4
Pillow==5.0.0
pluggy==0.9.0
prompt-toolkit==1.0.15
protobuf==3.6.1
psutil==5.4.5
ptyprocess==0.6.0
py==1.5.2
pyarrow==0.9.0
pyasn1==0.4.4
pyasn1-modules==0.2.4
pycparser==2.18
pycrypto==2.6.1
pycryptodome==3.6.6
Pygments==2.2.0
PyJWT==1.7.1
pylint==1.8.4
pyOpenSSL==17.5.0
pyparsing==2.1.10
pyramid-arima==0.6.5
pytest==3.10.1
pytest-cov==2.5.1
pytest-forked==0.2
pytest-instafail==0.4.0
pytest-repeat==0.7.0
pytest-timeout==1.2.1
pytest-tldr==0.1.5
pytest-xdist==1.22.2
python-dateutil==2.7.2
python-docx==0.8.7
python-magic==0.4.15
python-pam==1.8.2
python-terraform==0.10.0
python-weka-wrapper3==0.1.5
pytz==2018.4
PyVirtualDisplay==0.2.1
PyYAML==3.12
pyzmq==17.1.0
qtconsole==4.4.3
redis==2.10.6
requests==2.20.0
rfc3986==1.2.0
rfpimp==1.3
rsa==3.4.2
s3transfer==0.1.13
scikit-learn==0.19.1
scipy==1.1.0
scoring-h2oai-experiment-ditigalu==1.0.0
seaborn==0.7.1
selenium==3.9.0
Send2Trash==1.5.0
setproctitle==1.1.10
sharedmem==0.3.5
simplegeneric==0.8.1
six==1.11.0
sklearn==0.0
smart-open==1.6.0
snowballstemmer==1.2.1
snowflake-connector-python==1.5.6
soupsieve==1.9.1
Sphinx==1.7.5
sphinx-rtd-theme==0.4.0
sphinxcontrib-websupport==1.1.0
statsmodels==0.9.0
stevedore==1.30.1
tabulate==0.8.2
tensorboard==1.11.0
tensorflow==1.11.0
tensorflow-gpu==1.11.0
termcolor==1.1.0
terminado==0.8.1
testpath==0.3.1
thrift==0.11.0
toml==0.9.4
tornado==5.1.1
traitlets==4.3.2
treelite==0.32
typed-ast==1.0.4
typesentry==0.2.6
urllib3==1.23
virtualenv==15.1.0
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.15.2
widgetsnbextension==3.4.2
wrapt==1.10.11
xlrd==1.0.0
xmltodict==0.12.0
yapf==0.17.0
Is there anything odd here that would cause this?
Googling doesn't help. The error seems to possibly come from sklearn, but it's unclear why it happens. I understand that if the ARFF were somehow forced to be sparse numeric, then the data would have to be numeric, but why is openml saying it's sparse ARFF when the dataset contains string/text data?
Here's what I'm running:
>>> import openml as oml
>>> apikey = <my key>
>>> oml.config.apikey = apikey
>>> openml_datadir = "./openml_data"
>>> import os
>>> os.makedirs(openml_datadir, exist_ok=True)
>>> openml_datadir_cache = openml_datadir + "/cache"
>>> os.makedirs(openml_datadir_cache, exist_ok=True)
>>> oml.config.set_cache_directory(openml_datadir_cache)
>>> dataset_id = 3873
>>> dataset = oml.datasets.get_dataset(dataset_id=dataset_id)
Traceback (most recent call last):
  File "/opt/h2oai/dai/python/lib/python3.6/site-packages/openml/datasets/dataset.py", line 214, in _data_arff_to_pickle
    np.array(type_, dtype=np.float32)
ValueError: could not convert string to float: 'CHEMBL1079898'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/h2oai/dai/python/lib/python3.6/site-packages/openml/datasets/functions.py", line 394, in get_dataset
    description, features, qualities, arff_file
  File "/opt/h2oai/dai/python/lib/python3.6/site-packages/openml/datasets/functions.py", line 882, in _create_dataset_from_description
    qualities=qualities,
  File "/opt/h2oai/dai/python/lib/python3.6/site-packages/openml/datasets/dataset.py", line 172, in __init__
    self.data_pickle_file = self._data_arff_to_pickle(data_file)
  File "/opt/h2oai/dai/python/lib/python3.6/site-packages/openml/datasets/dataset.py", line 217, in _data_arff_to_pickle
    "Categorical data needs to be numeric when "
ValueError: Categorical data needs to be numeric when using sparse ARFF.
>>>
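The inner exception is easy to reproduce on its own: the sparse-ARFF path coerces a column with np.array(..., dtype=np.float32), which fails as soon as it hits a string identifier. A minimal sketch of the failing conversion (the CHEMBL value is taken from the traceback above):

```python
import numpy as np

# openml 0.8.0's sparse-ARFF conversion coerces values to float32; string
# identifiers like the CHEMBL ids in this dataset cannot be converted.
values = ["CHEMBL1079898", "4.5"]
try:
    np.array(values, dtype=np.float32)
    converted = True
    message = ""
except ValueError as exc:
    converted = False
    message = str(exc)  # the message names the offending string
```

This is exactly the inner ValueError in the traceback; the outer "Categorical data needs to be numeric when using sparse ARFF" error is just openml re-raising after this conversion fails.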
A fresh environment with the same Python version (3.6.3) and those exact requirements on Ubuntu 16.04 doesn't have any issue.
Getting only the metadata does work:
>>> oml.datasets.get_dataset(dataset_id=dataset_id, download_data=False)
<openml.datasets.dataset.OpenMLDataset object at 0x7f6ad97e3470>
>>>
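Since the metadata call succeeds, the declared format can be checked before requesting the data, and the dense/sparse decision can be verified up front. A minimal sketch with a hypothetical stand-in object (the real check would use the OpenMLDataset returned by get_dataset(..., download_data=False); DatasetStub below is not part of openml):

```python
class DatasetStub:
    """Hypothetical stand-in for the OpenMLDataset metadata object."""
    def __init__(self, fmt):
        self.format = fmt

# dataset.format for id 3873 reports 'Sparse_ARFF' in both environments.
dataset = DatasetStub("Sparse_ARFF")

# openml 0.8.0's sparse path requires all-numeric data, so a dataset declared
# Sparse_ARFF that contains string columns (like the CHEMBL ids) will fail;
# checking the declared format first at least flags that case early.
needs_numeric = dataset.format.lower() == "sparse_arff"
```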
But it's odd that the API help shows this inside the docker container:
Help on function get_dataset in module openml.datasets.functions:
get_dataset(dataset_id:Union[int, str], download_data:bool=True) -> openml.datasets.dataset.OpenMLDataset
Download the OpenML dataset representation, optionally also download actual data file.
This function is thread/multiprocessing safe.
This function uses caching. A check will be performed to determine if the information has
previously been downloaded, and if so be loaded from disk instead of retrieved from the server.
Parameters
----------
dataset_id : int or str
Dataset ID of the dataset to download
download_data : bool, optional (default=True)
If True, also download the data file. Beware that some datasets are large and it might
make the operation noticeably slower. Metadata is also still retrieved.
If False, create the OpenMLDataset and only populate it with the metadata.
The data may later be retrieved through the `OpenMLDataset.get_data` method.
Returns
-------
dataset : :class:`openml.OpenMLDataset`
The downloaded dataset.
(END)
and this outside the container:
Help on function get_dataset in module openml.datasets.functions:
get_dataset(dataset_id)
Download a dataset.
TODO: explain caching!
This function is thread/multiprocessing safe.
Parameters
----------
dataset_id : int
Dataset ID of the dataset to download
Returns
-------
dataset : :class:`openml.OpenMLDataset`
The downloaded dataset.
(END)
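The two different help texts strongly suggest two different openml installations on the import path: the docker one has the download_data parameter, the outside one does not, even though both report 0.8.0. One way to check which installation is actually being imported and whether the keyword exists is inspect.signature plus the module's __file__. Sketched here against a local stand-in function (the same calls would apply to the real openml.datasets.get_dataset; against the real module you would also print openml.__file__):

```python
import inspect

# Hypothetical stand-in mirroring the newer signature seen inside docker.
def get_dataset(dataset_id, download_data=True):
    """Stand-in for openml.datasets.get_dataset."""

sig = inspect.signature(get_dataset)
has_download_flag = "download_data" in sig.parameters  # True for the docker version
```

If openml.__file__ pointed at a development checkout rather than site-packages, that would explain why both installations report version 0.8.0 yet expose different signatures.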
Both show an oml version of 0.8.0, and both show dataset.format of 'Sparse_ARFF'.
Uninstalling the development version and using the PyPI package worked.
I'm only getting this inside the CentOS 7 docker environment, e.g. the problem occurs with dataset_id = 3873.