sebffischer opened 2 years ago
This is just a warning. The test server is supposed to show these. The production server doesn't. Did you actually have a problem uploading the dataset?
Thanks for the clarification! However, this does not work for me:
```python
from openml.datasets import create_dataset
import sklearn
import numpy as np
from sklearn import datasets
import openml

openml.config.apikey = "API_TEST_KEY"
openml.config.server = "https://test.openml.org/api/v1"

diabetes = sklearn.datasets.load_diabetes()
name = "Diabetes(scikit-learn)"
X = diabetes.data
y = diabetes.target
attribute_names = diabetes.feature_names
description = diabetes.DESCR

data = np.concatenate((X, y.reshape((-1, 1))), axis=1)
attribute_names = list(attribute_names)
attributes = [(attribute_name, "REAL") for attribute_name in attribute_names] + [
    ("class", "INTEGER")
]
citation = (
    "Bradley Efron, Trevor Hastie, Iain Johnstone and "
    "Robert Tibshirani (2004) (Least Angle Regression) "
    "Annals of Statistics (with discussion), 407-499"
)
paper_url = "https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf"

diabetes_dataset = create_dataset(
    # The name of the dataset (needs to be unique).
    # Must not be longer than 128 characters and only contain
    # a-z, A-Z, 0-9 and the following special characters: _\-\.(),
    name=name,
    # Textual description of the dataset.
    description=description,
    # The person who created the dataset.
    creator="Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani",
    # People who contributed to the current version of the dataset.
    contributor=None,
    # The date the data was originally collected, given by the uploader.
    collection_date="09-01-2012",
    # Language in which the data is represented.
    # Starts with 1 upper case letter, rest lower case, e.g. 'English'.
    language="English",
    # License under which the data is/will be distributed.
    licence="BSD (from scikit-learn)",
    # Name of the target. Can also have multiple values (comma-separated).
    default_target_attribute="class",
    # The attribute that represents the row-id column, if present in the
    # dataset.
    row_id_attribute=None,
    # Attribute or list of attributes that should be excluded in modelling, such as
    # identifiers and indexes. E.g. "feat1" or ["feat1","feat2"]
    ignore_attribute=None,
    # How to cite the paper.
    citation=citation,
    # Attributes of the data
    attributes=attributes,
    data=data,
    # A version label which is provided by the user.
    version_label="test",
    original_data_url="https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html",
    paper_url=paper_url,
)

diabetes_dataset.publish()
print(f"URL for dataset: {diabetes_dataset.openml_url}")
```
This gives me:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sebi/.local/lib/python3.8/site-packages/openml/base.py", line 133, in publish
    xml_response = xmltodict.parse(response_text)
  File "/home/sebi/.local/lib/python3.8/site-packages/xmltodict.py", line 327, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: junk after document element: line 70, column 0
```
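For what it's worth, this particular ExpatError usually means the response body contained a well-formed XML document followed by extra output (for example, an HTML warning the server printed after the XML), rather than the XML itself being malformed. A minimal stdlib reproduction of the parser behaviour (the XML snippet below is made up for illustration):

```python
import xml.etree.ElementTree as ET

# A well-formed XML document parses fine on its own.
good = "<response><id>1</id></response>"
ET.fromstring(good)

# The same document with trailing content (e.g. an HTML warning the
# server emitted after the XML) triggers "junk after document element".
junk = good + "<div>some server-side warning</div>"
try:
    ET.fromstring(junk)
except ET.ParseError as e:
    print(e)  # junk after document element: line 1, column ...
```

So the error points at the server appending warnings to an otherwise valid response, which would match the warnings mentioned earlier in this thread.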
I see a similar error with this:
```python
import openml
from sklearn import compose, ensemble, impute, neighbors, preprocessing, pipeline, tree

openml.config.apikey = "TEST_KEY"
openml.config.server = "https://test.openml.org/api/v1"

# NOTE: We are using dataset 68 from the test server: https://test.openml.org/d/68
dataset = openml.datasets.get_dataset(68)
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format="array", target=dataset.default_target_attribute
)
clf = neighbors.KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)

dataset = openml.datasets.get_dataset(17)
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format="array", target=dataset.default_target_attribute
)
print(f"Categorical features: {categorical_indicator}")
transformer = compose.ColumnTransformer(
    [("one_hot_encoder", preprocessing.OneHotEncoder(categories="auto"), categorical_indicator)]
)
X = transformer.fit_transform(X)
clf.fit(X, y)

# Get a task
task = openml.tasks.get_task(403)
# Build any classifier or pipeline
clf = tree.DecisionTreeClassifier()
# Run the flow
run = openml.runs.run_model_on_task(clf, task)
print(run)
myrun = run.publish()
# For this tutorial, our configuration publishes to the test server
# as to not pollute the main server.
print(f"Uploaded to {myrun.openml_url}")
```
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sebi/.local/lib/python3.8/site-packages/openml/base.py", line 133, in publish
    xml_response = xmltodict.parse(response_text)
  File "/home/sebi/.local/lib/python3.8/site-packages/xmltodict.py", line 327, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: junk after document element: line 61, column 6
```
Hi Seb,
I recently fixed a number of issues with the test server. Can you please check if this issue is now resolved?
Thanks!
It seems that listing datasets from the test server does not work. I have not yet checked the other operations (e.g. uploading) but will report back when I do.
```
> list_oml_data(test_server = TRUE)
INFO  [15:21:27.606] Retrieving JSON {url: `https://test.openml.org/api/v1/json/data/list/limit/1000`, authenticated: `TRUE`}
Error in parse_con(txt, bigint_as_char) :
  lexical error: invalid character inside string.
          t":"ARFF", "md5_checksum":" <div style="border:1px solid
                     (right here) ------^
```
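As an aside, the fragment in that error suggests the server spliced an HTML warning `<div>` directly into a JSON string value, which makes the whole document unparseable. A tiny illustration (the payload below is a fabricated fragment mimicking the log above, not the actual server response):

```python
import json

# Fabricated fragment: an HTML div injected into a JSON string value,
# as in the error above. The unescaped quotes inside the div's style
# attribute terminate the string early and break the parse.
payload = '{"format": "ARFF", "md5_checksum": " <div style="border:1px solid"}'
try:
    json.loads(payload)
except json.JSONDecodeError as e:
    print(e)
```

So this looks like the same root cause as the XML errors above: server-side warnings leaking into the response body.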
I have tried with the API as well as through the website. When trying to upload a dataset to the test server, I encounter the following error: