openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

Trouble uploading datasets to the test server #1159

Open sebffischer opened 2 years ago

sebffischer commented 2 years ago

I have tried both with the API and through the website. When trying to upload a dataset to the test server, I encounter the following error:

A PHP Error was encountered

Severity: Warning

Message: simplexml_load_string(): Entity: line 70: parser error : Extra content at the end of the document

Filename: new/post.php

Line Number: 155

Backtrace:

File: /var/www/openml/OpenML/openml_OS/views/pages/frontend/new/post.php
Line: 155
Function: simplexml_load_string

File: /var/www/openml/OpenML/openml_OS/helpers/cms_helper.php
Line: 19
Function: view

File: /var/www/openml/OpenML/openml_OS/controllers/Frontend.php
Line: 89
Function: loadpage

File: /var/www/openml/OpenML/index.php
Line: 334
Function: require_once

joaquinvanschoren commented 2 years ago

This is just a warning. The test server is supposed to show these. The production server doesn't. Did you actually have a problem uploading the dataset?

sebffischer commented 2 years ago

Thanks for the clarification! However, this does not work for me:

from openml.datasets import create_dataset
import sklearn
import numpy as np
from sklearn import datasets
import openml

openml.config.apikey = "API_TEST_KEY"
openml.config.server = "https://test.openml.org/api/v1"

diabetes = sklearn.datasets.load_diabetes()
name = "Diabetes(scikit-learn)"
X = diabetes.data
y = diabetes.target
attribute_names = diabetes.feature_names
description = diabetes.DESCR

data = np.concatenate((X, y.reshape((-1, 1))), axis=1)
attribute_names = list(attribute_names)
attributes = [(attribute_name, "REAL") for attribute_name in attribute_names] + [
    ("class", "INTEGER")
]
citation = (
    "Bradley Efron, Trevor Hastie, Iain Johnstone and "
    "Robert Tibshirani (2004) (Least Angle Regression) "
    "Annals of Statistics (with discussion), 407-499"
)
paper_url = "https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf"

diabetes_dataset = create_dataset(
    # The name of the dataset (needs to be unique).
    # Must not be longer than 128 characters and only contain
    # a-z, A-Z, 0-9 and the following special characters: _\-\.(),
    name=name,
    # Textual description of the dataset.
    description=description,
    # The person who created the dataset.
    creator="Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani",
    # People who contributed to the current version of the dataset.
    contributor=None,
    # The date the data was originally collected, given by the uploader.
    collection_date="09-01-2012",
    # Language in which the data is represented.
    # Starts with 1 upper case letter, rest lower case, e.g. 'English'.
    language="English",
    # License under which the data is/will be distributed.
    licence="BSD (from scikit-learn)",
    # Name of the target. Can also have multiple values (comma-separated).
    default_target_attribute="class",
    # The attribute that represents the row-id column, if present in the
    # dataset.
    row_id_attribute=None,
    # Attribute or list of attributes that should be excluded in modelling, such as
    # identifiers and indexes. E.g. "feat1" or ["feat1","feat2"]
    ignore_attribute=None,
    # How to cite the paper.
    citation=citation,
    # Attributes of the data
    attributes=attributes,
    data=data,
    # A version label which is provided by the user.
    version_label="test",
    original_data_url="https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html",
    paper_url=paper_url,
)

diabetes_dataset.publish()
print(f"URL for dataset: {diabetes_dataset.openml_url}")

gives me

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sebi/.local/lib/python3.8/site-packages/openml/base.py", line 133, in publish
    xml_response = xmltodict.parse(response_text)
  File "/home/sebi/.local/lib/python3.8/site-packages/xmltodict.py", line 327, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: junk after document element: line 70, column 0
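
The `junk after document element` message is what expat reports when something follows the closing tag of the root element. Below is a minimal sketch of how this can arise, assuming the test server appends an HTML-formatted PHP warning to an otherwise valid XML response. The response body and the helper `try_parse` are made up for illustration and are not the actual server output or part of openml-python.

```python
import xml.parsers.expat

# Hypothetical response body: a valid XML document followed by an HTML
# fragment, as a PHP warning might be rendered. Not real server output.
RESPONSE = (
    '<oml:upload_data_set xmlns:oml="http://openml.org/openml">\n'
    "  <oml:id>1</oml:id>\n"
    "</oml:upload_data_set>\n"
    '<div style="border:1px solid #990000">A PHP Error was encountered</div>\n'
)


def try_parse(text):
    """Return None if text is one well-formed XML document, else the ExpatError."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as err:
        return err
    return None


error = try_parse(RESPONSE)
# The error carries the position where the unexpected content starts.
print(error)
print(error.lineno, error.offset)
```

If this assumption holds, the line and column in the traceback above would point at the place where the warning markup begins in the response.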

sebffischer commented 2 years ago

Similar for this:

import openml
from sklearn import compose, ensemble, impute, neighbors, preprocessing, pipeline, tree

openml.config.apikey = "TEST_KEY"
openml.config.server = "https://test.openml.org/api/v1"
# NOTE: We are using dataset 68 from the test server: https://test.openml.org/d/68
dataset = openml.datasets.get_dataset(68)
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format="array", target=dataset.default_target_attribute
)
clf = neighbors.KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)

dataset = openml.datasets.get_dataset(17)
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format="array", target=dataset.default_target_attribute
)
print(f"Categorical features: {categorical_indicator}")
transformer = compose.ColumnTransformer(
    [("one_hot_encoder", preprocessing.OneHotEncoder(categories="auto"), categorical_indicator)]
)
X = transformer.fit_transform(X)
clf.fit(X, y)

# Get a task
task = openml.tasks.get_task(403)

# Build any classifier or pipeline
clf = tree.DecisionTreeClassifier()

# Run the flow
run = openml.runs.run_model_on_task(clf, task)

print(run)

myrun = run.publish()
# For this tutorial, our configuration publishes to the test server
# so as not to pollute the main server.
print(f"Uploaded to {myrun.openml_url}")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sebi/.local/lib/python3.8/site-packages/openml/base.py", line 133, in publish
    xml_response = xmltodict.parse(response_text)
  File "/home/sebi/.local/lib/python3.8/site-packages/xmltodict.py", line 327, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: junk after document element: line 61, column 6
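
To see what the server appended after the XML, one option is to split the raw response at the position expat reports. The following is a rough debugging sketch, not part of openml-python: `split_xml_and_trailer` and `SAMPLE` are hypothetical, and the index arithmetic assumes `\n` line endings and ASCII content.

```python
import xml.parsers.expat


def split_xml_and_trailer(text):
    """Split text into (xml_document, trailing_content).

    Assumes "\n" line endings and ASCII content, so that the line and
    column reported by expat map directly onto string indices.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as err:
        junk = xml.parsers.expat.errors.XML_ERROR_JUNK_AFTER_DOC_ELEMENT
        if xml.parsers.expat.errors.messages[err.code] != junk:
            raise  # some other XML problem; do not guess
        lines = text.split("\n")
        # err.lineno is 1-based; err.offset is the 0-based column.
        cut = sum(len(line) + 1 for line in lines[: err.lineno - 1]) + err.offset
        return text[:cut], text[cut:]
    return text, ""


# Made-up response where the extra content starts mid-line.
SAMPLE = "<run>\n  <id>7</id>\n</run><div>Warning</div>\n"

xml_part, trailer = split_xml_and_trailer(SAMPLE)
print(repr(trailer))
```

Printing the trailer of a real response this way would show whether the extra content is a PHP warning or something else.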

joaquinvanschoren commented 2 years ago

Hi Seb,

I recently fixed a number of issues with the test server. Can you please check if this issue is now resolved?

Thanks!

sebffischer commented 2 years ago

It seems that listing datasets from the test server does not work. I have not yet checked the other things (e.g. uploading), but I will report back once I have.


> list_oml_data(test_server = TRUE)
INFO  [15:21:27.606] Retrieving JSON {url: `https://test.openml.org/api/v1/json/data/list/limit/1000`, authenticated: `TRUE`}
Error in parse_con(txt, bigint_as_char) :
  lexical error: invalid character inside string.
          t":"ARFF",    "md5_checksum":" <div style="border:1px solid
                     (right here) ------^
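
The fragment in the error suggests that HTML markup ended up inside a JSON string value. A small Python sketch of the same kind of failure follows, assuming a warning block leaked into the `md5_checksum` field. The payload and the helper `parse_or_explain` are invented for illustration; the exact cause on the server is not confirmed here, and the Python error message differs from the R one.

```python
import json

# Hypothetical payload: an HTML warning block has leaked into a string value,
# and its unescaped double quotes end the JSON string too early.
PAYLOAD = '{"format":"ARFF", "md5_checksum":" <div style="border:1px solid #990000">"}'


def parse_or_explain(payload):
    """Return (data, None) on success or (None, message) if the JSON is invalid."""
    try:
        return json.loads(payload), None
    except json.JSONDecodeError as err:
        start = max(err.pos - 20, 0)
        context = payload[start : err.pos + 20]
        return None, f"{err.msg} near: {context!r}"


data, problem = parse_or_explain(PAYLOAD)
print(problem)
```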