openml / openml-python

OpenML's Python API for a World of Data and More 💫
http://openml.github.io/openml-python/

Uploading datasets with string columns to openml via api fails #1123

Open Louquinze opened 2 years ago

Louquinze commented 2 years ago

More informative Code to Reproduce

import requests
import pandas as pd
import openml
from openml.datasets.functions import create_dataset

# upload to test server
openml.config.start_using_configuration_for_example()

url = 'https://zenodo.org/record/3665663/files/dataset.csv?download=1'
r = requests.get(url)
with open('cybertroll.csv', 'wb') as f:
    f.write(r.content)
df = pd.read_csv("cybertroll.csv")
# Uncommenting the next line works around the problem: it deletes all non-word
# characters, including the backslashes at the end of lines.
# df['content'] = df['content'].str.replace(r'\W', '', regex=True)

cybertroll_dataset = create_dataset(
    name="Cybertroll",
    description="Tweets classified as aggressive or not to help fight trolls.",
    creator="Saima Sadiq",
    contributor=None,
    collection_date="11-17-2021",
    language="English",
    licence="Creative Commons Attribution 1.0 Generic",
    default_target_attribute="annotation",
    row_id_attribute=None,
    ignore_attribute=None,
    citation="dummy citation",
    attributes="auto",
    data=df,
    version_label="test",
    original_data_url="https://zenodo.org/record/3665663",
    paper_url="https://zenodo.org/record/3665663",
)

cybertroll_dataset.publish()
print(f"URL for dataset: {cybertroll_dataset.openml_url}")
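The commented-out workaround in the script above strips every non-word character, so it also removes spaces and punctuation from the tweets. A narrower variant is sketched here, assuming that only backslashes at the very end of a value trigger the server error, which is what the minimal example in this issue suggests. The helper name `strip_trailing_backslashes` is made up for illustration and is not part of openml-python or pandas.

```python
import re

# Matches one or more backslashes at the very end of a string.
_TRAILING_BACKSLASHES = re.compile(r"\\+\Z")


def strip_trailing_backslashes(value):
    """Return value without trailing backslashes (hypothetical helper).

    Non-string values (for example NaN for missing entries) pass through
    unchanged.
    """
    if not isinstance(value, str):
        return value
    return _TRAILING_BACKSLASHES.sub("", value)


if __name__ == "__main__":
    # Trailing backslashes are removed, everything else is kept.
    print(strip_trailing_backslashes("test\\\\"))
    # Leading backslashes, spaces and punctuation are left alone.
    print(strip_trailing_backslashes("\\\\test, with spaces!"))
```

With the DataFrame from the script above, this could be applied as `df['content'] = df['content'].map(strip_trailing_backslashes)` before calling `create_dataset`. Whether this is enough for the full Cybertroll data has not been verified here.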

Steps/Code to Reproduce

import pandas as pd

import openml
from openml.datasets.functions import create_dataset

openml.config.start_using_configuration_for_example()

# The error occurs only if the double backslash is at the end of the string.
# If the commented-out assignment below is used in place of the active one,
# the upload is successful.

# df = pd.DataFrame({"X1": [1], "X2": [r"\\test"], "y": [1]}).astype({"X2": "string"})
df = pd.DataFrame({"X1": [1], "X2": [r"test\\"], "y": [1]}).astype({"X2": "string"})
dummy_dataset = create_dataset(
    name="DummyDataset",
    description="dummy dataset",
    creator="Lukas Strack",
    contributor=None,
    collection_date="11-17-2021",
    language="English",
    licence=None,
    default_target_attribute="y",
    row_id_attribute=None,
    ignore_attribute=None,
    citation="dummy citation",
    attributes="auto",
    data=df,
    version_label="test",
    original_data_url=None,
    paper_url=None,
)

dummy_dataset.publish()
print(f"URL for dataset: {dummy_dataset.openml_url}")

Expected Results

print(f"URL for dataset: {dummy_dataset.openml_url}") should output something like:

URL for dataset: https://test.openml.org/d/4005

Actual Results

/home/lukas/anaconda3/envs/Hackathon/lib/python3.9/site-packages/openml/config.py:177: UserWarning: Switching to the test server https://test.openml.org/api/v1/xml to not upload results to the live server. Using the test server may result in reduced performance of the API!
  warnings.warn(
Traceback (most recent call last):
  File "*working_directory*/upload_arff_error.py", line 33, in <module>
    dummy_dataset.publish()
  File "*conda_env_path*/lib/python3.9/site-packages/openml/base.py", line 130, in publish
    response_text = openml._api_calls._perform_api_call(
  File "*conda_env_path*/lib/python3.9/site-packages/openml/_api_calls.py", line 65, in _perform_api_call
    response = _read_url_files(url, data=data, file_elements=file_elements)
  File "*conda_env_path*/lib/python3.9/site-packages/openml/_api_calls.py", line 197, in _read_url_files
    response = _send_request(request_method="post", url=url, data=data, files=file_elements,)
  File "*conda_env_path*/lib/python3.9/site-packages/openml/_api_calls.py", line 248, in _send_request
    __check_response(response=response, url=url, file_elements=files)
  File "*conda_env_path*/lib/python3.9/site-packages/openml/_api_calls.py", line 295, in __check_response
    raise __parse_server_exception(response, url, file_elements=file_elements)
openml.exceptions.OpenMLServerException: https://test.openml.org/api/v1/xml/data/ returned code 145: Error parsing dataset ARFF file - Arff error in dataset file: missing trailing quote in string (l.9)

joaquinvanschoren commented 2 years ago

Thanks for reporting! I'll transfer this to the issue tracker of the python API.

mfeurer commented 2 years ago

Hi @joaquinvanschoren, this is 99.9% not a Python issue, as the error message is emitted by the server. The ARFF file that is produced and uploaded can be opened in WEKA without any issues, so we assume the problem is in the PHP upload checker (the file we opened is not the one from the example, as that is only a minimal working example). I'll discuss with @Louquinze how to produce a more elaborate example that yields a full-blown ARFF file that can also be loaded in WEKA.

Louquinze commented 2 years ago

I edited the issue as Matthias suggested above.

joaquinvanschoren commented 2 years ago

Might be a fault in the PHP ARFF checker. Perhaps the double backslash gets changed, which makes the check fail?
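One way a checker fault could produce exactly the "missing trailing quote" message is sketched below. This is purely hypothetical and is not the actual PHP checker code. It assumes the ARFF writer quotes strings with single quotes and escapes each backslash by doubling it, and that a buggy checker treats any quote preceded by a backslash as escaped. All function names are invented for illustration.

```python
def escape_arff_string(value):
    """Quote a value as an ARFF writer might (assumed escaping scheme):
    double every backslash, escape single quotes, wrap in single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def closing_quote_naive(token):
    """Buggy scan: a quote counts as escaped whenever the previous
    character is a backslash, even if that backslash is itself escaped."""
    for i in range(1, len(token)):
        if token[i] == "'" and token[i - 1] != "\\":
            return i
    return None  # a checker would report "missing trailing quote"


def closing_quote_correct(token):
    """Correct scan: a backslash consumes the following character, so an
    escaped backslash cannot escape the closing quote."""
    i = 1
    while i < len(token):
        if token[i] == "\\":
            i += 2  # skip the backslash and the character it escapes
        elif token[i] == "'":
            return i
        else:
            i += 1
    return None


if __name__ == "__main__":
    for raw in (r"\\test", r"test\\"):
        token = escape_arff_string(raw)
        print(token, closing_quote_naive(token), closing_quote_correct(token))
```

Under these assumptions the naive scan finds the closing quote for the leading-backslash value but not for the trailing-backslash one, which matches the behaviour reported in this issue. It does not prove that the server works this way.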

mfeurer commented 2 years ago

We will re-evaluate this once Parquet-upload is available.