openml / openml-python

Python module to interface with OpenML
https://openml.github.io/openml-python/main/

Issue downloading dataset #942

Open MichaelMMeskhi opened 3 years ago

MichaelMMeskhi commented 3 years ago

Description

Calling dataset = openml.datasets.get_dataset(did) raises arff.BadAttributeType ("Bad @ATTRIBUTE type").

Steps/Code to Reproduce

import openml
did = 41764  # one of the affected dataset IDs (FOREX_gbpusd-day-High)
dataset = openml.datasets.get_dataset(did)

Expected Results

No errors thrown.

Actual Results

  File "mfe_3.py", line 33, in <module>
    dataset = openml.datasets.get_dataset(data)
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/functions.py", line 530, in get_dataset
    description, features, qualities, arff_file
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/functions.py", line 1023, in _create_dataset_from_description
    qualities=qualities,
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/dataset.py", line 183, in __init__
    self.data_pickle_file = self._create_pickle_in_cache(data_file)
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/dataset.py", line 423, in _create_pickle_in_cache
    X, categorical, attribute_names = self._parse_data_from_arff(data_file)
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/dataset.py", line 316, in _parse_data_from_arff
    data = self._get_arff(self.format)
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/dataset.py", line 295, in _get_arff
    return decode_arff(fh)
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/dataset.py", line 288, in decode_arff
    return_type=return_type)
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/arff.py", line 895, in decode
    raise e
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/arff.py", line 892, in decode
    matrix_type=return_type)
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/arff.py", line 822, in _decode
    attr = self._decode_attribute(row)
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/arff.py", line 764, in _decode_attribute
    raise BadAttributeType()
arff.BadAttributeType: Bad @ATTRIBUTE type, at line 2

Versions

macOS-10.15.6-x86_64-i386-64bit
Python 3.8.3 (default, Jul  2 2020, 11:26:31) 
[Clang 10.0.0 ]
NumPy 1.18.5
SciPy 1.5.0
Scikit-Learn 0.23.1
OpenML 0.10.2

mfeurer commented 3 years ago

Could you please give some details for which dataset(s) this happens?

MichaelMMeskhi commented 3 years ago

> Could you please give some details for which dataset(s) this happens?

Please see the attached list of dataset IDs (did) that I have found so far.

mfe_medium_errors.txt

MichaelMMeskhi commented 3 years ago

They all seem to come from a single source: FOREX trading data.

did                                                   41764
name                                  FOREX_gbpusd-day-High
version                                                   1
uploader                                                  1
status                                               active
format                                                 arff
MajorityClassSize                                       937
MaxNominalAttDistinctValues                               2
MinorityClassSize                                       897
NumberOfClasses                                           2
NumberOfFeatures                                         12
NumberOfInstances                                      1834
NumberOfInstancesWithMissingValues                        0
NumberOfMissingValues                                     0
NumberOfNumericFeatures                                  11

mfeurer commented 3 years ago

Hey, the issue here is that this dataset contains fields of type 'date', which are not supported by the ARFF parser we use in Python (liac-arff). There's an open PR to support that (https://github.com/renatopp/liac-arff/pull/67), but it has gone stale. We'd be happy if you'd like to pick that up.
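
To check whether a given ARFF file is affected before loading it, something like the following could scan the header for DATE attributes. This is a minimal standard-library sketch, not part of openml-python or liac-arff; `find_date_attributes` and the sample file are made up for illustration, and the sample only mimics what a FOREX header might look like.

```python
import re

# Matches a header line such as:  @ATTRIBUTE Timestamp DATE "yyyy-MM-dd"
# The attribute name may be bare or quoted; the date format is optional.
_DATE_ATTR = re.compile(
    r"""^\s*@attribute\s+                  # keyword (matched case-insensitively)
        (?P<name>'[^']*'|"[^"]*"|\S+)\s+   # bare or quoted attribute name
        date\b                             # the type the parser rejects
        \s*(?P<fmt>.*?)\s*$                # optional format string
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_date_attributes(arff_text):
    """Return (line_number, name, format) for each DATE attribute in the header."""
    found = []
    for lineno, line in enumerate(arff_text.splitlines(), start=1):
        if line.strip().lower().startswith("@data"):
            break  # the header ends where the data section starts
        match = _DATE_ATTR.match(line)
        if match:
            name = match.group("name").strip("'\"")
            fmt = match.group("fmt").strip("'\"")
            found.append((lineno, name, fmt))
    return found


# Hypothetical sample; the real files are not reproduced in this thread.
sample = """@RELATION forex
@ATTRIBUTE Timestamp DATE "yyyy-MM-dd"
@ATTRIBUTE Bid_Open NUMERIC
@ATTRIBUTE Class {True,False}
@DATA
"2018-01-01",1.35,True
"""

print(find_date_attributes(sample))
```

An empty result means the header declares no DATE attributes, so a BadAttributeType error would have to come from something else.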

MichaelMMeskhi commented 3 years ago

@mfeurer I will look into it and try to see what I can do about it. Thanks for the feedback!

joaquinvanschoren commented 2 years ago

Hi all, is there any progress on this issue?

PGijsbers commented 2 years ago

Yes/No.

Yes: Since 0.12.0 the get_dataset call should no longer raise the error because the data is not actually loaded anymore with that call. This means you get access to the dataset object and metadata.

No: The ARFF parser still does not support the 'date' data type in the ARFF file. As soon as you actually try to load the data (e.g. with OpenMLDataset.get_data()), the same error is thrown.

To me it makes the most sense to wait until the dataset is available in parquet format, since that should hopefully work without issues (and if not, it's worthwhile to improve the parquet support).

chclam commented 2 years ago

Hi, I'm a master's student under supervision of @joaquinvanschoren.

As a temporary fix, you could convert the timestamps to a Unix timestamp format and hint to ARFF that it is a numeric type.

I've made some quick adjustments to the decode_arff function that do exactly that. Check it out at: https://github.com/chclam/openml-python/commit/136c27940b3cb9974e8272bae67007b4e1be5dc8
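
For readers who want to see the idea without reading the commit, here is a rough standalone sketch. It is not the code from the linked commit: it rewrites the ARFF text before it reaches the parser, and the function name `dates_to_unix` and the sample data are hypothetical. It assumes a dense (non-sparse) file, unquoted attribute names, double-quoted or bare values, one shared date format given in Python strptime syntax, and timestamps interpreted as UTC.

```python
import csv
import io
import re
from datetime import datetime, timezone

_DATE_ATTR = re.compile(r"^\s*@attribute\s+(\S+)\s+date\b.*$", re.IGNORECASE)


def dates_to_unix(arff_text, strptime_format="%Y-%m-%d"):
    """Declare DATE attributes as NUMERIC and convert their values to Unix time."""
    out = []
    date_columns = []  # column indices of attributes declared as DATE
    n_attributes = 0
    in_data = False
    for line in arff_text.splitlines():
        stripped = line.strip()
        if not in_data:
            lower = stripped.lower()
            if lower.startswith("@attribute"):
                match = _DATE_ATTR.match(line)
                if match:
                    date_columns.append(n_attributes)
                    line = f"@ATTRIBUTE {match.group(1)} NUMERIC"
                n_attributes += 1
            elif lower.startswith("@data"):
                in_data = True
            out.append(line)
            continue
        if not stripped or stripped.startswith("%"):
            out.append(line)  # keep blank lines and comments unchanged
            continue
        row = next(csv.reader([stripped], skipinitialspace=True))
        for col in date_columns:
            if row[col] != "?":  # "?" marks a missing value in ARFF
                parsed = datetime.strptime(row[col], strptime_format)
                # Assumption: naive timestamps are treated as UTC.
                parsed = parsed.replace(tzinfo=timezone.utc)
                row[col] = str(int(parsed.timestamp()))
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="").writerow(row)
        out.append(buffer.getvalue())
    return "\n".join(out) + "\n"


# Hypothetical sample, not taken from the actual FOREX files.
sample = """@RELATION forex
@ATTRIBUTE Timestamp DATE "yyyy-MM-dd"
@ATTRIBUTE Bid_Open NUMERIC
@ATTRIBUTE Class {True,False}
@DATA
"2018-01-01",1.35,True
"""

converted = dates_to_unix(sample)
print(converted)
```

The converted text declares only NUMERIC and nominal attributes. The trade-off is that the column loses its date semantics, so downstream code has to know the numbers are seconds since the epoch.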