openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

Dataset parsing issues #476

Open joaquinvanschoren opened 7 years ago

joaquinvanschoren commented 7 years ago

I'm running a script trying to download and parse all active datasets with Python. So far, I got these errors:

Datasets with string features: 374, 376, 379, 380
Datasets with end-of-line comments behind attributes: 1074

I'll report back if more show up.

joaquinvanschoren commented 7 years ago

Update:

Datasets with string features: 373, 374, 376, 379, 380
Datasets with end-of-line comments behind attributes: 1074
Server error (The URI you submitted has disallowed characters.): 152, 153, 156, 157, 158, 159, 160
Could not download (connection broken): 250, 252, 254, 264, 271, 1183, 40517
Bad XML: 274
Processing error (unknown): 1597
No target feature: 4136, 4137, 4552, 1458, 1477, 1484, 1514, 1566
Invalid ARFF:

The code used for this (and for later checks):

import sys

import openml

# 'tasks' is a dict of task descriptions, e.g. as returned by openml.tasks.list_tasks()
# collect the ids of all datasets that are used in tasks
dids = set()
for tid, task in tasks.items():
    dids.add(task['did'])

# try to download and parse each dataset
for did in dids:
    try:
        ds = openml.datasets.get_dataset(did)
        X, y = ds.get_data(target=ds.default_target_attribute)
    except Exception as e:
        print("error: ", did, sys.exc_info()[0], e)

mfeurer commented 7 years ago

@joaquinvanschoren We have a list of those in the python tracker: https://github.com/openml/openml-python/issues/310

joaquinvanschoren commented 7 years ago

Proposed actions:

amueller commented 7 years ago

Also: 1092, 1597

amueller commented 7 years ago

380 has more issues than just string features, same for 379 and the others around there.

mfeurer commented 7 years ago

@joaquinvanschoren @janvanrijn just wanted to report that there are data quality processing errors for datasets 190, 372 and 1396.

joaquinvanschoren commented 7 years ago

@mfeurer just wanted to report that there are no more data quality processing errors for datasets 190, 372 and 1396 ;)

joaquinvanschoren commented 7 years ago

Update:

Still need to fix the 6 badly formatted datasets and check the 'connection broken' issue.

joaquinvanschoren commented 7 years ago

1037, 1039, 1042, 1074 are fixed. 4675 and 4709 remain 'in_preparation'; they seem like valid ARFF but have a 'date' attribute that doesn't seem to parse.

About the 'connection broken' issue, this seems tricky. I'm not sure whether this is a server issue or a python issue. I can download the dataset fine with wget and curl. E.g.:

wget https://www.openml.org/data/download/5599/BNG_mfeat-fourier.arff

This fails:

import requests
url = 'https://www.openml.org/data/download/5599'
r = requests.get(url)

requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read, 2 more expected)'

I tried with different http headers, but the issue remains.

amueller commented 7 years ago

It could be that wget and curl just ignore the error. There might be a way to ignore it with requests, too.
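
For example, something along these lines might work (an untested sketch; it streams the body and keeps whatever arrived before the connection broke):

import requests
from requests.exceptions import ChunkedEncodingError

url = 'https://www.openml.org/data/download/5599'
chunks = []
try:
    with requests.get(url, stream=True) as r:
        # consume the body in pieces so a broken connection only loses the tail
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            chunks.append(chunk)
except ChunkedEncodingError:
    pass  # ignore the error and keep the partial content
content = b''.join(chunks)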

amueller commented 7 years ago

When I used the /data/download interface, I usually got an incomplete read error... using the actual file name fixed it. It looks like you're using the filename with wget and the incomplete URL with requests. Is that incomplete URL a supported API or not? I asked that before, but it's not clear to me.

joaquinvanschoren commented 7 years ago

Interesting, can you give an example? I tried several datasets and they seem to work fine. As far as I know, using the filename or not shouldn't make a difference. @janvanrijn can explain better.

Using the full filename also doesn't help with requests:

url = 'https://www.openml.org/data/download/5599/BNG_mfeat-fourier.arff'
r = requests.get(url)
...
   raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read, 2 more expected)',

I checked whether it has something to do with the file size, but bigger datasets, like KDDCup99_full (https://www.openml.org/data/download/53993), do download without problems.

Maybe it is something specific about these files...

joaquinvanschoren commented 7 years ago

Small update. I ran another test trying to download all datasets. Remaining issues:

These were fixed straight away:

mfeurer commented 7 years ago

I just tried to figure out where the ChunkedEncodingError could come from. One possibility is that the server kills the connection after ~30 seconds; you can verify this behaviour by pasting this piece of code:

def _read_url(url, data=None):
    data = {} if data is None else data
    if config.apikey is not None:
        data['api_key'] = config.apikey

    if len(data) == 0 or (len(data) == 1 and 'api_key' in data):
        # do a GET
        response = requests.get(url, params=data, stream=True)
    else:  # an actual POST request
        # Using requests.post sets the header 'Accept-Encoding' automatically
        # to 'gzip,deflate'
        response = requests.post(url, data=data, stream=True)

    if '.arff' in url:
        # debugging: print the elapsed time for every chunk that arrives
        import time
        st = time.time()
        iteration = 0
        for c in response.iter_content(10 * 1024):
            print(time.time() - st, iteration, c)
            iteration += 1

    if response.status_code != 200:
        raise _parse_server_exception(response, url=url)
    if 'Content-Encoding' not in response.headers or \
            response.headers['Content-Encoding'] != 'gzip':
        warnings.warn('Received uncompressed content from OpenML for %s.' % url)
    return response.text

into the file _api_calls.py, replacing the current _read_url function. The printed elapsed time will not get to 30s before the connection breaks.

I assume wget works because it is faster at downloading the file (takes only 12s on my machine).

I can also download the file in the browser, but then I only get half of the file, i.e. ~550000 lines instead of 1M lines.
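
The same timing check can be done without touching the package (a sketch; it assumes the failure surfaces as a ChunkedEncodingError, as reported above):

import time

import requests

url = 'https://www.openml.org/data/download/5599/BNG_mfeat-fourier.arff'
start = time.time()
bytes_read = 0
try:
    with requests.get(url, stream=True) as r:
        for chunk in r.iter_content(chunk_size=10 * 1024):
            bytes_read += len(chunk)
except requests.exceptions.ChunkedEncodingError as e:
    print('failed after %.1fs, %d bytes read: %s' % (time.time() - start, bytes_read, e))
else:
    print('finished after %.1fs, %d bytes' % (time.time() - start, bytes_read))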

mfeurer commented 7 years ago

Dataset 1414 contains date attributes, these are not supported at the moment.

amueller commented 7 years ago

I got an incomplete read when trying to do iris, so that was unrelated to the timing, I think.

mfeurer commented 7 years ago

Which iris ID? And is it reproducible?

amueller commented 7 years ago

Yes, and basically any dataset, when using https://www.openml.org/data/download/<DATASETID>, because (in my limited understanding) that's not actually part of the REST API.

joaquinvanschoren commented 7 years ago

Note that this API expects a file id, not a dataset id: https://www.openml.org/data/download/<FILE_ID>

It serves everything from datasets to uploaded runs, and the file ID is given by the 'normal' APIs, e.g. /api/v1/data/<DATASET_ID>

For the first batch of datasets, the file_id is the same as the dataset_id, so iris can be downloaded by https://www.openml.org/data/download/61.

This works just fine for me. How can I reproduce your error?
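
If it helps, the file id can also be looked up programmatically from the XML description (a rough sketch; it assumes the description contains an <oml:file_id> tag and uses a regex instead of a proper XML parser):

import re
import urllib.request

# resolve a dataset id to the file id used by /data/download
did = 61
xml = urllib.request.urlopen(
    'https://www.openml.org/api/v1/data/%d' % did).read().decode('utf-8')
file_id = re.search(r'<oml:file_id>(\d+)</oml:file_id>', xml).group(1)
print('https://www.openml.org/data/download/%s' % file_id)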

joaquinvanschoren commented 7 years ago

If it helps, this is what I did trying to reproduce the incomplete read:

>>> import requests
>>> requests.get('https://www.openml.org/data/download/61')
<Response [200]>

joaquinvanschoren commented 7 years ago

@mfeurer Do you give a 'not supported' warning in the python API when a dataset has a 'date' attribute? They can remain active, right?

amueller commented 7 years ago

Hm, now that one works. Try:

from urllib.request import urlopen
result = urlopen('https://www.openml.org/data/download/4550')
result.read()

requests swallows the incomplete read, so I'm using urllib.
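
The difference can be seen side by side (a sketch based on the behaviour described above; requests may return quietly with partial content, while urllib raises):

import requests
from urllib.request import urlopen
from http.client import IncompleteRead

url = 'https://www.openml.org/data/download/4550'

# requests may come back with a 200 status and silently truncated content
r = requests.get(url)
print(r.status_code, len(r.content))

# urllib raises if fewer bytes arrive than the Content-Length header promised
try:
    urlopen(url).read()
except IncompleteRead as e:
    print('%d bytes read, %d more expected' % (len(e.partial), e.expected))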

joaquinvanschoren commented 7 years ago

I set all timeouts to 300 seconds now. I also updated some compression settings.

Let me know if it helps, I didn't have time to run your code yet.

joaquinvanschoren commented 7 years ago

@amueller File 4550 is not actually a dataset, it's an uploaded run. I fixed the problem: the file length was stored incorrectly in our database, hence the incorrect header. I'll do a check whether there are more of these errors with the run uploads.

If you want dataset 4550 (MiceProtein), that's file 1804243, as noted in the description: https://www.openml.org/api/v1/data/4550
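
A quick spot check for more of these (a sketch; it compares the advertised Content-Length with the bytes actually served, for a handful of file ids):

import urllib.request
from http.client import IncompleteRead

for fid in [4550, 1804243]:
    url = 'https://www.openml.org/data/download/%d' % fid
    try:
        with urllib.request.urlopen(url) as resp:
            expected = resp.headers.get('Content-Length')
            print(fid, 'ok: %d bytes (header said %s)' % (len(resp.read()), expected))
    except IncompleteRead as e:
        print(fid, 'truncated: %d read, %d more expected' % (len(e.partial), e.expected))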

joaquinvanschoren commented 7 years ago

Interestingly, this seems to work fine:

from urllib.request import urlopen
result = urlopen('https://www.openml.org/data/download/5599/BNG_mfeat-fourier.arff')
result.read()

While this still fails:

ds = openml.datasets.get_dataset(250)

mfeurer commented 7 years ago

Hm, I also have to change my initial theory, as the following works and takes more than 30 seconds, while in Python and in the browser the download still stops after 30 seconds:

feurerm@aadpool4:~/projects/openml/openml-serverdata-quality-bot$ time wget --limit-rate=10m https://www.openml.org/data/download/5599/BNG_mfeat-fourier.arff
--2017-11-02 13:29:47--  https://www.openml.org/data/download/5599/BNG_mfeat-fourier.arff
Resolving www.openml.org (www.openml.org)... 131.155.11.58
Connecting to www.openml.org (www.openml.org)|131.155.11.58|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 677722132 (646M) [text/plain]
Saving to: ‘BNG_mfeat-fourier.arff.1’

BNG_mfeat-fourier.arff.1                                    100%[==========================================================================================================================================>] 646,33M  10,4MB/s    in 70s     

2017-11-02 13:30:58 (9,25 MB/s) - ‘BNG_mfeat-fourier.arff.1’ saved [677722132/677722132]

real    1m10.973s
user    0m0.872s
sys 0m1.584s
mfeurer commented 7 years ago

Okay, I now copied the headers sent by my browser one by one, and this one reproduces the failure:

 time wget -d https://www.openml.org/data/download/5599/BNG_mfeat-fourier.arff --header "Accept-Encoding: gzip, deflate, br"

Edit: adding -t 2 reduces the number of retries to 2 and the code fails reproducibly within a minute. This explains why urllib works. So basically, if your internet connection is fast enough you don't want the files to be compressed so that you can actually download them, but if your internet connection is 3.6 MB/s or less (which is actually what I have at home ;)) you want compression.
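
A possible client-side workaround until the server is fixed (a sketch): request the identity encoding so the server doesn't compress on the fly.

import requests

url = 'https://www.openml.org/data/download/5599/BNG_mfeat-fourier.arff'
# ask the server not to compress the response on the fly
r = requests.get(url, headers={'Accept-Encoding': 'identity'}, stream=True)
with open('BNG_mfeat-fourier.arff', 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024 * 1024):
        f.write(chunk)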

janvanrijn commented 7 years ago

Interestingly, the following (legacy) url works: www.openml.org/data/view/5599

There are two differences: the number of headers and the way the file is read. This will probably help us debug the problem.

mfeurer commented 7 years ago

The legacy URL does not compress the data:

feurerm@aadpool4:~/projects/openml/openml-serverdata-quality-bot$ time wget -d www.openml.org/data/view/5599 --header "Accept-Encoding: gzip, deflate, br" -t 2
Setting --header (header) to Accept-Encoding: gzip, deflate, br
Setting --tries (tries) to 2
DEBUG output created by Wget 1.17.1 on linux-gnu.

Reading HSTS entries from /home/feurerm/.wget-hsts
URI encoding = ‘UTF-8’
--2017-11-02 13:47:00--  http://www.openml.org/data/view/5599
Resolving www.openml.org (www.openml.org)... 131.155.11.58
Caching www.openml.org => 131.155.11.58
Connecting to www.openml.org (www.openml.org)|131.155.11.58|:80... connected.
Created socket 3.
Releasing 0x0000560e1b0e87c0 (new refcount 1).

---request begin---
GET /data/view/5599 HTTP/1.1
User-Agent: Wget/1.17.1 (linux-gnu)
Accept: */*
Accept-Encoding: gzip, deflate, br
Host: www.openml.org
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.1 307 Temporary Redirect
Date: Thu, 02 Nov 2017 13:46:54 GMT
Server: Apache/2.4.18 (Ubuntu)
Location: https://www.openml.org/data/view/5599
Content-Length: 327
Keep-Alive: timeout=300, max=500
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

---response end---
307 Temporary Redirect
Registered socket 3 for persistent reuse.
URI content encoding = ‘iso-8859-1’
Location: https://www.openml.org/data/view/5599 [following]
Skipping 327 bytes of body: [<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>307 Temporary Redirect</title>
</head><body>
<h1>Temporary Redirect</h1>
<p>The document has moved <a href="https://www.openml.org/data/view/5599">here</a>.</p>
<hr>
<address>Apache/2.4.18 (Ubuntu) Server at www.openml.org Port 80</address>
</body></html>
] done.
URI content encoding = None
--2017-11-02 13:47:00--  https://www.openml.org/data/view/5599
Found www.openml.org in host_name_addresses_map (0x560e1b0e87c0)
Connecting to www.openml.org (www.openml.org)|131.155.11.58|:443... connected.
Created socket 4.
Releasing 0x0000560e1b0e87c0 (new refcount 1).
Initiating SSL handshake.
Handshake successful; connected socket 4 to SSL handle 0x0000560e1b108b90
certificate:
  subject: CN=openml.org,O=Technische Universiteit Eindhoven,L=Eindhoven,ST=Noord Brabant,C=NL
  issuer:  CN=TERENA SSL CA 3,O=TERENA,L=Amsterdam,ST=Noord-Holland,C=NL
X509 certificate successfully verified and matches host www.openml.org

---request begin---
GET /data/view/5599 HTTP/1.1
User-Agent: Wget/1.17.1 (linux-gnu)
Accept: */*
Accept-Encoding: gzip, deflate, br
Host: www.openml.org
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.1 200 OK
Date: Thu, 02 Nov 2017 13:46:55 GMT
Server: Apache/2.4.18 (Ubuntu)
Set-Cookie: ci_session=9cb8704c3204dd3b120bee16e4d88c327d843433; expires=Thu, 02-Nov-2017 15:46:55 GMT; Max-Age=7200; path=/; HttpOnly
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
Content-Length: 677722132
Keep-Alive: timeout=300, max=500
Connection: Keep-Alive
Content-Type: application/octet-stream

---response end---
200 OK

Stored cookie www.openml.org -1 (ANY) / <permanent> <insecure> [expiry 2017-11-02 15:47:01] ci_session 9cb8704c3204dd3b120bee16e4d88c327d843433
Disabling further reuse of socket 3.
Closed fd 3
Registered socket 4 for persistent reuse.
Length: 677722132 (646M) [application/octet-stream]
Saving to: ‘5599’

5599                                                        100%[==========================================================================================================================================>] 646,33M  23,1MB/s    in 24s     

2017-11-02 13:47:34 (27,1 MB/s) - ‘5599’ saved [677722132/677722132]

Saving HSTS entries to /home/feurerm/.wget-hsts

real    0m33.951s
user    0m0.840s
sys 0m1.532s
joaquinvanschoren commented 7 years ago

@janvanrijn: what is the difference in "the way the file is read" ?


joaquinvanschoren commented 7 years ago

@mfeurer Interesting...

It seems the response is correct or broken depending on the compression that the client supports. Maybe there is confusion about the compression used, i.e. the client thinks that the server has compressed the data one way (e.g. 'deflate'), but it actually used another one (e.g. 'gzip'). This may have to do with cached versions of compressed files (or maybe not): https://stackoverflow.com/questions/7848796/what-does-varyaccept-encoding-mean

If I add a Content-Encoding header I can make that wget command work, but then the normal python API doesn't seem happy about it. It's also not a good idea to set it manually so I removed it again.
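
One way to inspect what the server actually negotiates (a sketch; it assumes the endpoint answers HEAD requests):

import requests

url = 'https://www.openml.org/data/download/5599/BNG_mfeat-fourier.arff'
# compare the response headers with and without compression requested
for enc in ('identity', 'gzip, deflate, br'):
    r = requests.head(url, headers={'Accept-Encoding': enc})
    print(enc, '->', r.headers.get('Content-Encoding'),
          r.headers.get('Vary'), r.headers.get('Content-Length'))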

mfeurer commented 7 years ago

I assume it is actually taking too long to zip the file. Try time gzip BNG\(mfeat-fourier\).arff, which takes about 40 seconds (on my machine). From my understanding of mod_deflate, which I assume you use, you may want to change DeflateCompressionLevel to 1, which is probably set to 9 by default.

janvanrijn commented 7 years ago

A higher compression level seems preferable to a lower one. There is an easy way of increasing the maximum request time in PHP, which was higher in the past (before we moved to the new server).

If currently requests time out after 30 seconds, we could try to double it.

mfeurer commented 7 years ago

It seems preferable at first thought, but I would argue that fast compression actually helps more on average, as most people can download more than 4 MB/s. Also, the difference isn't very drastic. I tried zipping a half-downloaded file from OpenML:

-rw-r--r--  1 feurerm aad     295M Nov  3 10:55 BNG_mfeat-fourier.arff
-rw-r--r--  1 feurerm aad     127M Nov  3 10:56 BNG_mfeat-fourier.arff.2.gz
-rw-r--r--  1 feurerm aad     109M Nov  2 13:21 BNG_mfeat-fourier.arff.gz

The first file is the original, the second is compressed via gzip --fast, and the third uses the default gzip setting. The fast option is about 4 times faster for gzip.
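
The trade-off is easy to reproduce locally, e.g. with zlib (a sketch; it assumes the half-downloaded file from above is present):

import time
import zlib

with open('BNG_mfeat-fourier.arff', 'rb') as f:
    data = f.read()

# compare the fastest and the best compression level on the same data
for level in (1, 9):
    start = time.time()
    out = zlib.compress(data, level)
    print('level %d: %.1fs, %.0f MB' % (level, time.time() - start, len(out) / 1e6))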

joaquinvanschoren commented 7 years ago

I have now set the compression level to 1 to see if it solves the issue. I tested wget and datasets.get_dataset(250), and both seem to work.

Could you also run further tests?

Looking at some published benchmarks, a lower level (1-3) is about 300-400% faster and results in files about 10% larger than the highest levels (7-9). Also, I often see gzip taking significant CPU time on the server, so it's maybe not a bad idea to go for faster compression.

mfeurer commented 7 years ago

I can confirm that I can now download all the datasets. I will now go on to further checks.

mfeurer commented 7 years ago

Actually, being able to download all the datasets solves this issue for now, right?

mfeurer commented 7 years ago

I just checked dataset 1414 again, and the issue is actually not on the Python side, but in OpenML not knowing the data type in the first place. You can see that the data type of the feature datetime is unknown, which I assume really shouldn't happen.

robotenique commented 5 years ago

So, is there a way to read the datasets that give this error?

I'm running

openml.datasets.get_dataset(41895)

and still getting BadAttributeType: Bad @ATTRIBUTE type, at line 2. ...

mfeurer commented 5 years ago

Sorry, but the Python API can't read date features yet. There are two ways forward (a possible stopgap is sketched after the list):

  1. Take over PR https://github.com/renatopp/liac-arff/pull/67 and finish the feature to load the date data type
  2. Use R or Java to read the data
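
As a stopgap, one could also patch the ARFF header before parsing, rewriting date attributes to plain strings (a rough sketch using liac-arff; the URL is a placeholder for the dataset's file id):

import re
import urllib.request

import arff  # liac-arff

url = 'https://www.openml.org/data/download/<FILE_ID>/dataset.arff'  # placeholder
raw = urllib.request.urlopen(url).read().decode('utf-8')
# rewrite '@attribute <name> date <format>' lines to the string type so that
# liac-arff loads the values verbatim instead of raising BadAttributeType
patched = re.sub(r'(?im)^(@attribute\s+\S+\s+)date\b.*$', r'\1string', raw)
data = arff.loads(patched)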