joaquinvanschoren opened 7 years ago
Update:
- Datasets with string features: 373, 374, 376, 379, 380
- Datasets with end-of-line comments behind attributes: 1074
- Server error ("The URI you submitted has disallowed characters."): 152, 153, 156, 157, 158, 159, 160
- Could not download (connection broken): 250, 252, 254, 264, 271, 1183, 40517
- Bad XML: 274
- Processing error (unknown): 1597
- No target feature: 4136, 4137, 4552, 1458, 1477, 1484, 1514, 1566
- Invalid ARFF:
The code used for this (and for later checks):
import sys
import openml

# Collect the unique dataset ids referenced by all tasks
# (`tasks` is assumed to be a task listing, e.g. from openml.tasks.list_tasks()).
dids = set()
for tid, task in tasks.items():
    dids.add(task['did'])

# Try to download and parse every dataset, reporting failures.
for did in dids:
    try:
        ds = openml.datasets.get_dataset(did)
        X, y = ds.get_data(target=ds.default_target_attribute)
    except Exception as e:
        print("error: ", did, sys.exc_info()[0], e)
@joaquinvanschoren We have a list of those in the python tracker: https://github.com/openml/openml-python/issues/310
Proposed actions:
Also: 1092, 1597
380 has more issues than just string features, same for 379 and the others around there.
@mfeurer just wanted to report that there are no more data quality processing errors for datasets 190, 372 and 1396 ;)
Update:
Still need to fix the 6 badly formatted datasets and check the 'connection broken' issue.
1037, 1039, 1042, 1074 are fixed. 4675 and 4709 remain 'in_preparation'; they seem like valid ARFF but have a 'date' attribute that doesn't seem to parse.
About the 'connection broken' issue, this seems tricky. I'm not sure whether this is a server issue or a python issue. I can download the dataset fine with wget and curl. E.g.:
wget https://www.openml.org/data/download/5599/BNG_mfeat-fourier.arff
This fails:
import requests
url = 'https://www.openml.org/data/download/5599'
r = requests.get(url)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read, 2 more expected)'
I tried with different http headers, but the issue remains.
It could be that wget and curl just ignore the error. There might be a way to ignore it with requests, too.
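For what it's worth, here is a minimal sketch (not from the thread) of how one could probe that with requests: stream the body and keep whatever arrived before the error. The URL is one of the failing files mentioned above.

import requests

url = 'https://www.openml.org/data/download/5599'
r = requests.get(url, stream=True)
chunks = []
try:
    # Read the body incrementally instead of all at once.
    for chunk in r.iter_content(chunk_size=64 * 1024):
        chunks.append(chunk)
except requests.exceptions.ChunkedEncodingError as e:
    print('connection broken after %d bytes: %s' % (sum(len(c) for c in chunks), e))
body = b''.join(chunks)  # the partial payload received before the error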
When I used the /data/download interface, I usually got an incomplete error... using the actual file name fixed it. It looks like you're using the filename with wget and the incomplete URL with requests. Is that incomplete URL a supported API or not? I asked that before, but it's not clear to me.
Interesting, can you give an example? I tried several datasets and they seem to work fine. As far as I know, using the filename or not shouldn't make a difference. @janvanrijn can explain better.
Using the full filename also doesn't help with requests:
url = 'https://www.openml.org/data/download/5599/BNG_mfeat-fourier.arff'
r = requests.get(url)
...
raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read, 2 more expected)',
I checked whether it has something to do with the file size, but bigger datasets, like KDDCup99_full (https://www.openml.org/data/download/53993), do download without problems.
Maybe it is something specific about these files...
Small update. I ran another test trying to download all datasets. Remaining issues:
These were fixed straight away:
I just tried to figure out where the ChunkedEncodingError could come from. One possibility is that the server kills the connection after ~30 seconds; you can verify this behaviour by pasting this piece of code:
def _read_url(url, data=None):
    data = {} if data is None else data
    if config.apikey is not None:
        data['api_key'] = config.apikey

    if len(data) == 0 or (len(data) == 1 and 'api_key' in data):
        # do a GET
        response = requests.get(url, params=data, stream=True)
    else:  # an actual post request
        # Using requests.post sets header 'Accept-encoding' automatically to
        # 'gzip,deflate'
        response = requests.post(url, data=data, stream=True)

    if '.arff' in url:
        import time
        st = time.time()
        iteration = 0
        # Print a timestamp per received chunk to see when the connection dies.
        for c in response.iter_content(10 * 1024):
            print(time.time() - st, iteration, c)
            iteration += 1

    if response.status_code != 200:
        raise _parse_server_exception(response, url=url)
    if 'Content-Encoding' not in response.headers or \
            response.headers['Content-Encoding'] != 'gzip':
        warnings.warn('Received uncompressed content from OpenML for %s.' % url)
    return response.text
into the file _api_calls.py, replacing the current _read_url function. It will not get to 30s.
I assume wget works because it is faster at downloading the file (it takes only 12s on my machine).
I can also download the file in the browser, but then I only get half of the file, i.e. ~550000 lines instead of 1M lines.
Dataset 1414 contains date attributes; these are not supported at the moment.
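For reference, this is the kind of header declaration involved (the attribute name is made up; the format string follows the ARFF specification):

% A 'date' attribute as it may appear in an ARFF header:
@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"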
I got an incomplete read when trying to do iris, so that was unrelated to the timing, I think.
Which iris ID? And is it reproducible?
Yes, and basically any dataset, when using https://www.openml.org/data/download/<DATASETID>, because (in my limited understanding) that's not actually part of the REST API.
Note that this API expects a file id, not a dataset id: https://www.openml.org/data/download/<FILE_ID>
It serves everything from datasets to uploaded runs, and the file ID is given by the 'normal' APIs, e.g. /api/v1/data/<DATASET_ID>.
For the first batch of datasets, the file_id is the same as the dataset_id, so iris can be downloaded via https://www.openml.org/data/download/61.
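To make the lookup concrete, here is a small sketch (not from the thread) that resolves a dataset ID to its file ID via the XML description and then downloads the ARFF; the oml:file_id tag and its namespace are assumptions based on the v1 XML responses:

import requests
import xml.etree.ElementTree as ET

DATASET_ID = 61  # iris; for newer datasets the file_id differs

# Fetch the dataset description and extract the file id from the XML.
desc = requests.get('https://www.openml.org/api/v1/data/%d' % DATASET_ID)
root = ET.fromstring(desc.content)
file_id = root.find('{http://openml.org/openml}file_id').text

# Download the actual ARFF file via the file id.
arff = requests.get('https://www.openml.org/data/download/%s' % file_id)
print(arff.text[:100])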
This works just fine for me. How can I reproduce your error?
If it helps, this is what I did trying to reproduce the incomplete read:
>>> import requests
>>> requests.get('https://www.openml.org/data/download/61')
<Response [200]>
@mfeurer Do you give a 'not supported' warning in the python API when a dataset has a 'date' attribute? They can remain active, right?
Hm, now that one works. Try:
from urllib.request import urlopen
result = urlopen('https://www.openml.org/data/download/4550')
result.read()
requests swallows the incomplete read so I'm using urllib.
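Side note (not from the thread): with urllib the truncated body is still accessible on the exception, which helps to see how much of the file actually arrives:

from urllib.request import urlopen
from http.client import IncompleteRead

try:
    body = urlopen('https://www.openml.org/data/download/4550').read()
except IncompleteRead as e:
    body = e.partial  # bytes received before the connection broke
    print('got %d bytes before the read failed' % len(body))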
I set all timeouts to 300 seconds now, and also updated some compression settings.
Let me know if it helps; I didn't have time to run your code yet.
@amueller File 4550 is not actually a dataset, it's an uploaded run. I fixed the problem: the file length was stored incorrectly in our database, hence the incorrect header. I'll check whether there are more of these errors among the run uploads.
If you want dataset 4550 (MiceProtein), that's file 1804243, as noted in the description: https://www.openml.org/api/v1/data/4550
Interestingly, this seems to work fine:
from urllib.request import urlopen
result = urlopen('https://www.openml.org/data/download/5599/BNG_mfeat-fourier.arff')
result.read()
While this still fails:
ds = openml.datasets.get_dataset(250)
Hm, I also have to revise my initial theory: the following works and takes more than 30 seconds, while in Python and in the browser the download still stops after 30 seconds:
feurerm@aadpool4:~/projects/openml/openml-serverdata-quality-bot$ time wget --limit-rate=10m https://www.openml.org/data/download/5599/BNG_mfeat-fourier.arff
--2017-11-02 13:29:47-- https://www.openml.org/data/download/5599/BNG_mfeat-fourier.arff
Resolving www.openml.org (www.openml.org)... 131.155.11.58
Connecting to www.openml.org (www.openml.org)|131.155.11.58|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 677722132 (646M) [text/plain]
Saving to: ‘BNG_mfeat-fourier.arff.1’
BNG_mfeat-fourier.arff.1 100%[==========================================================================================================================================>] 646,33M 10,4MB/s in 70s
2017-11-02 13:30:58 (9,25 MB/s) - ‘BNG_mfeat-fourier.arff.1’ saved [677722132/677722132]
real 1m10.973s
user 0m0.872s
sys 0m1.584s
Okay, I now copied the headers sent by my browser one by one, and this one reproduces the failure:
time wget -d https://www.openml.org/data/download/5599/BNG_mfeat-fourier.arff --header "Accept-Encoding: gzip, deflate, br"
Edit: adding -t 2 reduces the number of retries to 2, and the command then fails reproducibly within a minute. This explains why urllib works. So basically, if your internet connection is fast enough you don't want the files to be compressed, so that you can actually download them; but if your internet connection is 3.6 MB/s or slower (which is actually what I have at home ;)) you want compression.
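A rough back-of-envelope check of that claim, using the file sizes and the ~40 s server-side gzip time reported further down in this thread (the interpretation is one plausible reading, not established in the thread):

raw_mb, gzip_mb = 646.0, 109.0  # BNG_mfeat-fourier.arff: raw vs gzipped size
gzip_time_s = 40.0              # approx. time for the server to gzip the file

# The compressed stream cannot flow faster than gzip produces it:
print('gzip output rate: ~%.1f MB/s' % (gzip_mb / gzip_time_s))  # ~2.7 MB/s

# Slow connection (3.6 MB/s): the compressed transfer is gzip-bound at ~40 s,
# far better than the raw transfer:
print('raw at 3.6 MB/s: ~%.0f s' % (raw_mb / 3.6))               # ~180 s

# Fast connection (25 MB/s): the raw transfer wins, and waiting on the
# compressor is where a 30 s timeout can strike:
print('raw at 25 MB/s: ~%.0f s' % (raw_mb / 25.0))               # ~26 s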
Interestingly, the following (legacy) URL works: www.openml.org/data/view/5599
There are two differences: the number of headers and the way the file is read. This will probably help us debug the problem.
The legacy URL does not compress the data:
feurerm@aadpool4:~/projects/openml/openml-serverdata-quality-bot$ time wget -d www.openml.org/data/view/5599 --header "Accept-Encoding: gzip, deflate, br" -t 2
Setting --header (header) to Accept-Encoding: gzip, deflate, br
Setting --tries (tries) to 2
DEBUG output created by Wget 1.17.1 on linux-gnu.
Reading HSTS entries from /home/feurerm/.wget-hsts
URI encoding = ‘UTF-8’
--2017-11-02 13:47:00-- http://www.openml.org/data/view/5599
Resolving www.openml.org (www.openml.org)... 131.155.11.58
Caching www.openml.org => 131.155.11.58
Connecting to www.openml.org (www.openml.org)|131.155.11.58|:80... connected.
Created socket 3.
Releasing 0x0000560e1b0e87c0 (new refcount 1).
---request begin---
GET /data/view/5599 HTTP/1.1
User-Agent: Wget/1.17.1 (linux-gnu)
Accept: */*
Accept-Encoding: gzip, deflate, br
Host: www.openml.org
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 307 Temporary Redirect
Date: Thu, 02 Nov 2017 13:46:54 GMT
Server: Apache/2.4.18 (Ubuntu)
Location: https://www.openml.org/data/view/5599
Content-Length: 327
Keep-Alive: timeout=300, max=500
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
---response end---
307 Temporary Redirect
Registered socket 3 for persistent reuse.
URI content encoding = ‘iso-8859-1’
Location: https://www.openml.org/data/view/5599 [following]
Skipping 327 bytes of body: [<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>307 Temporary Redirect</title>
</head><body>
<h1>Temporary Redirect</h1>
<p>The document has moved <a href="https://www.openml.org/data/view/5599">here</a>.</p>
<hr>
<address>Apache/2.4.18 (Ubuntu) Server at www.openml.org Port 80</address>
</body></html>
] done.
URI content encoding = None
--2017-11-02 13:47:00-- https://www.openml.org/data/view/5599
Found www.openml.org in host_name_addresses_map (0x560e1b0e87c0)
Connecting to www.openml.org (www.openml.org)|131.155.11.58|:443... connected.
Created socket 4.
Releasing 0x0000560e1b0e87c0 (new refcount 1).
Initiating SSL handshake.
Handshake successful; connected socket 4 to SSL handle 0x0000560e1b108b90
certificate:
subject: CN=openml.org,O=Technische Universiteit Eindhoven,L=Eindhoven,ST=Noord Brabant,C=NL
issuer: CN=TERENA SSL CA 3,O=TERENA,L=Amsterdam,ST=Noord-Holland,C=NL
X509 certificate successfully verified and matches host www.openml.org
---request begin---
GET /data/view/5599 HTTP/1.1
User-Agent: Wget/1.17.1 (linux-gnu)
Accept: */*
Accept-Encoding: gzip, deflate, br
Host: www.openml.org
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Date: Thu, 02 Nov 2017 13:46:55 GMT
Server: Apache/2.4.18 (Ubuntu)
Set-Cookie: ci_session=9cb8704c3204dd3b120bee16e4d88c327d843433; expires=Thu, 02-Nov-2017 15:46:55 GMT; Max-Age=7200; path=/; HttpOnly
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
Content-Length: 677722132
Keep-Alive: timeout=300, max=500
Connection: Keep-Alive
Content-Type: application/octet-stream
---response end---
200 OK
Stored cookie www.openml.org -1 (ANY) / <permanent> <insecure> [expiry 2017-11-02 15:47:01] ci_session 9cb8704c3204dd3b120bee16e4d88c327d843433
Disabling further reuse of socket 3.
Closed fd 3
Registered socket 4 for persistent reuse.
Length: 677722132 (646M) [application/octet-stream]
Saving to: ‘5599’
5599 100%[==========================================================================================================================================>] 646,33M 23,1MB/s in 24s
2017-11-02 13:47:34 (27,1 MB/s) - ‘5599’ saved [677722132/677722132]
Saving HSTS entries to /home/feurerm/.wget-hsts
real 0m33.951s
user 0m0.840s
sys 0m1.532s
@janvanrijn: what is the difference in "the way the file is read" ?
@mfeurer Interesting...
It seems the response is correct or corrupted depending on which compression the client supports. Maybe there is confusion about the compression used, i.e. the client thinks that the server has compressed the data one way (e.g. 'deflate'), but it actually used another one (e.g. 'gzip'). This may have to do with cached versions of compressed files (or maybe not): https://stackoverflow.com/questions/7848796/what-does-varyaccept-encoding-mean
If I add a Content-Encoding header I can make that wget command work, but then the normal python API doesn't seem happy about it. It's also not a good idea to set it manually so I removed it again.
I assume it is actually taking too long to zip the file. Try time gzip BNG\(mfeat-fourier\).arff, which takes about 40 seconds (on my machine). From my understanding of mod_deflate, which I assume you use, you may want to change DeflateCompressionLevel to 1; it is probably set to 9 by default.
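If Apache's mod_deflate is indeed handling the compression, the change would be a one-line directive along these lines (a sketch, assuming compression is configured in the main server config):

<IfModule mod_deflate.c>
    # Trade compression ratio for speed (1 = fastest, 9 = best compression)
    DeflateCompressionLevel 1
</IfModule>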
A higher compression level seems preferable to a lower one. There is an easy way of increasing the maximum request time in PHP, which was higher in the past (before we moved to the new server).
If requests currently time out after 30 seconds, we could try doubling that.
It seems preferable at first thought, but I would actually argue that fast compression helps more on average, as most people can download faster than 4 MB/s. Also, the difference isn't very drastic. I tried zipping a half-downloaded file from OpenML:
-rw-r--r-- 1 feurerm aad 295M Nov 3 10:55 BNG_mfeat-fourier.arff
-rw-r--r-- 1 feurerm aad 127M Nov 3 10:56 BNG_mfeat-fourier.arff.2.gz
-rw-r--r-- 1 feurerm aad 109M Nov 2 13:21 BNG_mfeat-fourier.arff.gz
The first file is the original, the second is compressed via gzip --fast, and the third uses the default gzip setting. The fast option is about 4 times faster for gzip.
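A quick way to reproduce this trade-off locally with Python's gzip module (a sketch; 'some_dataset.arff' is a placeholder for any large ARFF file):

import gzip
import time

raw = open('some_dataset.arff', 'rb').read()  # placeholder path

for level in (1, 6, 9):  # roughly --fast, default, --best
    start = time.time()
    compressed = gzip.compress(raw, compresslevel=level)
    print('level %d: %6.2f s -> %6.1f MB' %
          (level, time.time() - start, len(compressed) / 1e6))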
I have now set the compression level to 1 to see if it solves the issue. I tested wget and datasets.get_dataset(250), and this seems to work.
Could you also run further tests?
Looking at some published benchmarks, a lower level (1-3) is about 300-400% faster and results in files about 10% larger than the highest levels (7-9). Also, I often see gzip taking significant CPU time on the server, so it's maybe not a bad idea to go for faster compression.
I can confirm that I can now download all the datasets. I will now go on to further checks.
Actually, being able to download all the datasets solves this issue for now, right?
I just checked dataset 1414 again, and the issue is actually not on the Python side, but OpenML not knowing the data type in the first place. You can see that the data type of the feature datetime is unknown, which I assume really shouldn't happen.
So, is there a way to read the datasets that give this error?
I'm running
openml.datasets.get_dataset(41895)
and still getting BadAttributeType: Bad @ATTRIBUTE type, at line 2.
...
Sorry, but the python API can't read date features yet. There are two ways forward:
I'm running a script trying to download and parse all active datasets with Python. So far, I got these errors:
- Datasets with string features: 374, 376, 379, 380
- Datasets with end-of-line comments behind attributes: 1074
I'll report back if more show up.