Open anjakefala opened 2 years ago
anja@allura:git/readysetdata ‹dougb_wpsummaries*›$ time make wikidata
OUTDIR=output/wikidata scripts/wikidata.sh
[6041.3s] 1180688KilledMB (0.18 MB/s) latest-all.json.bz2
Traceback (most recent call last):
File "/home/anja/git/readysetdata/scripts/download.py", line 11, in <module>
sys.stdout.buffer.write(r)
BrokenPipeError: [Errno 32] Broken pipe
make: *** [Makefile:26: wikidata] Error 137
make wikidata 2253.43s user 305.53s system 42% cpu 1:40:47.28 total
New URL: https://geonames.nga.mil/geonames/GNSData/fc_files/Whole_World.7z
anja@allura:git/readysetdata ‹dougb_wpsummaries*›$ scripts/geonames-nonus.py -o output
Traceback (most recent call last):
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._validate_conn(conn)
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1040, in _validate_conn
conn.connect()
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connection.py", line 414, in connect
self.sock = ssl_wrap_socket(
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
ssl_sock = _ssl_wrap_socket_impl(
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/ssl.py", line 500, in wrap_socket
return self.sslsocket_class._create(
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/ssl.py", line 1040, in _create
self.do_handshake()
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/ssl.py", line 1309, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/anja/git/readysetdata/scripts/geonames-nonus.py", line 31, in <module>
} for r in parse_asv(unzip_url(URL).open_text('Countries.txt'))))
File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 101, in open_text
return io.TextIOWrapper(io.BufferedReader(self.open(fn)))
File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 81, in open
f = list(self.matching_files(fn))
File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 75, in matching_files
for f in self.files.values():
File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 41, in files
return {r.filename:r for r in self.infolist()}
File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 41, in <dictcomp>
return {r.filename:r for r in self.infolist()}
File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 44, in infolist
resp = self.http.request('HEAD', self.url)
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/request.py", line 74, in request
return self.request_encode_url(
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/request.py", line 96, in request_encode_url
return self.urlopen(method, url, **extra_kw)
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/poolmanager.py", line 376, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 813, in urlopen
return self.urlopen(
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 813, in urlopen
return self.urlopen(
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 813, in urlopen
return self.urlopen(
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 785, in urlopen
retries = retries.increment(
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='geonames.nga.mil', port=443): Max retries exceeded with url: /gns/html/cntyfile/geonames_20220606.zip (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))
make movielens
(It completes successfully, but raises this one exception near the end.)
453s 6.77/125.89MB (0.01 MB/s) movie_dataset_public_final/raw/ratings.json
Traceback (most recent call last):
File "/home/anja/git/readysetdata/readysetdata/output.py", line 24, in output
r = next(it)
File "/home/anja/git/readysetdata/scripts/movielens.py", line 48, in <genexpr>
output('movielens', 'ratings', ({
File "/home/anja/git/readysetdata/readysetdata/utils.py", line 147, in __iter__
yield AttrDict(json.loads(line))
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 28 (char 27)
None
0s 0.00/0.36MB (0.00 MB/s) movie_dataset_public_final/raw/survey_answers.json
[12.0s] 42100
12s 0.26/0.36MB (0.02 MB/s) movie_dataset_public_final/raw/survey_answers.json
[16.5s] 58500
17s 0.36/0.36MB (0.02 MB/s) movie_dataset_public_final/raw/survey_answers.json
[16.6s] 58900
17s 0.36/0.36MB (0.02 MB/s) movie_dataset_public_final/raw/survey_answers.json
17s 0.36/0.36MB (0.02 MB/s) movie_dataset_public_final/raw/survey_answers.json
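The `JSONDecodeError` above looks like a single malformed line in `ratings.json`. A minimal sketch of a line reader that skips and reports such lines instead of raising; `iter_jsonl` and its `on_error` parameter are hypothetical names for illustration, not the code in `readysetdata/utils.py`:

```python
import json
import sys


def iter_jsonl(lines, on_error=None):
    """Yield one parsed object per non-empty line, skipping lines that are not valid JSON."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as e:
            # Report the bad record and keep going, so one line does not end the table.
            if on_error is not None:
                on_error(lineno, e)
            else:
                print(f"skipping line {lineno}: {e}", file=sys.stderr)


if __name__ == "__main__":
    # Made-up sample data; the second line is missing a comma.
    sample = [
        '{"user_id": 1, "rating": 4}',
        '{"user_id": 2 "rating": 5}',
        '{"user_id": 3, "rating": 2}',
    ]
    for row in iter_jsonl(sample):
        print(row)
```

Whether silently dropping a bad rating is acceptable depends on the dataset; counting the skipped lines via `on_error` at least makes the loss visible.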
make wikipedia
File "/home/anja/git/readysetdata/scripts/parse-wikipedia.py", line 15, in <module>
File "/home/anja/git/readysetdata/readysetdata/output.py", line 16, in outputSingle
File "/home/anja/git/readysetdata/readysetdata/output.py", line 98, in output
File "/home/anja/git/readysetdata/readysetdata/output.py", line 99, in <listcomp>
File "/home/anja/git/readysetdata/readysetdata/jsonl.py", line 29, in output_jsonl
File "/home/anja/git/readysetdata/readysetdata/jsonl.py", line 9, in __init__
OSError: [Errno 24] Too many open files: 'output/wikipedia_infoboxes/hot_spring.jsonl'
Traceback (most recent call last):
File "/home/anja/git/readysetdata/scripts/xml2jsonl.py", line 58, in <module>
main()
File "/home/anja/git/readysetdata/scripts/xml2jsonl.py", line 55, in main
rdr.parse(sys.stdin)
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 111, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/xmlreader.py", line 125, in parse
self.feed(buffer)
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 217, in feed
self._parser.Parse(data, isFinal)
File "/opt/conda/conda-bld/python-split_1654083059479/work/Modules/pyexpat.c", line 461, in EndElement
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 336, in end_element
self._cont_handler.endElement(name)
File "/home/anja/git/readysetdata/scripts/xml2jsonl.py", line 44, in endElement
print(json.dumps(simplify(contents)), file=self.fp)
BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
File "/home/anja/git/readysetdata/scripts/download.py", line 11, in <module>
sys.stdout.buffer.write(r)
BrokenPipeError: [Errno 32] Broken pipe
make: *** [Makefile:21: wikipedia] Error 1
make wikipedia 3230.65s user 17.81s system 106% cpu 50:46.76 total
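The `Too many open files` error suggests one output file is held open per infobox type. Raising the limit with `ulimit -n` before running `make` is one workaround. Another is to cap the number of simultaneously open handles; a rough sketch, assuming the output is append-only JSONL. `FilePool` is a hypothetical class for illustration, not something in `readysetdata/jsonl.py`:

```python
import os
import tempfile
from collections import OrderedDict


class FilePool:
    """Keep at most max_open files open; reopen evicted ones in append mode."""

    def __init__(self, max_open=256):
        self.max_open = max_open
        self._open = OrderedDict()  # path -> file object, least recently used first
        self._seen = set()          # paths already created during this run

    def write(self, path, text):
        fp = self._open.get(path)
        if fp is None:
            if len(self._open) >= self.max_open:
                # Evict the least recently used handle to stay under the limit.
                _, old = self._open.popitem(last=False)
                old.close()
            # Truncate on first use, append on every later reopen.
            mode = "a" if path in self._seen else "w"
            fp = open(path, mode, encoding="utf-8")
            self._seen.add(path)
            self._open[path] = fp
        else:
            self._open.move_to_end(path)
        fp.write(text)

    def close(self):
        for fp in self._open.values():
            fp.close()
        self._open.clear()


if __name__ == "__main__":
    outdir = tempfile.mkdtemp()
    pool = FilePool(max_open=2)
    for name in ["hot_spring", "settlement", "person", "hot_spring"]:
        pool.write(os.path.join(outdir, name + ".jsonl"), '{"example": true}\n')
    pool.close()
    print(sorted(os.listdir(outdir)))
```

The trade-off is extra open/close calls when rows for many different infobox types are interleaved, so `max_open` should sit as close to the process limit as is comfortable.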
make wikipedia
3393s 482.54/21132.09MB (0.14 MB/s) enwiki-latest-pages-articles-multistream.xml.bz2
bunzip2: Compressed file ends unexpectedly;
perhaps it is corrupted? *Possible* reason follows.
bunzip2: Inappropriate ioctl for device
Input file = (stdin), output file = (stdout)
It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
[3392.4s] 66704Traceback (most recent call last):
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 217, in feed
self._parser.Parse(data, isFinal)
xml.parsers.expat.ExpatError: no element found: line 13647185, column 1107
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/anja/git/readysetdata/scripts/xml2jsonl.py", line 58, in <module>
main()
File "/home/anja/git/readysetdata/scripts/xml2jsonl.py", line 55, in main
rdr.parse(sys.stdin)
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 111, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/xmlreader.py", line 127, in parse
self.close()
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 240, in close
self.feed(b"", isFinal=True)
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 221, in feed
self._err_handler.fatalError(exc)
File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/handler.py", line 38, in fatalError
raise exception
xml.sax._exceptions.SAXParseException: <stdin>:13647185:1107: no element found
cd output/wikipedia-infoboxes && zip -n .arrow ../wikipedia-infoboxes.zip *.jsonl
/bin/sh: 1: cd: can't cd to output/wikipedia-infoboxes
make: *** [Makefile:22: wikipedia] Error 2
Fixed
`title.principals.tsv.gz` seems to have been momentarily corrupted. I made a PR that adds a try/except, so that at least the other tables get built: https://github.com/saulpw/readysetdata/pull/10

Edit: `title.principals.tsv.gz` unzipped fine with `gzip`.
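For reference, the general shape of the try/except approach, as a sketch only and not the actual change in the PR. `build_tables` and the name-to-callable mapping are hypothetical:

```python
import sys
import traceback


def build_tables(builders):
    """Run each table builder; record failures instead of aborting the run.

    builders maps a table name to a zero-argument callable (hypothetical interface).
    Returns the list of table names that failed.
    """
    failed = []
    for name, build in builders.items():
        try:
            build()
        except Exception:
            # Log the full traceback, then carry on with the remaining tables.
            print(f"failed to build {name}:", file=sys.stderr)
            traceback.print_exc()
            failed.append(name)
    return failed


if __name__ == "__main__":
    def build_ok():
        pass

    def build_corrupted():
        raise OSError("simulated corrupted download")

    failed = build_tables({
        "title.basics": build_ok,
        "title.principals": build_corrupted,
        "title.ratings": build_ok,
    })
    print("failed tables:", failed)
```

Returning the failed names lets the caller still exit non-zero at the end, so `make` notices that something went wrong even though the other tables were written.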