impresso / impresso-pycommons

Python module with reusable code (objects, functions) shared across impresso.
http://impresso-pycommons.rtfd.io/
GNU Affero General Public License v3.0

[rebuild] process fails when issue has empty pages #62

Open mromanello opened 3 years ago

mromanello commented 3 years ago

Example: issue oecaen-1914-12-02-a from the BNF data.

Extent:

About 18 issues of oecaen are affected (as of 01-09-2020).

Complete log

Uploading 8 rebuilt bz2files to canonical-rebuilt-testing
Processing batch 9/11 [{'oecaen': [1912, 1943]}]% Completed | 22.3s
Processing year 1912
Retrieving issues...
Fleshing out articles by issue...
Number of partitions: 97
Skipped articles: []
done.
Processing year 1913
Retrieving issues...
Fleshing out articles by issue...
Number of partitions: 117
Skipped articles: []
done.
Processing year 1914
Retrieving issues...
Fleshing out articles by issue...
Number of partitions: 117
  File "impresso_commons/text/rebuilder.py", line 703, in main
    filter_language=languages
  File "impresso_commons/text/rebuilder.py", line 541, in rebuild_issues
    .pluck('id')\
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/dask/base.py", line 175, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/dask/base.py", line 446, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/distributed/client.py", line 2510, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/distributed/client.py", line 1812, in gather
    asynchronous=asynchronous,
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/distributed/client.py", line 753, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/distributed/utils.py", line 337, in sync
    six.reraise(*error[0])
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/distributed/utils.py", line 322, in f
    result[0] = yield future
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/distributed/client.py", line 1668, in _gather
    six.reraise(type(exception), exception, traceback)
  File "/home/romanell/.pyenv/versions/impresso-pycommons/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/home/romanell/.pyenv/versions/3.6.0/envs/impresso-pycommons/lib/python3.6/site-packages/impresso_commons/text/helpers.py", line 70, in read_issue_pages
    for page in alternative_read_text(filename, IMPRESSO_STORAGEOPT)
  File "/home/romanell/.pyenv/versions/3.6.0/envs/impresso-pycommons/lib/python3.6/site-packages/impresso_commons/utils/s3.py", line 443, in alternative_read_text
    with s_open(s3_key, 'r', transport_params=transport_params) as infile:
  File "/home/romanell/.pyenv/versions/3.6.0/envs/impresso-pycommons/lib/python3.6/site-packages/smart_open/smart_open_lib.py", line 348, in open
    binary, filename = _open_binary_stream(uri, binary_mode, transport_params)
  File "/home/romanell/.pyenv/versions/3.6.0/envs/impresso-pycommons/lib/python3.6/site-packages/smart_open/smart_open_lib.py", line 556, in _open_binary_stream
    return _s3_open_uri(parsed_uri, mode, transport_params), filename
  File "/home/romanell/.pyenv/versions/3.6.0/envs/impresso-pycommons/lib/python3.6/site-packages/smart_open/smart_open_lib.py", line 628, in _s3_open_uri
    return smart_open_s3.open(parsed_uri.bucket_id, parsed_uri.key_id, mode, **kwargs)
  File "/home/romanell/.pyenv/versions/3.6.0/envs/impresso-pycommons/lib/python3.6/site-packages/smart_open/s3.py", line 117, in open
    resource_kwargs=resource_kwargs,
  File "/home/romanell/.pyenv/versions/3.6.0/envs/impresso-pycommons/lib/python3.6/site-packages/smart_open/s3.py", line 345, in __init__
    'or is forbidden for access' % (key, bucket)
'oecaen/pages/oecaen-1914/oecaen-1914-12-02-a-pages.jsonl.bz2' does not exist in the bucket 'original-canonical-staging', or is forbidden for access
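The trace above ends in `read_issue_pages` → `alternative_read_text` → `smart_open`, which fails because the pages archive for the issue is missing from the bucket (or not accessible). One possible guard, shown here only as a minimal sketch, is to skip such issues and report them instead of aborting the whole batch. Everything in it is an assumption rather than the actual impresso-pycommons code: the helper names `pages_key`, `read_pages_or_skip` and `fake_reader` are hypothetical, the key layout is inferred from the single key in the error message, and the reader is assumed to raise an `OSError` (the trace does not show the exception type).

```python
from typing import Callable, Dict, Iterable, List, Tuple


def pages_key(issue_id: str) -> str:
    """Build the S3 key of an issue's pages archive from its canonical ID.

    The layout is inferred from the key in the error message above and
    assumes the newspaper acronym itself contains no hyphen.
    """
    newspaper, year = issue_id.split("-")[:2]
    return f"{newspaper}/pages/{newspaper}-{year}/{issue_id}-pages.jsonl.bz2"


def read_pages_or_skip(
    issue_ids: Iterable[str],
    read_lines: Callable[[str], Iterable[str]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Read the pages of each issue, skipping issues whose archive is unreadable.

    `read_lines` is a hypothetical stand-in for whatever actually reads the
    archive; it is assumed to raise OSError when the key is missing.
    """
    found: Dict[str, List[str]] = {}
    skipped: List[str] = []
    for issue_id in issue_ids:
        try:
            found[issue_id] = list(read_lines(pages_key(issue_id)))
        except OSError:
            # Record the issue and carry on with the rest of the batch.
            skipped.append(issue_id)
    return found, skipped


def fake_reader(key: str) -> List[str]:
    """In-memory stand-in for the bucket, with made-up content for the demo."""
    store = {
        pages_key("oecaen-1914-12-01-a"): ['{"id": "oecaen-1914-12-01-a-p0001"}'],
    }
    if key not in store:
        raise OSError(f"{key!r} does not exist in the bucket, or is forbidden for access")
    return store[key]


found, skipped = read_pages_or_skip(
    ["oecaen-1914-12-01-a", "oecaen-1914-12-02-a"], fake_reader
)
print("Skipped issues:", skipped)
```

Collecting the skipped IDs keeps the problem visible in the log while the other issues of the year are still rebuilt. Whether skipping is the right behaviour, rather than rebuilding the issue with no pages, is a separate decision.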