monocongo / openimages

Tools for downloading images and annotations from Google's OpenImages dataset.
MIT License

Error when downloading from S3 #14

Open ark- opened 3 years ago

ark- commented 3 years ago

I'm getting the following exception when using

_download_images_by_id(image_ids_to_get, "train", destination_folder)

where destination_folder is a path to an existing folder and image_ids_to_get is a list of IDs, e.g. ["2a1d31d9e9bd6c85", "2b8009fb25d3403e"].

If I restart where I left off (omitting IDs already downloaded), it continues for another 100-900 images and then fails again. If I keep rerunning my script, all images eventually download, which suggests they all exist in S3.
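For anyone reproducing the "start again where I left off" workaround, a minimal sketch of the filtering step might look like the following. The helper name `remaining_image_ids` is hypothetical and not part of the openimages package. It assumes each image is saved as `<image_id>.jpg` in the destination folder. It also treats empty files as not downloaded, as a precaution in case a failed download left a zero-byte file behind.

```python
from pathlib import Path


def remaining_image_ids(image_ids, destination_folder, extension=".jpg"):
    """Return the IDs that do not yet have a non-empty file in destination_folder.

    Hypothetical helper: assumes files are named "<image_id><extension>".
    """
    folder = Path(destination_folder)
    # Stems of non-empty files already on disk,
    # e.g. "2a1d31d9e9bd6c85" for "2a1d31d9e9bd6c85.jpg".
    already_downloaded = {
        path.stem
        for path in folder.glob(f"*{extension}")
        if path.is_file() and path.stat().st_size > 0
    }
    # Preserve the caller's ordering of the IDs that are still missing.
    return [image_id for image_id in image_ids if image_id not in already_downloaded]
```

The result can then be passed in place of image_ids_to_get on the next run, so only the missing images are requested again.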

Is S3 rate limiting? Has there been a breaking boto update?

  1%|▊                                                                             | 111/10069 [00:07<11:09, 14.88it/s]
Traceback (most recent call last):
  File "virtualenv\lib\site-packages\openimages\download.py", line 302, in _download_images_by_id
    list(tqdm(executor.map(_download_single_image, download_args_list),
  File "virtualenv\lib\site-packages\tqdm\std.py", line 1166, in __iter__
    for obj in iterable:
  File "c:\apps\python39\lib\concurrent\futures\_base.py", line 600, in result_iterator
    yield fs.pop().result()
  File "c:\apps\python39\lib\concurrent\futures\_base.py", line 433, in result
    return self.__get_result()
  File "c:\apps\python39\lib\concurrent\futures\_base.py", line 389, in __get_result
    raise self._exception
  File "c:\apps\python39\lib\concurrent\futures\thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "virtualenv\lib\site-packages\openimages\download.py", line 761, in _download_single_image
    arguments["s3_client"].download_fileobj(
  File "virtualenv\lib\site-packages\boto3\s3\inject.py", line 678, in download_fileobj
    return future.result()
  File "virtualenv\lib\site-packages\s3transfer\futures.py", line 106, in result
    return self._coordinator.result()
  File "virtualenv\lib\site-packages\s3transfer\futures.py", line 265, in result
    raise self._exception
  File "virtualenv\lib\site-packages\s3transfer\tasks.py", line 255, in _main
    self._submit(transfer_future=transfer_future, **kwargs)
  File "virtualenv\lib\site-packages\s3transfer\download.py", line 340, in _submit
    response = client.head_object(
  File "virtualenv\lib\site-packages\botocore\client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "virtualenv\lib\site-packages\botocore\client.py", line 676, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
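The traceback shows that a single exception raised inside executor.map propagates out and aborts the whole batch. One possible way to make a run more tolerant of intermittent errors is sketched below, as an illustration rather than a fix to the package. Each ID is retried a few times with exponential backoff, and the IDs that still fail are collected instead of stopping everything. The names `download_all` and `download_one` are hypothetical. `download_one` stands in for whatever performs a single download, and the broad `except Exception` could be narrowed to `botocore.exceptions.ClientError` when boto3 is in use.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(image_ids, download_one, max_attempts=3, base_delay=0.5, max_workers=5):
    """Call download_one(image_id) for every ID, retrying failures.

    Returns a dict {image_id: exception} for the IDs that failed on every attempt.
    download_one is a hypothetical callable supplied by the caller.
    """

    def attempt(image_id):
        for attempt_number in range(1, max_attempts + 1):
            try:
                return download_one(image_id)
            except Exception:
                # Out of attempts: let the future carry the last exception.
                if attempt_number == max_attempts:
                    raise
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                time.sleep(base_delay * 2 ** (attempt_number - 1))

    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(attempt, image_id): image_id for image_id in image_ids}
        for future in as_completed(futures):
            error = future.exception()
            if error is not None:
                # Record the failure and keep going instead of aborting the batch.
                failures[futures[future]] = error
    return failures
```

If the returned dict is empty, every image downloaded. Otherwise its keys are the IDs worth inspecting or retrying later, and its values show whether the same 404 keeps recurring for them.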

monocongo commented 3 years ago

I no longer use or maintain this package. If you can work out the issue and fix the code, a PR would be welcome. I apologize that I can't offer more than that. Please let me know if this turns out to be a bug in this code rather than something upstream, such as the S3 throttling you mentioned above.