royorel / FFHQ-Aging-Dataset

FFHQ-Aging Dataset

Not all images are downloaded! #2

Closed farhodfm closed 3 years ago

farhodfm commented 3 years ago

Hi @royorel!

Thanks for sharing the full implementation and data!

I downloaded the data by following your instructions (using PyDrive), but then I realized that not all images were downloaded. There are only a few images in each subfolder ('00000' ~ '69000'). I am sorry if I missed something. Could you give instructions for downloading all the images, please?

Thanks once again

royorel commented 3 years ago

Hi @farhodfm

The only reason that can happen with the code is if you use the --debug flag. However, I assume that's not the case here.

In order to figure out what happened, can you please elaborate a little bit more?

  1. Did you get any error messages?
  2. How many images were downloaded overall?
  3. The script takes a while to run, so in case you were running it on a remote server, did your connection time out by any chance?
farhodfm commented 3 years ago

Thank you for replying, @royorel

You are right! The problem is not related to --debug, since it is set to False by default.

  1. Did you get any error messages?

As I remember (I am sorry, I cleaned up the overall process), there was one error stating Too many open files. Somehow, I did not pay attention to that as the deeplab model was working just fine.

  2. How many images were downloaded overall?

Every subfolder contains a different number of images; some have fewer than 20. Overall, the 70 subfolders of the ffhq_aging256x256 folder contain more than 700 images in total.
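For reference, a quick sketch for counting what actually landed on disk, per subfolder (assuming FFHQ's .png files; adjust the pattern if your images use another extension):

```python
from pathlib import Path

def count_images(root, pattern="*.png"):
    """Map each immediate subfolder name to its number of matching files."""
    root = Path(root)
    return {d.name: sum(1 for _ in d.glob(pattern))
            for d in sorted(root.iterdir()) if d.is_dir()}

# Example usage:
# counts = count_images("ffhq_aging256x256")
# print(sum(counts.values()), "images total")
```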

  3. The script takes a while to run, so in case you were running it on a remote server, did your connection time out by any chance?

I don't think it is a timeout problem, since I downloaded the dataset twice and could not get the full dataset either time.

I just tried to call get_ffhq_aging.sh again to get a full running log, but got a 403 error, so I will try again tomorrow.

If you have any thoughts, I would be glad to discuss them with you.

royorel commented 3 years ago

@farhodfm Can you recreate the error message and share it here? That would give me a hint at what went wrong...

farhodfm commented 3 years ago

@royorel, I tried to download it again.

Below, I attached the file where you can see the error: log.txt

royorel commented 3 years ago

It seems the error comes from Python's multithreading library: your machine has a limit on the number of files that can be open in parallel.

Two possible solutions:

  1. Increase the maximum number of open files, see: https://stackoverflow.com/questions/39537731/errno-24-too-many-open-files-but-i-am-not-opening-files
  2. Decrease the number of threads passed to download_ffhq_aging.py: in get_ffhq_aging.sh, add the --num_threads flag to the download_ffhq_aging.py call with the number of threads to open (the default is 32, so try a lower number).
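For the first option, besides the shell-level `ulimit` fix in the linked answer, the soft limit can also be raised from inside Python before the download starts. A sketch (the `resource` module is Unix-only, and a non-root process can only raise the soft limit up to the hard limit):

```python
import resource

# Current (soft, hard) limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"before: soft={soft}, hard={hard}")

# Raise the soft limit toward 4096, capped at the hard limit
# (RLIM_INFINITY means "unlimited"; never lower an already-higher limit).
if soft != resource.RLIM_INFINITY and soft < 4096:
    new_soft = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

print("after:", resource.getrlimit(resource.RLIMIT_NOFILE))
```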

Please let me know if this helps.

farhodfm commented 3 years ago

@royorel

  1. I increased the maximum number of open files from 1024 to 4096.
  2. I decreased the number of threads from 32 to 8.

However, I got the following output:

Authentication successful.
authorized access to google drive API!
Downloading JSON metadata...
\ done processing 1/2 filesTraceback (most recent call last):
  File "download_ffhq_aging.py", line 374, in <module>
    run_cmdline(sys.argv)
  File "download_ffhq_aging.py", line 369, in run_cmdline
    run(**vars(args))
  File "download_ffhq_aging.py", line 333, in run
    download_files([json_spec, license_specs['json']], drive=drive, **download_kwargs)
  File "download_ffhq_aging.py", line 209, in download_files
    raise exc_info[1].with_traceback(exc_info[2])
  File "download_ffhq_aging.py", line 219, in _download_thread
    pydrive_utils.pydrive_download(drive, spec['file_url'], spec['file_path'])
  File "/home/farhod/Documents/FFHQ-Aging-Dataset/pydrive_utils.py", line 40, in pydrive_download
    pydrive_file.GetContentFile(save_path)
  File "/home/farhod/anaconda3/envs/pytorch/lib/python3.6/site-packages/pydrive/files.py", line 210, in GetContentFile
    self.FetchContent(mimetype, remove_bom)
  File "/home/farhod/anaconda3/envs/pytorch/lib/python3.6/site-packages/pydrive/files.py", line 43, in _decorated
    return decoratee(self, *args, **kwargs)
  File "/home/farhod/anaconda3/envs/pytorch/lib/python3.6/site-packages/pydrive/files.py", line 255, in FetchContent
    self.content = io.BytesIO(self._DownloadFromUrl(download_url))
  File "/home/farhod/anaconda3/envs/pytorch/lib/python3.6/site-packages/pydrive/auth.py", line 75, in _decorated
    return decoratee(self, *args, **kwargs)
  File "/home/farhod/anaconda3/envs/pytorch/lib/python3.6/site-packages/pydrive/files.py", line 505, in _DownloadFromUrl
    raise ApiRequestError('Cannot download file: %s' % resp)
pydrive.files.ApiRequestError: Cannot download file: {'x-guploader-uploadid': 'ABg5-UxUB1kGn4T5glfP_f-xQ__oNXit0o15UJMVoEq1UFND3ct_skVuSlU8jJfgD4F_kLB60xAHyYYgsWFDawGsLNs', 'vary': 'Origin, X-Origin', 'content-type': 'application/json; charset=UTF-8', 'date': 'Fri, 09 Oct 2020 16:37:28 GMT', 'expires': 'Fri, 09 Oct 2020 16:37:28 GMT', 'cache-control': 'private, max-age=0', 'content-length': '320', 'server': 'UploadServer', 'alt-svc': 'h3-Q050=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"', 'status': '403'}
Traceback (most recent call last):
  File "run_deeplab.py", line 91, in <module>
    main()
  File "run_deeplab.py", line 44, in main
    assert os.path.isdir(dataset_root)
AssertionError
royorel commented 3 years ago

It seems like you got a 403 error once again (see info in the line that starts with pydrive.files.ApiRequestError).

This error comes from the Google Drive API. Right now it seems the quota for downloading the JSON file was exceeded. I just tried to download the file manually from the Google Drive web interface and got the same error (screenshot attached).

In that case, the PyDrive interface won't work either, and the only solution is to wait.

PyDrive is useful when the regular script errors out but you are still able to download the files manually from the Google Drive web interface.

farhodfm commented 3 years ago

I gave it another try, and this is the result.

Authentication successful.
authorized access to google drive API!
Downloading JSON metadata...
/ done processing 2/2 files
Parsing JSON metadata...
Downloading 70001 files...
- done processing 3894/70001 filesTraceback (most recent call last):
  File "/home/farhod/anaconda3/envs/pytorch/lib/python3.6/site-packages/pydrive/files.py", line 237, in FetchMetadata
    .execute(http=self.http)
  File "/home/farhod/anaconda3/envs/pytorch/lib/python3.6/site-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/home/farhod/anaconda3/envs/pytorch/lib/python3.6/site-packages/googleapiclient/http.py", line 907, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 500 when requesting https://www.googleapis.com/drive/v2/files/1aMCLSu17QL1K50o6RepCu3udocQKwD6a?alt=json returned "Internal Error">

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "download_ffhq_aging.py", line 374, in <module>
    run_cmdline(sys.argv)
  File "download_ffhq_aging.py", line 369, in run_cmdline
    run(**vars(args))
  File "download_ffhq_aging.py", line 348, in run
    download_files(specs, dst_dir, output_size, drive=drive, **download_kwargs)
  File "download_ffhq_aging.py", line 209, in download_files
    raise exc_info[1].with_traceback(exc_info[2])
  File "download_ffhq_aging.py", line 219, in _download_thread
    pydrive_utils.pydrive_download(drive, spec['file_url'], spec['file_path'])
  File "/home/farhod/Documents/FFHQ-Aging-Dataset/pydrive_utils.py", line 40, in pydrive_download
    pydrive_file.GetContentFile(save_path)
  File "/home/farhod/anaconda3/envs/pytorch/lib/python3.6/site-packages/pydrive/files.py", line 210, in GetContentFile
    self.FetchContent(mimetype, remove_bom)
  File "/home/farhod/anaconda3/envs/pytorch/lib/python3.6/site-packages/pydrive/files.py", line 42, in _decorated
    self.FetchMetadata()
  File "/home/farhod/anaconda3/envs/pytorch/lib/python3.6/site-packages/pydrive/auth.py", line 75, in _decorated
    return decoratee(self, *args, **kwargs)
  File "/home/farhod/anaconda3/envs/pytorch/lib/python3.6/site-packages/pydrive/files.py", line 239, in FetchMetadata
    raise ApiRequestError(error)
pydrive.files.ApiRequestError: <HttpError 500 when requesting https://www.googleapis.com/drive/v2/files/1aMCLSu17QL1K50o6RepCu3udocQKwD6a?alt=json returned "Internal Error">
processed 1/3894 images
processed 2/3894 images
processed 3/3894 images
processed 4/3894 images
processed 5/3894 images
..........................
processed 3890/3894 images
processed 3891/3894 images
processed 3892/3894 images
processed 3893/3894 images
processed 3894/3894 images

PyDrive is useful when you get an error from the regular script but you're able to download the files manually from the google drive web interface.

Do you mean I should download the original FFHQ and then modify download_ffhq_aging.py (skip the downloading, but run the further pre-processing)?

royorel commented 3 years ago

@farhodfm, I googled "pydrive.files.ApiRequestError: <HttpError 500 when requesting". The results seem to indicate that this is some sort of error on the servers that host the files; it has nothing to do with the download code. I think you should retry the download and see if the problem persists, as I suspect it was something temporary. The download script is designed to continue from the place where it stopped.
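Since an HTTP 500 here is a transient server-side failure, one way to ride it out is to retry each download with exponential backoff. A sketch with a generic helper (wiring it into pydrive_utils.pydrive_download is an assumption, not something the repo does today):

```python
import time

def with_retries(fn, attempts=5, base_delay=1.0, transient=(Exception,)):
    """Call fn(), retrying with exponential backoff on transient errors.

    Re-raises the last error once all attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except transient:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demo with a stand-in that fails twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("HTTP 500: Internal Error")
    return "downloaded"

result = with_retries(flaky, attempts=5, base_delay=0.01, transient=(OSError,))
print(result)  # -> downloaded
```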

What I meant is that PyDrive emulates whatever you get when using the web interface. If you get a quota exceeded error (error 403) in the web interface, you will also get it with PyDrive. If you can download the file manually, you will also be able to download it with PyDrive.

That is not the case when using the default setup without PyDrive, where quota exceeded errors are returned more often.
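For context on what pydrive_utils.pydrive_download does with spec['file_url']: PyDrive addresses a file by its id, so the Drive URL has to be reduced to that id first. A minimal sketch of such parsing (the repo's actual logic may differ; the URL shapes below are common Drive patterns, and the id is the one from the traceback above):

```python
from urllib.parse import urlparse, parse_qs

def drive_file_id(url):
    """Extract the file id from common Google Drive URL shapes."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    if "id" in query:                  # .../uc?id=<id> or .../open?id=<id>
        return query["id"][0]
    parts = [p for p in parsed.path.split("/") if p]
    if "d" in parts:                   # .../file/d/<id>/view
        return parts[parts.index("d") + 1]
    raise ValueError(f"no Drive file id in {url!r}")

print(drive_file_id(
    "https://drive.google.com/uc?id=1aMCLSu17QL1K50o6RepCu3udocQKwD6a"))
```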

farhodfm commented 3 years ago

@royorel

I continued the download from the point where it stopped and got 377 more images, then hit the same problem again. I thought I could keep resuming, but the 403 "Quota Exceeded" error is always there. I know this problem is not related to your code and the only way is to wait. So, thank you for helping.