tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

cannot download babi data ("HTTP Error 403: Forbidden") #1206

Open rfernand2 opened 5 years ago

rfernand2 commented 5 years ago

Description

I consistently get an "HTTP Error 403: Forbidden" error when trying to download the "babi" dataset. It happens with both "t2t-datagen" and "t2t-trainer". A workaround is provided at the end of this issue.

Environment information

OS: Windows 10, version 1809

$ pip freeze | grep tensor
mesh-tensorflow==0.0.3
tensor2tensor==1.10.0
tensorboard==1.11.0
tensorflow==1.10.0

$ python -V
Python 3.5.5 :: Anaconda custom (64-bit)

For bugs: reproduction and error logs

# Steps to reproduce:
1. cd to where your tensor2tensor scripts are installed (in my case: C:\anaconda3\envs\tensorflow\Scripts)
2. python t2t-datagen  --problem=babi_qa_concat_task10_10k

Error logs:

(tensorflow) C:\anaconda3\envs\tensorflow\Scripts>python t2t-datagen  --problem=babi_qa_concat_task10_10k
c:\anaconda3\envs\tensorflow\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
WARNING:tensorflow:It is strongly recommended to specify --data_dir. Data will be written to default data_dir=C:\Users\rfernand\AppData\Local\Temp.
INFO:tensorflow:Generating problems:
    babi:
      * babi_qa_concat_task10_10k
INFO:tensorflow:Generating data for babi_qa_concat_task10_10k.
INFO:tensorflow:Downloading http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz to /tmp/t2t_datagen\tasks_1-20_v1-2.tar.gz
Traceback (most recent call last):
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\data_generators\generator_utils.py", line 215, in maybe_download
    tf.gfile.Copy(uri, filepath)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 397, in copy
    compat.as_bytes(oldpath), compat.as_bytes(newpath), overwrite, status)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnimplementedError: File system scheme 'http' not implemented (file: 'http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "t2t-datagen", line 28, in <module>
    tf.app.run()
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "t2t-datagen", line 23, in main
    t2t_datagen.main(argv)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\bin\t2t_datagen.py", line 198, in main
    generate_data_for_registered_problem(problem)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\bin\t2t_datagen.py", line 260, in generate_data_for_registered_problem
    problem.generate_data(data_dir, tmp_dir, task_id)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\data_generators\text_problems.py", line 296, in generate_data
    self.generate_encoded_samples(data_dir, tmp_dir, split)), paths)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\data_generators\generator_utils.py", line 155, in generate_files
    for case in generator:
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\data_generators\babi_qa.py", line 383, in generate_encoded_samples
    generator = self.generate_samples(data_dir, tmp_dir, dataset_split)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\data_generators\babi_qa.py", line 347, in generate_samples
    tmp_dir = _prepare_babi_data(tmp_dir, data_dir)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\data_generators\babi_qa.py", line 126, in _prepare_babi_data
    file_path = generator_utils.maybe_download(tmp_dir, _TAR, _URL)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\data_generators\generator_utils.py", line 220, in maybe_download
    uri, inprogress_filepath, reporthook=download_report_hook)
  File "c:\anaconda3\envs\tensorflow\lib\urllib\request.py", line 188, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "c:\anaconda3\envs\tensorflow\lib\urllib\request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "c:\anaconda3\envs\tensorflow\lib\urllib\request.py", line 472, in open
    response = meth(req, response)
  File "c:\anaconda3\envs\tensorflow\lib\urllib\request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "c:\anaconda3\envs\tensorflow\lib\urllib\request.py", line 510, in error
    return self._call_chain(*args)
  File "c:\anaconda3\envs\tensorflow\lib\urllib\request.py", line 444, in _call_chain
    result = func(*args)
  File "c:\anaconda3\envs\tensorflow\lib\urllib\request.py", line 590, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Issue Workaround

I found that replacing line 113 in the file "lib\site-packages\tensor2tensor\data_generators\babi_qa.py" with the following code fixed the problem:

  # Send Chrome's User-Agent to avoid "HTTP Error 403: Forbidden" when downloading datasets like "babi".
  use_workaround = True
  if use_workaround:
    file_path = os.path.join(tmp_dir, _TAR)
    if not os.path.exists(file_path):
      # Python 3: import the submodule explicitly; a bare "import urllib" does not reliably load it.
      import urllib.request
      opener = urllib.request.build_opener()
      opener.addheaders = [(
          'User-Agent',
          "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36")]
      # install_opener makes this opener the global default that urlretrieve uses.
      urllib.request.install_opener(opener)
      urllib.request.urlretrieve(_URL, file_path)
  else:
    file_path = generator_utils.maybe_download(tmp_dir, _TAR, _URL)

rfernand2 commented 5 years ago

Workaround V2

I noticed today that some downloads (like the Multi NLI dataset from https://www.nyu.edu/projects/bowman/multinli/multinli_1.0.zip) also require the following additional header (which Chrome sends as well):

('Accept', "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8")

So, the "opener.addheaders=..." line in the workaround should be replaced with:

opener.addheaders = [
    ('User-Agent',
     "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"),
    ('Accept', "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"),
]
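
For anyone who would rather not install a global opener, here is a minimal sketch of the same idea that attaches both headers to a single request. It assumes Python 3. The helper names (build_request, download_with_headers), the BROWSER_HEADERS dict, and the ".incomplete" suffix are hypothetical and are not part of tensor2tensor. I have not checked whether every server accepts these headers.

```python
import os
import shutil
import urllib.request

# Header values copied from the workaround comments in this thread.
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/70.0.3538.77 Safari/537.36"),
    "Accept": ("text/html,application/xhtml+xml,application/xml;q=0.9,"
               "image/webp,image/apng,*/*;q=0.8"),
}


def build_request(url, headers=None):
    """Hypothetical helper: return a Request carrying browser-like headers."""
    chosen = BROWSER_HEADERS if headers is None else headers
    return urllib.request.Request(url, headers=dict(chosen))


def download_with_headers(url, file_path, headers=None):
    """Hypothetical helper: download url to file_path unless it already exists."""
    if os.path.exists(file_path):
        return file_path
    # Write to a temporary name first so an interrupted download is not
    # mistaken for a complete file on the next run.
    partial_path = file_path + ".incomplete"
    with urllib.request.urlopen(build_request(url, headers)) as response, \
            open(partial_path, "wb") as out:
        shutil.copyfileobj(response, out)
    os.replace(partial_path, file_path)
    return file_path
```

Inside babi_qa.py this could be called as file_path = download_with_headers(_URL, os.path.join(tmp_dir, _TAR)), in place of the generator_utils.maybe_download call. Because the headers travel with the Request object, other urllib users in the same process are unaffected.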
artitw commented 5 years ago

This has been fixed in the following pull request: https://github.com/tensorflow/tensor2tensor/pull/1235