rfernand2 commented 5 years ago

Description

I consistently get an "HTTP Error 403: Forbidden" error when trying to download the "babi" dataset. It happens with both "t2t-datagen" and "t2t-trainer". A workaround is provided at the end of this issue.

Environment information

OS: Windows 10, version 1809

$ pip freeze | grep tensor
mesh-tensorflow==0.0.3
tensor2tensor==1.10.0
tensorboard==1.11.0
tensorflow==1.10.0

$ python -V
Python 3.5.5 :: Anaconda custom (64-bit)

For bugs: reproduction and error logs

# Steps to reproduce:
1. cd to where your tensor2tensor scripts are installed (in my case: C:\anaconda3\envs\tensorflow\Scripts)
2. python t2t-datagen  --problem=babi_qa_concat_task10_10k

Error logs:

(tensorflow) C:\anaconda3\envs\tensorflow\Scripts>python t2t-datagen  --problem=babi_qa_concat_task10_10k
c:\anaconda3\envs\tensorflow\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
WARNING:tensorflow:It is strongly recommended to specify --data_dir. Data will be written to default data_dir=C:\Users\rfernand\AppData\Local\Temp.
INFO:tensorflow:Generating problems:
    babi:
      * babi_qa_concat_task10_10k
INFO:tensorflow:Generating data for babi_qa_concat_task10_10k.
INFO:tensorflow:Downloading http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz to /tmp/t2t_datagen\tasks_1-20_v1-2.tar.gz
Traceback (most recent call last):
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\data_generators\generator_utils.py", line 215, in maybe_download
    tf.gfile.Copy(uri, filepath)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 397, in copy
    compat.as_bytes(oldpath), compat.as_bytes(newpath), overwrite, status)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnimplementedError: File system scheme 'http' not implemented (file: 'http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "t2t-datagen", line 28, in <module>
    tf.app.run()
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "t2t-datagen", line 23, in main
    t2t_datagen.main(argv)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\bin\t2t_datagen.py", line 198, in main
    generate_data_for_registered_problem(problem)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\bin\t2t_datagen.py", line 260, in generate_data_for_registered_problem
    problem.generate_data(data_dir, tmp_dir, task_id)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\data_generators\text_problems.py", line 296, in generate_data
    self.generate_encoded_samples(data_dir, tmp_dir, split)), paths)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\data_generators\generator_utils.py", line 155, in generate_files
    for case in generator:
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\data_generators\babi_qa.py", line 383, in generate_encoded_samples
    generator = self.generate_samples(data_dir, tmp_dir, dataset_split)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\data_generators\babi_qa.py", line 347, in generate_samples
    tmp_dir = _prepare_babi_data(tmp_dir, data_dir)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\data_generators\babi_qa.py", line 126, in _prepare_babi_data
    file_path = generator_utils.maybe_download(tmp_dir, _TAR, _URL)
  File "c:\anaconda3\envs\tensorflow\lib\site-packages\tensor2tensor\data_generators\generator_utils.py", line 220, in maybe_download
    uri, inprogress_filepath, reporthook=download_report_hook)
  File "c:\anaconda3\envs\tensorflow\lib\urllib\request.py", line 188, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "c:\anaconda3\envs\tensorflow\lib\urllib\request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "c:\anaconda3\envs\tensorflow\lib\urllib\request.py", line 472, in open
    response = meth(req, response)
  File "c:\anaconda3\envs\tensorflow\lib\urllib\request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "c:\anaconda3\envs\tensorflow\lib\urllib\request.py", line 510, in error
    return self._call_chain(*args)
  File "c:\anaconda3\envs\tensorflow\lib\urllib\request.py", line 444, in _call_chain
    result = func(*args)
  File "c:\anaconda3\envs\tensorflow\lib\urllib\request.py", line 590, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Issue Workaround

I found that the following code, replacing line 113 in the file "lib\site-packages\tensor2tensor\data_generators\babi_qa.py", fixed the problem:

  # Use Chrome's User-Agent string to avoid "HTTP Error 403: Forbidden" errors when downloading datasets like "babi".
  use_workaround = True
  if use_workaround:
    file_path = os.path.join(tmp_dir, _TAR)
    if not os.path.exists(file_path):
      # "import urllib" alone does not guarantee that the urllib.request submodule is loaded.
      import urllib.request
      opener = urllib.request.build_opener()
      opener.addheaders = [
        ('User-Agent',
         "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36")]
      # install_opener makes this opener the default for later urlopen/urlretrieve calls.
      urllib.request.install_opener(opener)
      urllib.request.urlretrieve(_URL, file_path)
  else:
    file_path = generator_utils.maybe_download(tmp_dir, _TAR, _URL)
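
A possible explanation for why the User-Agent matters (my assumption, not something confirmed by the server's owners): urllib identifies itself as "Python-urllib/<version>" by default, and some servers appear to reject that signature. A small sketch to inspect the default header that a fresh opener would send:

```python
import urllib.request

# A freshly built opener carries urllib's default headers.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The only default entry is the User-agent header, which names Python's urllib.
print(default_headers)
```

If the printed value starts with "Python-urllib/", that is the signature the server sees unless the header is overridden as in the workaround above.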

rfernand2 commented 5 years ago

Workaround V2

I noticed today that some downloads (like Multi NLI dataset from https://www.nyu.edu/projects/bowman/multinli/multinli_1.0.zip) require the following additional header (also used by Chrome):

('Accept', "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8")

So, the "opener.addheaders=..." line in the workaround should be replaced with:

opener.addheaders = [
    ('User-Agent',
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"),
    ('Accept', "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"),
]
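
For anyone who would rather not call install_opener (which changes the default opener for the whole process), here is a minimal standalone sketch of the same idea that attaches the headers to a single request instead. The names BROWSER_HEADERS, build_request and download_with_headers are hypothetical helpers of mine, not part of tensor2tensor, and the ".incomplete" suffix is just an illustrative choice for the temporary file:

```python
import os
import shutil
import urllib.request

# Header values copied from the workaround above.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
    ),
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/webp,image/apng,*/*;q=0.8"
    ),
}


def build_request(url, headers=BROWSER_HEADERS):
    """Return a Request carrying the given headers (hypothetical helper)."""
    return urllib.request.Request(url, headers=dict(headers))


def download_with_headers(url, file_path, headers=BROWSER_HEADERS):
    """Download url to file_path unless the file already exists (hypothetical helper)."""
    if os.path.exists(file_path):
        return file_path
    inprogress_path = file_path + ".incomplete"
    request = build_request(url, headers)
    # Stream the response to a temporary file first, then rename it, so an
    # interrupted download never leaves a partial file at file_path.
    with urllib.request.urlopen(request) as response, \
            open(inprogress_path, "wb") as out:
        shutil.copyfileobj(response, out)
    os.replace(inprogress_path, file_path)
    return file_path
```

In babi_qa.py this could stand in for the urlretrieve call, e.g. download_with_headers(_URL, os.path.join(tmp_dir, _TAR)). I have only sketched it here, so treat it as a starting point rather than a tested patch.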

artitw commented 5 years ago

This has been fixed in the following pull request: https://github.com/tensorflow/tensor2tensor/pull/1235

cannot download babi data ("HTTP Error 403: Forbidden") #1206