tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Wikisum generation fails with Python 3.7 #1793

Open dennlinger opened 4 years ago

dennlinger commented 4 years ago

Description

When trying to execute the CommonCrawl generator script (tensor2tensor/data_generators/wikisum/get_references_commoncrawl.py), I run into several issues that point towards compatibility problems with TF 2.1.0 and T2T 1.15.4 (see below).

I am aware of the more recent switch to import tensorflow.compat.v1 as tf and changed my script accordingly, but I still run into issues. Generally, the cause seems to be a mismatch between str and bytes objects. I get the same issue when I download the data from the GCP bucket to a local folder and run the script again with the correct input arguments.

Note that fixing this error alone does make the script run, but unfortunately it still does not download any data, since the provided CommonCrawl paths are stored in a dictionary with a similar issue (existing entries have a str key, but the dict is queried with bytes-like objects, resulting in no matches found).
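
The dictionary problem can be reproduced in isolation. The following is a minimal sketch, not the actual wikisum code: the mapping paths_by_url, its contents, and the helper to_str are hypothetical names chosen for illustration. It only shows that, on Python 3, a bytes key never matches a str key, and that decoding before the lookup is one possible way around it.

```python
# Hypothetical mapping that stands in for the dictionary of CommonCrawl paths.
# The key and value below are made up for illustration.
paths_by_url = {"http://example.com/a": "crawl-data/segment-0/file.warc.wet.gz"}

key_bytes = b"http://example.com/a"

# On Python 3, bytes and str never compare equal, so the lookup silently misses.
print(key_bytes in paths_by_url)


def to_str(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is a bytes object."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# After decoding, the same key is found.
print(to_str(key_bytes) in paths_by_url)
print(paths_by_url[to_str(key_bytes)])
```

Because the failed lookup raises no exception, this kind of mismatch shows up only as "no matches found", which is consistent with the script running but downloading nothing.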

I'm happy to provide a PR if this is indeed an issue that merits fixing, and would be thankful for any pointers on how to avoid this issue when running on Python 3 in general. Unfortunately I cannot execute the process on GCP as is suggested in the wikisum docs, and am planning to execute on my own cluster. ...

Environment information

OS: Ubuntu 18.04

$ pip freeze | grep tensor
mesh-tensorflow==0.1.10
tensor2tensor==1.15.4
tensorboard==2.1.0
tensorboardX==2.0
tensorflow-datasets==2.0.0
tensorflow-estimator==2.1.0
tensorflow-gan==2.0.0
tensorflow-gpu==2.1.0
tensorflow-hub==0.7.0
tensorflow-metadata==0.21.1
tensorflow-probability==0.7.0

$ python -V
Python 3.7.4

For bugs: reproduction and error logs

# Steps to reproduce:
python3 get_references_commoncrawl.py --num_tasks 1 --task_id 0 --out_dir /home/dennis/wikisum/out/
# Error logs:
File "/home/.../tensor2tensor/data_generators/wikisum/utils.py", line 141, in wet_download_urls
    download_path = S3_HTTP_PREFIX + path[:-1]
TypeError: can only concatenate str (not "bytes") to str

Process finished with exit code 1
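
For reference, the TypeError above can be triggered without TensorFlow at all. The sketch below assumes that path arrives as a bytes line with a trailing newline. The value of S3_HTTP_PREFIX is a placeholder, and build_download_url is a hypothetical helper, not a function from tensor2tensor.

```python
# Placeholder value (an assumption); the real constant is defined in utils.py.
S3_HTTP_PREFIX = "https://example.com/"

# Reproduce the failure: concatenating str with bytes raises TypeError on Python 3.
try:
    S3_HTTP_PREFIX + b"crawl-data/file.warc.wet.gz\n"[:-1]
except TypeError as err:
    print("TypeError:", err)


def build_download_url(path, prefix=S3_HTTP_PREFIX):
    """Hypothetical helper: join the prefix with one line from a path listing.

    Accepts either bytes or str and always returns str.
    """
    if isinstance(path, bytes):
        path = path.decode("utf-8")
    # Strip only a trailing newline, rather than always dropping the last character.
    return prefix + path.rstrip("\n")


print(build_download_url(b"crawl-data/file.warc.wet.gz\n"))
print(build_download_url("crawl-data/file.warc.wet.gz\n"))
```

Decoding once, at the point where the lines are read, keeps the rest of the code working on str only. Using rstrip("\n") instead of path[:-1] is a design choice of this sketch, so that a final line without a newline is not truncated.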

shahbazsyed commented 4 years ago

@dennlinger Any luck creating the dataset?

dennlinger commented 4 years ago

Unfortunately, there seemed to be no interest, and the effort would be quite big. There are other repositories, though, that seem to offer a "direct download" option, see here, which is enough for my purposes.