Open dennlinger opened 4 years ago
@dennlinger Any luck creating the dataset?
Unfortunately, there seemed to be no interest, and the effort would be quite big. There are other repositories, though, that seem to offer a "direct download" option, see here, which is enough for my purposes.
Description
When trying to execute the CommonCrawl generator script (`tensor2tensor/data_generators/wikisum/get_references_commoncrawl.py`), I run into several issues that point towards compatibility problems with TF 2.1.0 and T2T 1.15.4 (see below).

I am aware of the more recent switch to `import tensorflow.compat.v1 as tf` and changed my script accordingly, but still run into issues. Generally, it seems to be a mismatch between `str` and binary-type objects. I get the same issue when I download the data from the GCP bucket to a local folder and run again with correct input arguments.

Note that fixing said error message alone actually makes the script run, but unfortunately it still does not download any data, since the provided CommonCrawl paths are stored in a dictionary with a similar issue (existing entries have a `str` key, but the dict is queried with byte-like objects, resulting in no matches found).

I'm happy to provide a PR if this is indeed an issue that merits fixing, and would be thankful for any pointers on how to avoid this issue when running on Python 3 in general. Unfortunately I cannot execute the process on GCP as is suggested in the wikisum docs, and am planning to execute on my own cluster. ...
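To illustrate the dictionary problem in isolation, here is a minimal sketch in plain Python 3. It is not taken from the T2T code: the helper `to_text`, the dictionary `paths_by_url`, and the example URL and path are all hypothetical stand-ins, and I am assuming the values involved are UTF-8 encoded.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as str, decoding it first if it is bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical stand-in for the URL -> CommonCrawl path mapping; keys are str.
paths_by_url = {"http://example.com/a": "crawl-data/segment-0/file.warc.gz"}

# A lookup key as it might arrive from a binary read: bytes, not str.
query = b"http://example.com/a"

# In Python 3, bytes and str never compare equal, so this lookup silently misses.
missed = paths_by_url.get(query)

# Normalizing the key to str before the lookup restores the match.
found = paths_by_url.get(to_text(query))
```

Normalizing once at the boundary where the data is read (rather than at every lookup) would probably be the less invasive change, but where that boundary sits in the script is something I have not verified.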
Environment information
For bugs: reproduction and error logs