tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Cannot load data from imdb_reviews datasets. #2604

Open MAMOMIMOMU opened 3 years ago

MAMOMIMOMU commented 3 years ago

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:

  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior

When I run

import tensorflow as tf
import tensorflow_datasets as tfds

imdb, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)

the only output is

Downloading and preparing dataset imdb_reviews (80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/0.1.0...
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Dl Completed...', max=1.0, style=Progre…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Dl Size...', max=1.0, style=ProgressSty…

and the loading never finishes.

I tried importing the data from keras.datasets instead, and that worked without any warnings or errors. The problem is that I don't know how to get the subwords version through keras.datasets, which with TFDS would be:

dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)

Here are the details of my environment (I'm using Docker):

FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
RUN apt-get update && apt-get install -y \
    sudo \
    wget \
    vim
WORKDIR /opt
RUN wget https://repo.continuum.io/archive/Anaconda3-2020.07-Linux-x86_64.sh && \
    sh Anaconda3-2020.07-Linux-x86_64.sh -b -p /opt/anaconda3 && \
    rm -f Anaconda3-2020.07-Linux-x86_64.sh

ENV PATH /opt/anaconda3/bin:$PATH

RUN conda update conda && conda install \
    keras \
    scipy \
    tensorflow-gpu

WORKDIR /

CMD ["jupyter", "lab", "--ip=0.0.0.0", "--allow-root", "--LabApp.tokenh=''"]

Describe the expected behavior

Standalone code to reproduce the issue

Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

Conchylicultor commented 3 years ago

Note: the subwords8k config is deprecated. Users should use tensorflow_text for tokenising samples.

As for your issue, it may be a problem with the internet connection. You should also try with tfds-nightly.

MAMOMIMOMU commented 3 years ago

@Conchylicultor Thank you for replying to my question. I don't think there is any problem with my internet connection, because I'm on a wired connection rather than Wi-Fi. I also tried to download the data many times, and I could search Google while doing so. Could you explain in detail what you expect me to do with tfds-nightly? I successfully installed tfds-nightly with pip as below;

pip install -q tfds-nightly
tfds --version

I was just following this page: https://www.tensorflow.org/datasets/cli?hl=da_DK&skip_cache=true

Conchylicultor commented 3 years ago

tfds-nightly contains the latest version of TFDS. Can you run import tensorflow_datasets as tfds; print(tfds.__version__) to check that you are using 4.0.0 or above? We have fixed bugs, and those fixes might not be available in the version you're using. I believe the TFDS version in conda is very outdated.
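The version check suggested above can also be done without importing TFDS itself. This is a minimal sketch that assumes the pip distribution names tensorflow-datasets and tfds-nightly; the helper names parse_version and installed_version are made up for illustration, and conda-installed packages may or may not expose the same metadata.

```python
from importlib import metadata


def parse_version(text):
    """Turn '4.0.1' or '4.0.1.dev202011' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric piece, e.g. 'dev202011'
        parts.append(int(piece))
    return tuple(parts)


def installed_version(*package_names):
    """Return the version string of the first installed package, or None."""
    for name in package_names:
        try:
            return metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None


# Assumed distribution names: the stable release and the nightly build.
version = installed_version("tensorflow-datasets", "tfds-nightly")
if version is None:
    print("TFDS is not installed in this environment")
elif parse_version(version) >= (4, 0, 0):
    print(f"TFDS {version} is 4.0.0 or above")
else:
    print(f"TFDS {version} is older than 4.0.0; consider upgrading")
```

Running this inside the same container or notebook kernel that calls tfds.load matters, since a different Python environment can report a different version.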

MAMOMIMOMU commented 3 years ago

@Conchylicultor Thank you very much for the advice. However, I'm already using TFDS version 4.0.1. I looked at the folder that seems to contain the files related to text processing, and it is named 'deprecated'. Inside it there is a folder named 'text'. Is it a problem that the 'text' folder is inside the 'deprecated' folder? I'm not sure what 'deprecated' means, but I'm sure it means something bad.

MAMOMIMOMU commented 3 years ago

After I ran

ds = tfds.load('mnist', split='train', as_supervised=True)

to load the mnist dataset, I got the output below.

Downloading and preparing dataset mnist/3.0.1 (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /root/tensorflow_datasets/mnist/3.0.1...
WARNING:absl:Dataset mnist is hosted on GCS. It will automatically be downloaded to your
local data directory. If you'd instead prefer to read directly from our public
GCS bucket (recommended if you're running on GCP), you can instead pass
`try_gcs=True` to `tfds.load` or set `data_dir=gs://tfds-data/datasets`.

HBox(children=(FloatProgress(value=0.0, description='Dl Completed...', max=4.0, style=ProgressStyle(descriptio…

Dataset mnist downloaded and prepared to /root/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.

After that, I ran

dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

and got output that was different from all of my many earlier attempts.

WARNING:absl:TFDS datasets with text encoding are deprecated and will be removed in a future version. Instead, you should use the plain text version and tokenize the text using `tensorflow_text` (See: https://www.tensorflow.org/tutorials/tensorflow_text/intro#tfdata_example)
Downloading and preparing dataset imdb_reviews/subwords8k/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0...
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Dl Completed...', max=1.0, style=Progre…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Dl Size...', max=1.0, style=ProgressSty…

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incomplete4VWZH2/imdb_reviews-train.tfrecord
HBox(children=(FloatProgress(value=0.0, max=25000.0), HTML(value='')))
HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incomplete4VWZH2/imdb_reviews-test.tfrecord
HBox(children=(FloatProgress(value=0.0, max=25000.0), HTML(value='')))
HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incomplete4VWZH2/imdb_reviews-unsupervised.tfrecord
HBox(children=(FloatProgress(value=0.0, max=50000.0), HTML(value='')))
WARNING:absl:Dataset is using deprecated text encoder API which will be removed soon. Please use the plain_text version of the dataset and migrate to `tensorflow_text`.
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0. Subsequent calls will reuse this data.

However, the data was unusable, and I couldn't run the code below;

tokenizer = info.features['text'].encoder

BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(train_dataset))
test_dataset = test_dataset.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(test_dataset))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

NUM_EPOCHS = 10
history = model.fit(train_dataset, epochs=NUM_EPOCHS, validation_data=test_dataset)

saying the input shape is (None, None, None).

Can you tell from the output above where the problem is? The load succeeded only once in more than 100 attempts, and even then the data seemed broken. Since then, the same steps no longer load any data, although I haven't changed any settings or my environment, and I don't know what I need to do to load the data.
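The logs above show TFDS writing into a directory named 1.0.0.incomplete4VWZH2 before the final 1.0.0 directory appears. A small sketch like the following, assuming the default data directory ~/tensorflow_datasets (the logs show /root/tensorflow_datasets), lists any such directories still present, which can help tell whether an earlier preparation stopped partway. The helper name find_incomplete_dirs is hypothetical, and whether leftovers are related to the hang is not confirmed anywhere in this thread.

```python
from pathlib import Path


def find_incomplete_dirs(data_dir):
    """Return directories under data_dir whose name contains '.incomplete'."""
    root = Path(data_dir).expanduser()
    if not root.is_dir():
        return []
    # Matches names such as '1.0.0.incomplete4VWZH2' at any depth.
    return sorted(p for p in root.rglob("*.incomplete*") if p.is_dir())


# Assumed location: the TFDS default data directory for the current user.
for path in find_incomplete_dirs("~/tensorflow_datasets"):
    print("leftover from an unfinished preparation:", path)
```

If nothing is printed, there are no such directories under that path, so the cause of the hang would have to be looked for elsewhere.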

PrattJena commented 3 years ago

Deprecated means it's no longer recommended and will be removed in a future version. As @Conchylicultor said, and as also noted here, imdb_reviews uses tfds.deprecated.text.SubwordTextEncoder, hence it comes under the deprecated folder. As for the code, it should run correctly: I tried it on Google Colab and faced no issues. Note that I stopped the training midway, which is why it shows an error there. The Colab gist can be found here.
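For readers who want to move off the deprecated subwords8k config, here is a hedged sketch of loading the plain-text version and tokenising it yourself. It swaps in Keras's word-level tf.keras.layers.TextVectorization rather than the tensorflow_text tokenisers recommended in the thread, so it is not a drop-in replacement for the subword encoder. It assumes a TensorFlow release that exposes TextVectorization at that path; MAX_TOKENS, SEQUENCE_LENGTH and the function names are illustrative choices, not values from the issue.

```python
# Sketch only: calling these functions requires tensorflow and
# tensorflow_datasets; they are imported lazily inside each function.
MAX_TOKENS = 8000       # assumed vocabulary size, loosely mirroring "subwords8k"
SEQUENCE_LENGTH = 200   # assumed fixed length after padding/truncation
BATCH_SIZE = 64         # same batch size as the code in the issue


def load_plain_text_imdb():
    """Load the plain-text IMDB reviews as (text, label) pairs."""
    import tensorflow_datasets as tfds

    train_ds, test_ds = tfds.load(
        "imdb_reviews", split=["train", "test"], as_supervised=True
    )
    return train_ds, test_ds


def vectorize_datasets(train_ds, test_ds):
    """Fit a word-level vocabulary on the training text and map both splits to ids."""
    import tensorflow as tf

    vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=MAX_TOKENS, output_sequence_length=SEQUENCE_LENGTH
    )
    # Build the vocabulary from the training text only.
    vectorizer.adapt(train_ds.map(lambda text, label: text).batch(BATCH_SIZE))

    def to_ids(text, label):
        # Replace each batch of raw strings with a batch of integer token ids.
        return vectorizer(text), label

    train_ids = train_ds.shuffle(10000).batch(BATCH_SIZE).map(to_ids)
    test_ids = test_ds.batch(BATCH_SIZE).map(to_ids)
    return train_ids, test_ids
```

With this approach, the Embedding layer in the model from the issue would take MAX_TOKENS as its input dimension instead of tokenizer.vocab_size, and the padded_batch calls are no longer needed because every sequence already has SEQUENCE_LENGTH ids.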