Open MAMOMIMOMU opened 3 years ago
Note: the subwords8k config is deprecated. Users should use tensorflow_text for tokenising samples.
Your problem may be caused by the internet connection. You should also try with tfds-nightly.
@Conchylicultor
Thank you for replying to my question. I don't think there is any problem with my internet connection, because I'm using a cable rather than wifi. Also, I tried to download the data many times while I was still able to search Google.
Could you show me in detail what you would expect me to do with tfds-nightly?
I successfully installed tfds-nightly with pip as follows:
pip install -q tfds-nightly
tfds --version
I was just following this link: https://www.tensorflow.org/datasets/cli?hl=da_DK&skip_cache=true
TFDS nightly contains the latest version of TFDS. Can you check with import tensorflow_datasets as tfds ; print(tfds.__version__) that you are using 4.0.0 or above? We have fixed bugs, and those fixes might not be available in the version you're using. I believe the TFDS version in conda is very outdated.
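The version check suggested above could be scripted roughly as follows. This is only a sketch: the helper names version_tuple and is_at_least are hypothetical, and the parsing assumes the version string starts with three dot-separated numbers (any suffix after them is ignored).

```python
import re


def version_tuple(version):
    """Return the leading MAJOR.MINOR.PATCH of a version string as a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_at_least(version, minimum="4.0.0"):
    """True if `version` is the same as or newer than `minimum` (hypothetical helper)."""
    return version_tuple(version) >= version_tuple(minimum)


try:
    import tensorflow_datasets as tfds

    status = "ok" if is_at_least(tfds.__version__) else "older than 4.0.0"
    print(tfds.__version__, status)
except ImportError:
    # Nothing to check if the package is not installed in this environment.
    print("tensorflow_datasets is not installed")
```

Tuples of integers are compared instead of raw strings so that, for example, 4.10.0 is correctly treated as newer than 4.9.0.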
@Conchylicultor Thank you very much for the advice. However, I'm already using tfds version 4.0.1. I checked the folder that contains the files apparently related to text processing, and the folder is named 'deprecated'. Under it there is a folder named 'text'. Is it a problem that the 'text' folder is inside the 'deprecated' folder? I'm not sure what the word 'deprecated' means, but I'm sure it means something bad.
After I ran
ds = tfds.load('mnist', split='train', as_supervised=True)
and loaded the mnist dataset, I got the output below.
Downloading and preparing dataset mnist/3.0.1 (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /root/tensorflow_datasets/mnist/3.0.1...
WARNING:absl:Dataset mnist is hosted on GCS. It will automatically be downloaded to your
local data directory. If you'd instead prefer to read directly from our public
GCS bucket (recommended if you're running on GCP), you can instead pass
`try_gcs=True` to `tfds.load` or set `data_dir=gs://tfds-data/datasets`.
HBox(children=(FloatProgress(value=0.0, description='Dl Completed...', max=4.0, style=ProgressStyle(descriptio…
Dataset mnist downloaded and prepared to /root/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.
After that, I ran
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']
and got output that was different from that of all my many earlier attempts.
WARNING:absl:TFDS datasets with text encoding are deprecated and will be removed in a future version. Instead, you should use the plain text version and tokenize the text using `tensorflow_text` (See: https://www.tensorflow.org/tutorials/tensorflow_text/intro#tfdata_example)
Downloading and preparing dataset imdb_reviews/subwords8k/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0...
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Dl Completed...', max=1.0, style=Progre…
HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Dl Size...', max=1.0, style=ProgressSty…
HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incomplete4VWZH2/imdb_reviews-train.tfrecord
HBox(children=(FloatProgress(value=0.0, max=25000.0), HTML(value='')))
HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incomplete4VWZH2/imdb_reviews-test.tfrecord
HBox(children=(FloatProgress(value=0.0, max=25000.0), HTML(value='')))
HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incomplete4VWZH2/imdb_reviews-unsupervised.tfrecord
HBox(children=(FloatProgress(value=0.0, max=50000.0), HTML(value='')))
WARNING:absl:Dataset is using deprecated text encoder API which will be removed soon. Please use the plain_text version of the dataset and migrate to `tensorflow_text`.
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0. Subsequent calls will reuse this data.
However, the data was unusable: I couldn't run the code below.
import tensorflow as tf

tokenizer = info.features['text'].encoder
BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(train_dataset))
test_dataset = test_dataset.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(test_dataset))
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
NUM_EPOCHS = 10
history = model.fit(train_dataset, epochs=NUM_EPOCHS, validation_data=test_dataset)
It fails with a message saying the input shape is (None, None, None).
Could you tell from the output above where the problem is? Without changing any settings or my environment, I managed to load the data once in 100+ attempts, but that data is broken. I don't know exactly what I need to do to load the data again, because now I cannot load any data with the same steps.
Deprecated means it is no longer recommended and will be removed in a future version. As said by @Conchylicultor and also here, imdb_reviews uses tfds.deprecated.text.SubwordTextEncoder, hence it comes under the deprecated folder.
As for the code, it should run correctly. I tried it on Google Colab and faced no issues. Note that I stopped the training midway, which is why it shows an error there. The Colab gist can be found here
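The deprecation warning quoted earlier recommends loading the plain-text version of the dataset and tokenising it separately with tensorflow_text. As a rough sketch of that kind of pipeline, the snippet below swaps in tf.keras.layers.TextVectorization instead (this is not tensorflow_text, and it tokenises by word rather than by subword). It assumes a recent TensorFlow release in which this layer is available under tf.keras.layers. A tiny in-memory dataset stands in for the real download, and MAX_TOKENS and SEQ_LEN are hypothetical values.

```python
import tensorflow as tf

# Stand-in for the plain-text dataset so the sketch runs without a download.
# With network access one would instead load the plain-text config, e.g.:
#   import tensorflow_datasets as tfds
#   train_ds = tfds.load('imdb_reviews', split='train', as_supervised=True)
texts = ["a wonderful film", "a dull and boring film"]
labels = [1, 0]
train_ds = tf.data.Dataset.from_tensor_slices((texts, labels))

MAX_TOKENS = 8000  # hypothetical vocabulary cap
SEQ_LEN = 8        # hypothetical fixed sequence length (pads or truncates)

# Word-level tokenizer whose vocabulary is learned from the training text only.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=MAX_TOKENS, output_sequence_length=SEQ_LEN)
vectorizer.adapt(train_ds.map(lambda text, label: text).batch(2))


def encode(text, label):
    """Turn a batch of raw strings into a batch of integer token ids."""
    return vectorizer(text), label


# Batch first, then tokenise; every batch has shape (batch_size, SEQ_LEN).
encoded_ds = train_ds.batch(2).map(encode)
```

Because the layer pads every example to SEQ_LEN, a plain batch call is enough here and padded_batch is not needed. An Embedding layer fed by this pipeline could take vectorizer.vocabulary_size() as its input dimension in place of tokenizer.vocab_size.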
Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template
System information
You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:
TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"
Describe the current behavior When I run
the only output is
and loading takes forever (it never finishes).
I tried importing the data from keras.datasets, and there seem to be no problems with it (the data was imported with no warnings or errors). The problem is that I don't know how to import the subwords version using keras.datasets, as below:
I also show the details of my environment (I'm using Docker).
Describe the expected behavior
Standalone code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.
Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.