google-research / bigbird

Transformers for Longer Sequences
https://arxiv.org/abs/2007.14062
Apache License 2.0
563 stars 101 forks

TFDS Custom Dataset Issue - normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization. #27

Closed jtfields closed 2 years ago

jtfields commented 2 years ago

I am using BigBird with a custom (essay, label) dataset for classification. I successfully imported the data as a custom tfds dataset, and the BigBird classifier runs but returns no results, as shown in the log below. In the my_dataset.py configuration file for tfds, I define the text feature as `'text': tfds.features.Text()`. I believe I need to add an encoder, but TensorFlow has deprecated the encoder argument of `tfds.features.Text` and recommends the new tensorflow_text library instead, without explaining how to use it with `tfds.features.Text`. Can anyone recommend a way to encode the text so BigBird can perform the classification?
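In case it helps others with the same question: since the encoder argument of `tfds.features.Text` is deprecated, one common option is to keep the feature as a plain string and tokenize in the `tf.data` input pipeline, for example with `tf.keras.layers.TextVectorization`. This is a minimal sketch, not the tokenizer BigBird itself uses (the `normalizer.cc` INFO message in the log comes from SentencePiece and is harmless); the dataset contents below are made-up placeholders, not the real essay data:

```python
import tensorflow as tf

# Toy stand-in for the custom (essay, label) dataset; in practice this
# would come from the custom tfds builder via tfds.load(...).
ds = tf.data.Dataset.from_tensor_slices({
    "text": ["first essay text", "second essay about something else"],
    "label": [0, 1],
})

# Keep 'text': tfds.features.Text() as a plain string feature in the
# builder, and tokenize here in the pipeline instead of in tfds.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000, output_mode="int", output_sequence_length=16)
vectorizer.adapt(ds.map(lambda ex: ex["text"]).batch(2))

# Map raw strings to padded integer token ids alongside the labels.
encoded = ds.batch(2).map(
    lambda ex: (vectorizer(ex["text"]), ex["label"]))
for ids, labels in encoded.take(1):
    print(ids.shape)  # (2, 16): batch of 2 essays, 16 token ids each
```

The vocabulary size, sequence length, and dataset contents above are illustrative; the real model would use the sequence length BigBird is configured for.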

```
My GPUS are 0
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
{'label': <tf.Tensor 'ParseSingleExample/ParseExample/ParseExampleV2:0' shape=() dtype=int64>, 'text': <tf.Tensor 'ParseSingleExample/ParseExample/ParseExampleV2:1' shape=() dtype=string>}
Tensor("args_1:0", shape=(), dtype=string) Tensor("args_0:0", shape=(), dtype=int64)

  0%|          | 0/199 [00:00<?, ?it/s]
 42%|████▏     | 84/199 [00:00<00:00, 838.07it/s]
100%|██████████| 199/199 [00:00<00:00, 1124.10it/s]

  0%|          | 0/2000 [00:00<?, ?it/s]
  0%|          | 0/2000 [00:00<?, ?it/s]
{'label': <tf.Tensor 'ParseSingleExample/ParseExample/ParseExampleV2:0' shape=() dtype=int64>, 'text': <tf.Tensor 'ParseSingleExample/ParseExample/ParseExampleV2:1' shape=() dtype=string>}
Tensor("args_1:0", shape=(), dtype=string) Tensor("args_0:0", shape=(), dtype=int64)

0it [00:00, ?it/s]
0it [00:00, ?it/s]
Loss = 0.0 Accuracy = 0.0
```

jtfields commented 2 years ago

I now believe this is related to the path issue described in https://github.com/tensorflow/datasets/issues/1544.

tfds was pointing to the my_datasets directory in my home directory rather than in my virtual environment env3. I haven't updated create_new_datasets.py as suggested in 1544, because uninstalling and reinstalling the tensorflow_datasets package in both my home and env3 environments resolved the issue to the point where I can now run BigBird.
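For anyone debugging a similar home-directory vs. virtualenv mismatch, a quick stdlib-only check of which tensorflow_datasets installation the interpreter will actually import (nothing here is specific to BigBird):

```python
import importlib.util
import sys

# Resolve the tensorflow_datasets package the current interpreter would
# import, without importing it; compare against the active env prefix.
spec = importlib.util.find_spec("tensorflow_datasets")
if spec is None:
    print("tensorflow_datasets is not installed in", sys.prefix)
else:
    print("tensorflow_datasets resolves to:", spec.origin)
print("active environment prefix:", sys.prefix)
```

If the resolved path points outside the virtualenv (e.g. into the home directory's site-packages), a custom dataset registered there will not be visible to the copy of tfds the env actually uses.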

Please close this issue. Thank you.