google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

How to get the test embeddings from output of fine-tuned model (tutorial) #260

Open · kelseyneis opened this issue 2 years ago

kelseyneis commented 2 years ago

Is there a way to easily generate the embeddings of the test data from a fine-tuned model?

Here's what I've tried:

I followed the MRPC tutorial with the flags below (all defaults except --do_predict=True and --export_dir):

os.environ['TFHUB_CACHE_DIR'] = OUTPUT_DIR
!python -m albert.run_classifier \
  --data_dir="glue_data/" \
  --output_dir=$OUTPUT_DIR \
  --albert_hub_module_handle=$ALBERT_MODEL_HUB \
  --spm_model_file="from_tf_hub" \
  --do_train=True \
  --do_eval=True \
  --do_predict=True \
  --max_seq_length=512 \
  --optimizer=adamw \
  --task_name=$TASK \
  --warmup_step=200 \
  --learning_rate=2e-5 \
  --train_step=800 \
  --save_checkpoints_steps=100 \
  --train_batch_size=32 \
  --tpu_name=$TPU_ADDRESS \
  --use_tpu=True \
  --export_dir="$OUTPUT_DIR/saved_models/"

This produced a saved_model.pb file, which I wanted to load so I could generate embeddings for the test data and do some error analysis.
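To see what the export actually exposes, I also loaded it directly and listed its signatures (a quick sketch; export_path is my placeholder for the timestamped directory the estimator creates under saved_models/):

import tensorflow as tf

# Load the exported SavedModel and list the serving signatures it provides.
loaded = tf.saved_model.load(export_path)
print(list(loaded.signatures.keys()))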

I tried running something similar to this code:

import tensorflow as tf
import tensorflow_hub as hub

# Map raw strings to pooled ALBERT embeddings.
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/albert_en_preprocess/3")
encoder_inputs = preprocessor(text_input)
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/albert_en_base/3",
    trainable=True)
outputs = encoder(encoder_inputs)
pooled_output = outputs["pooled_output"]      # [batch_size, 768]
sequence_output = outputs["sequence_output"]  # [batch_size, seq_length, 768]

embedding_model = tf.keras.Model(text_input, pooled_output)
sentences = tf.constant(["hello", "hello"])
print(embedding_model(sentences))

This worked with the base model from TensorFlow Hub. I then swapped the preprocessor handle for the location of my saved-model folder (which also contains assets/ and variables/).
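Reconstructed from the traceback below (the path is my export directory, not a real tfhub.dev handle), the modified call looked roughly like this:

import tensorflow_hub as hub

# Point the KerasLayer at the local export instead of the tfhub.dev handle.
preprocessor = hub.KerasLayer(
    saved_model_dir,
    signature='tokens',
    signature_outputs_as_dict=True)
encoder_inputs = preprocessor(text_input)

With that change I got the following error: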

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-38-6c11f4769dd0> in <module>()
      5     signature='tokens',
      6     signature_outputs_as_dict=True)
----> 7 encoder_inputs = preprocessor(text_input)
      8 encoder = hub.KerasLayer(
      9     "https://tfhub.dev/tensorflow/albert_en_base/3",

1 frames

/usr/local/lib/python3.7/dist-packages/tensorflow/python/autograph/impl/api.py in wrapper(*args, **kwargs)
    690       except Exception as e:  # pylint:disable=broad-except
    691         if hasattr(e, 'ag_error_metadata'):
--> 692           raise e.ag_error_metadata.to_exception(e)
    693         else:
    694           raise

TypeError: Exception encountered when calling layer "keras_layer_7" (type KerasLayer).

in user code:

    File "/usr/local/lib/python3.7/dist-packages/tensorflow_hub/keras_layer.py", line 229, in call  *
        result = f()

    TypeError: pruned(input_ids, input_mask, segment_ids) takes 0 positional arguments, got 1.

Call arguments received:
  • inputs=tf.Tensor(shape=(None,), dtype=string)
  • training=False

This may come down to my limited knowledge of TensorFlow, but the ALBERT code is giving me a SavedModel that seems to be in a different format from the other SavedModels I've used. Can the saved model generated by the ALBERT classifier be used this way?
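For reference, the error message ("pruned(input_ids, input_mask, segment_ids) takes 0 positional arguments") makes me think the exported 'tokens' signature wants already-tokenized keyword inputs rather than a raw string tensor. This sketch is what I imagine the call would look like (the shapes, dtypes, and dummy values are my guesses, not verified code):

import tensorflow as tf

# Load the export and grab the 'tokens' serving signature.
loaded = tf.saved_model.load(export_path)
infer = loaded.signatures["tokens"]

# Dummy tokenized inputs; real ids would come from the ALBERT
# sentencepiece tokenizer, padded and masked to max_seq_length.
max_seq_length = 512
input_ids = tf.zeros([1, max_seq_length], dtype=tf.int32)
input_mask = tf.ones([1, max_seq_length], dtype=tf.int32)
segment_ids = tf.zeros([1, max_seq_length], dtype=tf.int32)

outputs = infer(input_ids=input_ids,
                input_mask=input_mask,
                segment_ids=segment_ids)
print(outputs.keys())

Is that the intended calling convention, and if so, which output would hold the embeddings?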