Error when train a deep learning model with tensorflow hub keras layer using orca

zhijun510 commented 1 year ago

Hello,

I am trying to train a DL model with a TensorFlow Hub Keras layer (trainable = True), but I get the following error:

_pickle.PickleError: can't pickle repeated message fields, convert to list first

```python
# Imports were not shown in the original post; these are assumed.
import tensorflow as tf
import tensorflow_hub as hub
from pyspark import SparkFiles
from bigdl.orca.learn.tf2 import Estimator


def get_model(spark, batch_size, train_data_size):
    finddistance = "hdfs://<path>/universal-sentence-encoder_4"
    finddistancename = "universal-sentence-encoder_4"
    spark.sparkContext.addFile(finddistance, recursive=True)

    # The hub layers are created here, outside model_creator
    embed1 = hub.KerasLayer(SparkFiles.get(finddistancename), trainable=True)
    embed2 = hub.KerasLayer(SparkFiles.get(finddistancename), trainable=True)

    def model_creator(config):
        input_1 = tf.keras.Input(shape=(), name='input_1', dtype='string')
        embedding_layer_input_1 = embed1(input_1)

        input_2 = tf.keras.Input(shape=(), name='input_2', dtype='string')
        embedding_layer_input_2 = embed2(input_2)

        # Concatenate the two sentence embeddings (not the raw string inputs)
        concat_layer = tf.keras.layers.Concatenate(axis=1)(
            [embedding_layer_input_1, embedding_layer_input_2]
        )
        output = tf.keras.layers.Dense(1, name='output', activation='sigmoid')(concat_layer)
        model = tf.keras.Model(inputs=[input_1, input_2], outputs=[output])

        num_steps = int(train_data_size / batch_size)
        # Kept as posted: PiecewiseConstantDecay also requires a `values` argument
        learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay([0.1])
        optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate_fn)

        model.compile(
            optimizer=optimizer,
            loss={"output": "binary_crossentropy"},
            metrics=["accuracy"],
        )
        return model

    return model_creator


num_rows = df_train.count()
model = get_model(spark, batch_size, num_rows)
est = Estimator.from_keras(model_creator=model, workers_per_node=2, backend="spark", log_to_driver=False)

est.fit(data=df_train,
        batch_size=batch_size,
        epochs=max_epoch,
        feature_cols=["input_1", "input_2", "input_3"],
        label_cols=["output"],
        steps_per_epoch=num_rows // batch_size)
```

hkvision commented 1 year ago

Could you try creating the embedding layers inside model_creator instead and run it again? Something like:

def model_creator(config):
    # Create the embedding layers here instead of in get_model
    embed1 = hub.KerasLayer(SparkFiles.get(finddistancename), trainable=True)
    embed2 = hub.KerasLayer(SparkFiles.get(finddistancename), trainable=True)
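
A likely reason this helps: the model_creator function is serialized and sent to the workers, so anything it captures from the enclosing scope has to be picklable, and a loaded hub layer apparently is not. The sketch below is only an analogy built on the standard library. `load_fake_layer`, the lock, and the path are hypothetical stand-ins, not TF Hub or Orca APIs. It shows that shipping the path and loading on the worker avoids the pickling step for the layer itself.

```python
import pickle
import threading


def load_fake_layer(path):
    """Hypothetical stand-in for loading a hub layer.

    The lock plays the role of internal state that pickle refuses to
    serialize (in this thread, the protobuf fields inside the loaded model).
    """
    return {"path": path, "handle": threading.Lock()}


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError):
        return False


# Pattern 1: build the layer on the driver and ship it with the job state.
eager_state = {"layer": load_fake_layer("/tmp/fake-encoder")}

# Pattern 2: ship only the path; each worker loads the layer itself.
lazy_state = {"layer_path": "/tmp/fake-encoder"}


def worker_build(state):
    """Runs on the worker: rebuild the layer from the shipped path."""
    return load_fake_layer(state["layer_path"])


print(is_picklable(eager_state))  # False: the lock cannot be pickled
print(is_picklable(lazy_state))   # True: only a string is shipped

restored = pickle.loads(pickle.dumps(lazy_state))
print(worker_build(restored)["path"])  # /tmp/fake-encoder
```

Moving the two hub.KerasLayer calls into model_creator corresponds to the second pattern: only the file name crosses the process boundary, and the layer is created where it is used.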

zhijun510 commented 1 year ago

Thanks for your reply. I tried your solution but now I encountered another error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Node: 'Assert/Assert' assertion failed: [Trying to access a placeholder that is not supposed to be executed. This means you are executing a graph generated from the cross-replica context in an in-replica context.] [[{{node Assert/Assert}}]] [Op:__inference_restored_function_body_3839]

Is it caused by the Keras layer downloaded from TensorFlow Hub?

hkvision commented 1 year ago

Sorry for the late reply. We haven't tested layers from TF Hub yet, and we will try to reproduce this very soon. @sgwhat Please take a look at it.

sgwhat commented 1 year ago

We have reproduced your issue. The cause is that this model does not support being used under strategy.scope(). According to the official response from the TensorFlow Hub team, you can switch to one of the following models instead:

https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-base/1

https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-base-br/1

For more details, you may see the TF Hub Reply.

intel-analytics / ipex-llm

Error when train a deep learning model with tensorflow hub keras layer using orca #8897