intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Error when training a deep learning model with a TensorFlow Hub Keras layer using Orca #8897

Open zhijun510 opened 1 year ago

zhijun510 commented 1 year ago

Hello,

I am trying to train a DL model with a TensorFlow Hub Keras layer (trainable = True), but I get the following error:

_pickle.PickleError: can't pickle repeated message fields, convert to list first

```python
def get_model(spark, batch_size, train_data_size):
    finddistance = "hdfs://<path>/universal-sentence-encoder_4"
    finddistancename = "universal-sentence-encoder_4"
    spark.sparkContext.addFile(finddistance, recursive = True)

    embed1 = hub.KerasLayer(SparkFiles.get(finddistancename), trainable = True)
    embed2 = hub.KerasLayer(SparkFiles.get(finddistancename), trainable = True)

    def model_creator(config):
        input_1 = tf.keras.Input(shape=(), name='input_1', dtype = 'string')
        embedding_layer_input_1 = embed1(input_1)

        input_2 = tf.keras.Input(shape=(), name='input_2', dtype = 'string')
        embedding_layer_input_2 = embed2(input_2)

        concat_layer = tf.keras.layers.Concatenate(axis=1)([input_1, input_2])

        output = tf.keras.layers.Dense(1, name='output', activation='sigmoid')(concat_layer)

        model = tf.keras.Model(inputs=[input_1, input_2], outputs=[output])

        num_steps = int(train_data_size / batch_size)

        learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay([0.1])

        optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate_fn)

        model.compile(
            optimizer=optimizer,
            loss={
                "output": "binary_crossentropy",
            },
            metrics=["accuracy"]
        )

        return model

    return model_creator


num_rows = df_train.count()
model = get_model(spark, batch_size, num_rows)
est = Estimator.from_keras(model_creator=model, workers_per_node=2, backend="spark", log_to_driver=False)

est.fit(data=df_train,
        batch_size=batch_size,
        epochs=max_epoch,
        feature_cols=["input_1", "input_2", "input_3"],
        label_cols=["output"],
        steps_per_epoch=num_rows // batch_size)
```
hkvision commented 1 year ago

Could you try creating the embedding layers inside model_creator instead and run it again? Something like:

def model_creator(config):
    # Create the embedding layers here instead of in get_model
    embed1 = hub.KerasLayer(SparkFiles.get(finddistancename), trainable = True)
    embed2 = hub.KerasLayer(SparkFiles.get(finddistancename), trainable = True)
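
The reasoning behind this suggestion, as far as I can tell, is that model_creator has to be serialized and shipped to the Spark workers, so anything it captures from get_model gets pickled along with it. The sketch below illustrates that idea with the standard library only; it is not Orca or TensorFlow Hub code. `FakeLayer`, `EagerCreator`, `LazyCreator` and `can_pickle` are hypothetical names, and a thread lock stands in for the unpicklable state inside the hub layer. Callable classes are used instead of closures because the stdlib pickle module cannot serialize nested functions.

```python
import pickle
import threading


class FakeLayer:
    """Hypothetical stand-in for hub.KerasLayer: holds a handle that cannot be pickled."""

    def __init__(self, path):
        self.path = path
        self._handle = threading.Lock()  # thread locks are not picklable


class EagerCreator:
    """Builds the layer up front, so the layer is part of the creator's pickled state."""

    def __init__(self, path):
        self.layer = FakeLayer(path)

    def __call__(self, config):
        return self.layer


class LazyCreator:
    """Stores only the path and builds the layer when it is called."""

    def __init__(self, path):
        self.path = path

    def __call__(self, config):
        return FakeLayer(self.path)


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds, False if it raises."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


eager_ok = can_pickle(EagerCreator("universal-sentence-encoder_4"))
lazy_ok = can_pickle(LazyCreator("universal-sentence-encoder_4"))
print("eager creator picklable:", eager_ok)
print("lazy creator picklable:", lazy_ok)
```

The lazy variant corresponds to constructing embed1 and embed2 inside model_creator: only the file name crosses the process boundary, and each worker builds its own layer.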
zhijun510 commented 1 year ago

Thanks for your reply. I tried your solution, but now I get a different error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Node: 'Assert/Assert' assertion failed: [Trying to access a placeholder that is not supposed to be executed. This means you are executing a graph generated from the cross-replica context in an in-replica context.] [[{{node Assert/Assert}}]] [Op:__inference_restored_function_body_3839]

Is this caused by the Keras layer downloaded from TensorFlow Hub?

hkvision commented 1 year ago

Sorry for the late reply. We haven't tested layers loaded from TF Hub yet, and we will try to reproduce this very soon. @sgwhat, please take a look.

sgwhat commented 1 year ago

We have reproduced your issue. The cause is that this model does not support being used under strategy.scope(). According to the official response from the TensorFlow Hub team, you can switch to one of the following models instead:

https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-base/1

https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-base-br/1

For more details, you may see the TF Hub Reply.