tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.

TFRS is hanging forever #599

Closed Daard closed 1 year ago

Daard commented 1 year ago

Problem

Hello!

I ran into a problem when trying to train a Retrieval task: the training process just stops and makes no progress.

I tried decreasing the number of candidates for the Retrieval task from 20K to 200 examples, but that did not change the situation.

I am using the Rating task to classify user likes/dislikes of pages, so I filtered only positive candidates for the Retrieval task.

I also tried model parallelism, placing the Retrieval task on a second GPU, but that approach did not work either.

How can I solve this issue? Could I use in-batch metrics instead, given that I already have positive and negative ratings?

2022-12-21 12:58:35.214177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20720 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:06:00.0, compute capability: 8.6
2022-12-21 12:58:35.216048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 22309 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:82:00.0, compute capability: 8.6
2022-12-21 12:58:35.216921: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 1305 MB memory:  -> device: 2, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:83:00.0, compute capability: 8.6
WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.
WARNING:tensorflow:From /home/larion/miniconda3/envs/tfx_10/lib/python3.9/site-packages/tensorflow_data_validation/utils/statistics_io_impl.py:91: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
265it [00:00, 1159.47it/s]
339it [00:00, 1195.88it/s]
325it [00:00, 1192.51it/s]
IoU: 1.0
WARNING:tensorflow:Layer img_gru will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
/home/larion/miniconda3/envs/tfx_10/lib/python3.9/site-packages/keras/engine/functional.py:566: UserWarning: Input dict contained keys ['user_id', 'ann_id', 'page_id', 'bin_label'] which did not match any model input. They will be ignored by the model.
  inputs = self._flatten_to_reference_inputs(inputs)
Epoch 1/50
/home/larion/miniconda3/envs/tfx_10/lib/python3.9/site-packages/keras/engine/functional.py:566: UserWarning: Input dict contained keys ['ann_id', 'page_id', 'user_id'] which did not match any model input. They will be ignored by the model.
  inputs = self._flatten_to_reference_inputs(inputs)
2022-12-21 12:59:07.943231: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2022-12-21 12:59:08.848590: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8101

ENV

tfx==1.10
tensorflow-gpu==2.9
tensorflow-recommenders
tensorflow-addons==0.18.0
scikit-learn
matplotlib
jupyterlab
tqdm
python-dotenv
aiomysql
sqlalchemy
motor

Code:

from typing import Dict, Text, Tuple

import tensorflow as tf
import tensorflow_recommenders as tfrs


class RecommenderModel(tfrs.models.Model):

    def __init__(self, working_dir: str, unique_user_ids, trans_img_pages_ds):
        super().__init__()

        # User and page models.
        with tf.device('/gpu:0'):
            self.page_model: tf.keras.layers.Layer = build_page_deep_model(working_dir)

            # with tf.device('/gpu:1'):
            self.user_model: tf.keras.layers.Layer = tf.keras.Sequential([
                    tf.keras.layers.IntegerLookup(vocabulary=unique_user_ids, mask_token=None, name='user_lookup'),
                    tf.keras.layers.Embedding(len(unique_user_ids), 128, name='user_emb')
                    ], name='user_vector')

            # A small model to take in user and page embeddings and predict likes.
            self.rating_model = tf.keras.Sequential([
                tf.keras.layers.Dense(128),
                tf.keras.layers.BatchNormalization(),
                tf.keras.layers.Activation('tanh'),
                tf.keras.layers.Dense(64),
                tf.keras.layers.BatchNormalization(),
                tf.keras.layers.Activation('tanh'),
                tf.keras.layers.Dense(1, activation='sigmoid'),
            ], name='rating_batch_norm_model')

            self.rating_task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
                loss=tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.NONE),
                metrics=[tf.keras.metrics.BinaryAccuracy(),
                         tf.keras.metrics.Recall(),
                         tf.keras.metrics.Precision()])

        with tf.device('/gpu:1'):
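            # Note: the FactorizedTopK metric streams over the full candidates
            # dataset every time metrics are computed, so this dataset must be
            # finite or evaluation will never finish.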
            self.retrieval_task: tf.keras.layers.Layer = tfrs.tasks.Retrieval(
                metrics=tfrs.metrics.FactorizedTopK(
                    candidates=trans_img_pages_ds.map(self.page_model)))

    def call(self, features: Dict[Text, tf.Tensor]) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor]:
        # Drop the extra lookup dimension: [batch, 1, dim] -> [batch, dim].
        u_vector = self.user_model(features["user_id"])[:, 0, :]
        p_vector = self.page_model(features)
        res = self.rating_model(tf.concat([u_vector, p_vector], axis=1))
        return u_vector, p_vector, res

    def compute_loss(self, x_y, training=False) -> tf.Tensor:
        features, labels = x_y
        user_embeddings, page_embeddings, res = self(features)

        # Rating loss: binary cross-entropy, summed and normalised by the
        # global batch size (BATCH is defined elsewhere).
        rating_loss = tf.reduce_sum(self.rating_task(
            labels=labels,
            predictions=res,
        )) * (1. / BATCH)

        # Only positively rated user-page pairs are valid query/candidate
        # pairs for the retrieval task.
        positive_users = tf.boolean_mask(tf.expand_dims(user_embeddings, 1), labels)
        positive_pages = tf.boolean_mask(tf.expand_dims(page_embeddings, 1), labels)

        retrieval_loss = self.retrieval_task(positive_users, positive_pages)

        return rating_loss + retrieval_loss

# class_weight = {0: 1 / 0.91 / 2, 1:  1 / 0.09 / 2}

tf.keras.backend.clear_session()

# with strategy.scope():
model = RecommenderModel(output_dir, unique_user_ids, trans_img_pages_ds)
# model(x)
# model.load_weights('./checkpoints/rec_img_wide_model_epoch-95_val_recall-0.569.h5')

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=10000,
    decay_rate=0.95)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
# metrics=[tf.keras.metrics.BinaryAccuracy(), 
#                            tf.keras.metrics.Recall(),
#                            tf.keras.metrics.Precision()]

model.compile(optimizer=optimizer)
              # loss=tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.SUM), 
              # metrics=metrics)

callbacks = _build_callbacks('deep_ret_cls_model_1', monitor='recall')

steps_per_epoch, validation_steps = compute_steps(BATCH)

model.fit(
    train_img_ds,
    # validation_data=test_img_ds,
    initial_epoch=0,
    epochs=50, shuffle=False, verbose=1,
    steps_per_epoch=steps_per_epoch,
    # validation_steps=validation_steps,
    callbacks=callbacks)
    # class_weight=class_weight)
Daard commented 1 year ago

I have changed my dataset to keep only positive user-page pairs, for training only the retrieval task. I tested this guide and modified my compute_loss accordingly:

retrieval_loss = self.retrieval_task(user_embeddings,
                                     page_embeddings,
                                     candidate_ids=features["page_id"],  # <- ids of the items
                                     compute_metrics=not training)
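
For reference, a minimal sketch of how that call can sit inside compute_loss once the dataset yields only positive pairs (the assumption that examples are now plain feature dicts, without labels, is mine):

def compute_loss(self, features, training=False) -> tf.Tensor:
    # Every example is a positive user-page pair, so no masking is needed.
    user_embeddings = self.user_model(features["user_id"])[:, 0, :]
    page_embeddings = self.page_model(features)
    return self.retrieval_task(user_embeddings,
                               page_embeddings,
                               candidate_ids=features["page_id"],  # ids matching the candidates
                               compute_metrics=not training)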

I tried calling the evaluate method outside the training loop, but it hangs, and I see heavy CPU activity across many processes during evaluation. I waited 20 minutes, but that did not help. The dataset of positive user-item pairs is small (only ~7K pairs), and the number of items is also small (~5K).

I also tried BruteForce, but it got stuck too:

brute_force = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
brute_force.index_from_dataset(
    trans_pages_ds32.map(lambda features: (features['page_id'], model.page_model(features))))
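
Note that index_from_dataset consumes the entire dataset to build the index, so trans_pages_ds32 has to be finite for this call to return. Once indexing finishes, a query would look roughly like this (the user id and k below are made-up values for illustration):

# Top-10 page ids for a batch containing one hypothetical user id.
# model.user_model outputs [batch, 1, dim] because of the lookup layer,
# so the result may carry an extra dimension to squeeze out, as in call().
scores, page_ids = brute_force(tf.constant([[42]]), k=10)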

I updated my environment to the latest versions of TFX (1.12) and TensorFlow (2.11). During training I saw this warning:

2022-12-28 13:16:18.129750: W tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:115] *** WARNING *** You are using ptxas 10.1.243, which is older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.

You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.

But the loss was decreasing, so I did not change the CUDA version.

What should I do about this issue?

Daard commented 1 year ago

The problem was caused by pages_ds. I used tf.data.experimental.make_batched_features_dataset to read the prepared data from TFRecord files, but that way I created an infinite loop over my items data. Once I added take(n), the problem was gone.
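
For anyone hitting the same hang: make_batched_features_dataset repeats the data indefinitely by default (num_epochs=None), so any consumer that iterates the whole candidates dataset (the FactorizedTopK metric, index_from_dataset) never terminates. A minimal sketch of the fix, with a hypothetical file pattern and feature spec:

import tensorflow as tf

pages_ds = tf.data.experimental.make_batched_features_dataset(
    file_pattern="pages-*.tfrecord",  # hypothetical path
    batch_size=32,
    features={"page_id": tf.io.FixedLenFeature([1], tf.int64)},
    num_epochs=1,  # read the data once instead of repeating forever
    shuffle=False)

# Equivalent workaround described above: cap the number of batches explicitly.
# pages_ds = pages_ds.take(num_candidate_batches)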