Closed Daard closed 1 year ago
I changed my dataset and kept only positive user-page pairs, to train the retrieval task alone. I tested this guide and modified my compute_loss accordingly:
retrieval_loss = self.retrieval_task(
    user_embeddings,
    page_embeddings,
    candidate_ids=features["page_id"],  # <- index of items
    compute_metrics=not training)
I tried to call the evaluate method outside the training loop, but it hangs. I also see many processes on the CPUs during evaluation; waiting 20 minutes did not help. The dataset of positive user-item pairs is small, only 7K pairs, and there are only about 5K items.
I also tried BruteForce, but it got stuck as well:
brute_force = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
brute_force.index_from_dataset(
    trans_pages_ds32.map(lambda features: (features['page_id'], model.page_model(features))))
I updated my environment to the latest versions of TFX (1.12) and TensorFlow (2.11). During training I saw this warning:
2022-12-28 13:16:18.129750: W tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:115] *** WARNING *** You are using ptxas 10.1.243, which is older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.
You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.
But the loss was decreasing, so I did not change the CUDA version.
What should I do about this issue?
The problem was caused by pages_ds. I used tf.data.experimental.make_batched_features_dataset to read prepared data from TFRecord, but that way I created an infinite loop over my items data. When I added take(n), the problem was gone.
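A minimal reproduction of this failure mode, assuming nothing beyond tf.data: make_batched_features_dataset defaults to num_epochs=None, which repeats the data indefinitely, so any consumer that tries to exhaust the dataset (such as candidate indexing or metric computation) never finishes. Dataset.range(...).repeat() stands in for that repeating reader here:

```python
import tensorflow as tf

# Stand-in for the reader with num_epochs=None: an infinite dataset.
pages_ds = tf.data.Dataset.range(5).repeat()

# Iterating pages_ds directly would never terminate. Bounding it with
# take(n) reproduces the fix described above.
finite_pages_ds = pages_ds.take(5)
print(len(list(finite_pages_ds.as_numpy_iterator())))  # 5
```

Passing num_epochs=1 to make_batched_features_dataset achieves the same thing at the source, without needing to know n.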
Problem
Hello!
I faced a problem when I tried to train the Retrieval task: the process just stops and nothing happens.
I tried to decrease the number of candidates for the Retrieval task from 20K to 200 examples, but it did not change the situation.
I am trying to use the Rating task to classify user likes/dislikes of pages, so I filtered out everything but positive candidates for the Retrieval task.
I tried model parallelism and used a second GPU for the Retrieval task, but that approach did not work either.
How can I solve this issue? Maybe I can use batch metrics, since I already have positive and negative ratings?
ENV
Code: