Description
This issue occurs in LightGCN.fit() when running on a dataset of 7 million rows in which each user has at least 2 interactions. There are about 2 million unique users and around 50k unique items. Here is the config we used:
{
decay: 0.001,
batch_size: 512,
learning_rate: 0.003,
n_layers: 3,
epochs: 2,
embed_size: 52
}
We tried increasing the batch size, and the issue still persists. With smaller batch sizes (16 or 32), we stopped the process because it had been running for several hours without producing any output.
Also, when we tried running it with an even larger dataset (~20M), it failed in the ImplicitCF function itself.
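To put the small-batch runs in perspective, here is a rough count of optimizer steps per epoch for the batch sizes mentioned above. This is only a sketch: it assumes one epoch is a single pass over all 7M rows, which may not match exactly how LightGCN.fit() samples batches, and `steps_per_epoch` is a hypothetical helper, not part of the library.

```python
import math

N_ROWS = 7_000_000  # interaction rows reported in this issue


def steps_per_epoch(n_rows: int, batch_size: int) -> int:
    """Number of batches needed to cover n_rows once (assumed epoch definition)."""
    return math.ceil(n_rows / batch_size)


for bs in (16, 32, 512):
    print(f"batch_size={bs:>4}: {steps_per_epoch(N_ROWS, bs):,} steps per epoch")
```

Under that assumption, a batch size of 16 needs several hundred thousand steps per epoch, which would be consistent with the runs that produced no output for hours.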
Here is the GPU we used.
Here is the error log from training on 7M rows: 07_05_OOM_message.txt
Other Comments
We also tried training the model on a CPU machine, and it completed 2 epochs with a batch size of 32k in 4 hours.
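For reference, here is a back-of-envelope estimate of the memory needed just for the embeddings and the user-item graph at this scale. It is a sketch under stated assumptions, not a measurement: float32 values, one embedding table covering all users and items, one full-size output per propagation layer, and a symmetric sparse adjacency stored in COO form with int64 indices. It ignores gradients, optimizer state, and framework workspace, so actual usage will be higher. The helper names are hypothetical.

```python
N_USERS = 2_000_000   # approximate unique users from this issue
N_ITEMS = 50_000      # approximate unique items from this issue
N_ROWS = 7_000_000    # interaction rows
EMBED_SIZE = 52
N_LAYERS = 3

FLOAT_BYTES = 4       # assumption: float32 values
INDEX_BYTES = 8       # assumption: int64 row/column indices


def embedding_table_bytes(n_users: int, n_items: int, embed_size: int) -> int:
    """Bytes for one (users + items) x embed_size float32 table."""
    return (n_users + n_items) * embed_size * FLOAT_BYTES


def layer_outputs_bytes(n_users: int, n_items: int, embed_size: int, n_layers: int) -> int:
    """Bytes for the initial embeddings plus one full-size output per layer."""
    return (n_layers + 1) * embedding_table_bytes(n_users, n_items, embed_size)


def adjacency_bytes(n_rows: int) -> int:
    """Bytes for a symmetric COO adjacency: each interaction appears twice,
    and each nonzero stores one value and two indices."""
    nnz = 2 * n_rows
    return nnz * (FLOAT_BYTES + 2 * INDEX_BYTES)


table = embedding_table_bytes(N_USERS, N_ITEMS, EMBED_SIZE)
layers = layer_outputs_bytes(N_USERS, N_ITEMS, EMBED_SIZE, N_LAYERS)
adj = adjacency_bytes(N_ROWS)

print(f"one embedding table : {table / 1e6:,.1f} MB")
print(f"all layer outputs   : {layers / 1e6:,.1f} MB")
print(f"sparse adjacency    : {adj / 1e6:,.1f} MB")
```

Note that none of these terms depends on the batch size, which may be relevant to why changing the batch size did not make the error go away.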