recommenders-team / recommenders

Best Practices on Recommendation Systems
https://recommenders-team.github.io/recommenders/intro.html
MIT License
18.83k stars, 3.07k forks

[ASK] OOM issues when training LightGCN on GPU #1804

Open vaibhavsapkal opened 2 years ago

vaibhavsapkal commented 2 years ago

Description

This issue occurs in LightGCN.fit() when training on a dataset of 7 million rows, where each user has at least 2 interactions. There are about 2 million unique users and around 50k unique items. Here is the config we used: { decay: 0.001, batch_size: 512, learning_rate: 0.003, n_layers: 3, epochs: 2, embed_size: 52 }
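For a sense of scale, here is a rough back-of-envelope sketch based only on the figures quoted above. It is not a diagnosis of the OOM. It assumes float32 values (4 bytes each), and the function and variable names are made up for illustration rather than taken from the library. The dense user-by-item figure is a hypothetical comparison, included only to show which structures grow fastest with these counts.

```python
# Rough memory arithmetic from the numbers in this issue.
# Assumptions (not confirmed by the library): float32 storage, and one
# embedding table per propagation layer plus the input layer.

BYTES_PER_FLOAT32 = 4


def embedding_bytes(n_users, n_items, embed_size, copies=1):
    """Bytes needed for `copies` embedding tables covering all users and items."""
    return (n_users + n_items) * embed_size * BYTES_PER_FLOAT32 * copies


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


n_users = 2_000_000   # "about 2 million unique users"
n_items = 50_000      # "around 50k" unique items
embed_size = 52
n_layers = 3

# A single embedding table for users and items together.
base = embedding_bytes(n_users, n_items, embed_size)

# One table per layer output plus the input embeddings (assumption).
per_layer = embedding_bytes(n_users, n_items, embed_size, copies=n_layers + 1)

# Hypothetical: a dense score matrix with one float per (user, item) pair.
dense_scores = n_users * n_items * BYTES_PER_FLOAT32

print(f"one embedding table:     {to_gib(base):8.2f} GiB")
print(f"{n_layers + 1} tables (all layers):  {to_gib(per_layer):8.2f} GiB")
print(f"dense user x item matrix:{to_gib(dense_scores):8.2f} GiB")
```

Under these assumptions the embedding tables themselves are small compared with anything that scales as users times items, so comparing the allocation size reported in the error log against these numbers may help narrow down which tensor is responsible.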

We tried increasing the batch size, but the issue still persists. With smaller batch sizes (16 or 32), we stopped the process because it had been running for several hours without any output.
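The long runtimes at small batch sizes are consistent with simple step-count arithmetic. The sketch below assumes one epoch is a single pass over the roughly 7 million interaction rows and that "32k" (the batch size from the CPU run reported in this post) means 32,000. Both are assumptions, and `steps_per_epoch` is an illustrative helper, not a library function.

```python
import math

# Approximate row count from this issue ("7 million rows").
N_ROWS = 7_000_000


def steps_per_epoch(n_rows, batch_size):
    """Number of optimizer steps needed to cover n_rows once."""
    return math.ceil(n_rows / batch_size)


# Batch sizes mentioned in this issue; 32_000 is an assumed reading of "32k".
for batch_size in (16, 32, 512, 32_000):
    steps = steps_per_epoch(N_ROWS, batch_size)
    print(f"batch_size={batch_size:>6}: {steps:>7} steps per epoch")
```

At a batch size of 16 an epoch needs roughly 2,000 times as many steps as at 32,000, so several hours without output at the smallest sizes does not by itself indicate a hang.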

Also, when we tried running it with an even larger dataset (~20M), it failed in the ImplicitCF function itself.

Here is the GPU we used: [image]

Here is the error log from training on 7M rows: 07_05_OOM_message.txt

Other Comments

We also tried training the model on a CPU machine, and it completed 2 epochs with a batch size of 32k in 4 hours.

christopheralex commented 1 year ago

Any update on this? I am running into the same issue.