nippleshot opened 1 year ago
Hmm, the first issue seems like a potential edge case from the recent change to input_dist. @joshuadeng, do you know off the top of your head?
The second issue looks like the settings aren't exactly the same? num_epochs, batch_size, etc.
@YLGH For the 2nd issue: when I used the same argument settings as that method, the average loss and metrics were almost flat across epochs. The default arguments look a little better to me, but neither run is well trained.
torchx run -s local_cwd dist.ddp -j 1x1 --script bert4rec_main.py -- --dataset_name ml-1m --dataset_path /datasets/ml-1m --lr 0.001 --mask_prob 0.2 --weight_decay 0.00001 --train_batch_size 256 --val_batch_size 256 --test_batch_size 256 --max_len 100 --emb_dim 256 --num_epochs 30
Epoch 1, average loss 7.620365257446583
Epoch 10, average loss 7.540571250758329
Epoch 20, average loss 7.5394956561235285
Epoch 30, average loss 7.539466720360976
Epoch 1, metrics {'Recall@1': 0.012763843250771364, 'Recall@5': 0.060915227668980755, 'Recall@10': 0.11845531811316808, 'NDCG@5': 0.036595141515135765, 'NDCG@10': 0.05504566555221876}
Epoch 10, metrics {'Recall@1': 0.016010485201453168, 'Recall@5': 0.06351082772016525, 'Recall@10': 0.11540570172170798, 'NDCG@5': 0.039478599869956575, 'NDCG@10': 0.05607946729287505}
Epoch 20, metrics {'Recall@1': 0.01390316616743803, 'Recall@5': 0.059990062999228634, 'Recall@10': 0.1187722726414601, 'NDCG@5': 0.036320349046339594, 'NDCG@10': 0.055121896167596184}
Epoch 30, metrics {'Recall@1': 0.01286663922170798, 'Recall@5': 0.06471011508256197, 'Recall@10': 0.12051123877366383, 'NDCG@5': 0.03837353542136649, 'NDCG@10': 0.05636422669825455}
Hi @nippleshot, I'm unable to reproduce your error on a setup with V100 GPUs. Can you try running on random data to see if you get this error as well?
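One way to try random data without depending on the example's own dataset flags is to generate synthetic ml-1m-style interaction sequences and feed those through the pipeline. The helper below is hypothetical (it is not part of bert4rec_main.py); it just sketches padded, right-aligned item-ID sequences in the usual BERT4Rec layout, with 0 reserved for padding:

```python
import torch

def make_random_sequences(num_users=100, num_items=3706, max_len=200, seed=0):
    """Generate random item-ID sequences shaped like ml-1m user histories.

    Item IDs are 1..num_items; 0 is reserved for padding, matching the
    usual BERT4Rec convention of right-aligned histories.
    """
    g = torch.Generator().manual_seed(seed)
    lengths = torch.randint(10, max_len + 1, (num_users,), generator=g)
    seqs = torch.zeros(num_users, max_len, dtype=torch.long)
    for u, n in enumerate(lengths.tolist()):
        # right-align the history so the most recent items sit at the end
        seqs[u, max_len - n:] = torch.randint(1, num_items + 1, (n,), generator=g)
    return seqs

batch = make_random_sequences()
print(batch.shape)  # torch.Size([100, 200])
```

If the multi-GPU RuntimeError reproduces on sequences like these, that would point at the sharded input_dist path rather than the ml-1m preprocessing.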
Hello @joshuadeng,
Hello, while trying to run the bert4rec example code I have faced 2 problems, and I hope I can get some feedback.
1) When I try to run the bert4rec model with multiple TITAN XP GPUs, it raises the following RuntimeError.
(I used the MovieLens 1M dataset.)
torchx run -s local_cwd dist.ddp -j 1x8 --script bert4rec_main.py -- --dataset_name ml-1m --dataset_path /datasets/ml-1m --lr 0.001 --mask_prob 0.2 --weight_decay 0.00001 --train_batch_size 64 --val_batch_size 64 --max_len 200 --emb_dim 64 --num_epochs 100
2) When I try to run the bert4rec model with a single TITAN XP GPU, it runs without any RuntimeError. However, the model metrics I got were much lower than those in Recommendation Metrics Reproduce, and I wonder why this happens.
Running command :
torchx run -s local_cwd dist.ddp -j 1x1 --script bert4rec_main.py -- --dataset_name ml-1m --dataset_path /datasets/ml-1m --num_epochs 1000
Model metrics and average loss per training epoch :
Epoch 1, metrics {'Recall@1': 0.0126953125, 'Recall@5': 0.0634765625, 'Recall@10': 0.12141927083333333, 'NDCG@5': 0.03733245619029427, 'NDCG@10': 0.05605129435813675}
Epoch 50, metrics {'Recall@1': 0.01220703125, 'Recall@5': 0.06711154516475897, 'Recall@10': 0.12320963541666667, 'NDCG@5': 0.039204553118906915, 'NDCG@10': 0.05708927156714102}
Epoch 100, metrics {'Recall@1': 0.010904947916666666, 'Recall@5': 0.05805121532951792, 'Recall@10': 0.11387803824618459, 'NDCG@5': 0.033518949057906866, 'NDCG@10': 0.05138459533918649}
Epoch 150, metrics {'Recall@1': 0.01123046875, 'Recall@5': 0.06141493058142563, 'Recall@10': 0.11474609375, 'NDCG@5': 0.035709500865777954, 'NDCG@10': 0.052741181144180395}
Epoch 190, metrics {'Recall@1': 0.014051649331425628, 'Recall@5': 0.06287977433142562, 'Recall@10': 0.11404079866285126, 'NDCG@5': 0.03795023405109532, 'NDCG@10': 0.05423267767764628}
...
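Since the reported Recall@K and NDCG@K stay flat, it may be worth sanity-checking the metric computation independently of the training loop. A minimal sketch (my own, not the example's implementation) for the standard leave-one-out setup with a single held-out target item per user:

```python
import torch

def recall_and_ndcg_at_k(scores: torch.Tensor, targets: torch.Tensor, k: int):
    """scores: (num_users, num_items) predicted scores, higher is better;
    targets: (num_users,) index of the single held-out item per user."""
    topk = scores.topk(k, dim=1).indices              # (num_users, k)
    hits = topk.eq(targets.unsqueeze(1))              # (num_users, k) bool
    hit_any = hits.any(dim=1).float()
    recall = hit_any.mean().item()
    # with one relevant item, NDCG@k = 1/log2(rank + 2) when it is in the top-k
    ranks = hits.float().argmax(dim=1)                # 0-based rank of the hit
    ndcg = (hit_any / torch.log2(ranks.float() + 2.0)).mean().item()
    return recall, ndcg
```

If scores from an untrained model already give numbers in this range (roughly 1/num_items for Recall@1), the flat curves would suggest the model is not learning rather than the metrics being miscomputed.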