pytorch / torchrec

PyTorch domain library for recommendation systems
https://pytorch.org/torchrec/

Difference in model implementation and accuracy discrepancy between TorchRec and PyTorch DLRMs #1001

Open allenfengjr opened 1 year ago

allenfengjr commented 1 year ago

Hello,

I’m currently testing two versions of DLRM: one implemented in PyTorch and one implemented in TorchRec. I observed an accuracy discrepancy between the two versions during training.

Specifically, I have been training the two DLRM models on the Criteo 1TB dataset; after one epoch, the testing accuracy of the TorchRec version converges to 96.6%. Please see the figure below (the x-axis is the iteration number within one epoch).

[Figure: DLRM testing accuracy vs. iteration]

However, based on the reference figure from the PyTorch DLRM repo, the testing accuracy of the PyTorch version is supposed to be around 81%.

Beyond the accuracy, I also checked the training/validation loss. The loss values I got from the TorchRec version (about 0.13; see the figure below) are much lower than the reference loss values (about 0.42) given by the PyTorch version.

[Figure: DLRM training/validation loss vs. iteration]

I also verified that after one epoch the AUROC of the TorchRec version reaches 0.802, which is consistent with the number reported in the TorchRec DLRM repo.
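As an aside, a high accuracy and a ~0.8 AUROC are not mutually inconsistent on heavily imbalanced click data: accuracy tracks class balance while AUROC tracks ranking quality. Here is a synthetic sketch of that effect (the ~3.4% positive rate and the score distribution are assumptions for illustration, not Criteo measurements):

```python
import torch
from sklearn.metrics import roc_auc_score

# Synthetic, illustrative labels/scores only; the positive rate is assumed.
torch.manual_seed(0)
labels = (torch.rand(100_000) < 0.034).float()     # ~3.4% positives
scores = torch.rand(100_000) * 0.2 + labels * 0.1  # weakly informative scores

preds = (scores > 0.5).float()  # threshold never fires -> all-negative preds
print(f"accuracy: {(preds == labels).float().mean():.3f}")  # ~0.966, from imbalance alone
print(f"auroc:    {roc_auc_score(labels.numpy(), scores.numpy()):.3f}")  # well above 0.5
```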

Below is my configuration. It is similar to the configuration provided in the reference doc; the only difference is that I used 16 GPUs across 4 nodes. I also checked the MD5 checksums of the training data files, and they exactly match the checksums in the checksum file.

```sh
srun -n 16 python -u dlrm_main.py --epochs=1 \
    --in_memory_binary_criteo_path="/N/scratch/haofeng/TB/processed" \
    --num_embeddings_per_feature "45833188,36746,17245,7413,20243,3,7114,1441,62,29275261,1572176,345138,10,2209,11267,128,4,974,14,48937457,11316796,40094537,452104,12606,104,35" \
    --embedding_dim 128 \
    --batch_size 8192 \
    --learning_rate 0.05 \
    --over_arch_layer_sizes "1024,1024,512,256,1" \
    --dense_arch_layer_sizes "512,256,128" \
    --shuffle_batches \
    --print_sharding_plan \
    --mmap_mode
```

I also tried different learning rates (0.05, 0.5, 1, 7.5, 15), but the resulting accuracies are still much higher than the reference accuracy of around 81%.

I’m wondering if anyone knows where the accuracy discrepancy could be coming from. Is there any reference for the training/testing accuracy of the TorchRec DLRM version? What is the main difference between the two models?

Thanks, Hao

colin2328 commented 1 year ago

Hi Hao, if you are running those two examples, one is using DLRMv2 (i.e., a different architecture that includes DCNv2 and the Adagrad optimizer) and the other is using DLRM (v1).
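For concreteness, here is a minimal sketch of the two interaction types, which is where the architectures differ most. This is illustrative PyTorch only, not the actual torchrec or dlrm_s_pytorch.py code, and the shapes and low-rank DCNv2 options are simplified:

```python
import torch
import torch.nn as nn

def dot_interaction(dense: torch.Tensor, sparse: torch.Tensor) -> torch.Tensor:
    """DLRM (v1) feature interaction: pairwise dot products between the
    bottom-MLP output and each embedding. dense: (B, D), sparse: (B, F, D)."""
    combined = torch.cat([dense.unsqueeze(1), sparse], dim=1)  # (B, F+1, D)
    dots = torch.bmm(combined, combined.transpose(1, 2))       # (B, F+1, F+1)
    i, j = torch.triu_indices(dots.size(1), dots.size(2), offset=1)
    return dots[:, i, j]  # upper triangle: one score per feature pair

class CrossLayerV2(nn.Module):
    """One DCNv2 cross layer, the kind of block DLRMv2 stacks instead:
    x_{l+1} = x_0 * (W @ x_l + b) + x_l."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.linear(xl) + xl
```

On top of the interaction layer, DLRMv2 also trains with Adagrad rather than SGD, so identical loss/accuracy curves should not be expected even on identical data.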

samiwilf commented 1 year ago

@allenfengjr

Sub-sampled and non-sub-sampled results are not supposed to be identical. The reference figure is titled "Terabyte Data (sub-sampled=0.875)"; note the sub-sampled=0.875. Sub-sampling alters the prediction metrics, as the sketch below illustrates.
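As a rough illustration of that point: dropping 87.5% of the negatives shifts the class balance, and with it the accuracy of otherwise unchanged predictions. The positive rate below is an assumed ballpark, not a measured Criteo statistic:

```python
# Hypothetical numbers: how negative sub-sampling shifts class balance.
pos_rate = 0.034        # assumed ~3.4% positive (click) rate
neg_keep = 1.0 - 0.875  # sub-sampled=0.875 keeps 12.5% of negatives

# Positive fraction after dropping 87.5% of the negatives:
pos_frac_sub = pos_rate / (pos_rate + (1.0 - pos_rate) * neg_keep)

# A predictor that almost always says "no click" is ~96.6% accurate on the
# full data but much less accurate on the sub-sampled data, with no change
# to the model at all.
print(f"always-negative accuracy, full data:        {1.0 - pos_rate:.3f}")
print(f"always-negative accuracy, sub-sampled data: {1.0 - pos_frac_sub:.3f}")
```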

The global batch sizes also do not match. Running 16 instances, each with a local batch size of 8192, gives a global batch size of 131,072, which is larger than any global batch size shown in the table. One thing to watch out for is that --batch_size is the per-rank (local) batch size in torchrec/dlrm_main.py, while --mini-batch-size is the global batch size in dlrm_s_pytorch.py.
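A quick sketch of that bookkeeping (the world size matches the srun -n 16 command above; the reference global batch size is an assumed example value, so check it against the table you are comparing to):

```python
# Per-rank vs. global batch size across the two DLRM scripts (sketch).
num_nodes, gpus_per_node = 4, 4
world_size = num_nodes * gpus_per_node  # 16 ranks, as in `srun -n 16`

local_batch_size = 8192               # torchrec --batch_size is per rank
print(world_size * local_batch_size)  # global batch size: 131072

# To match a dlrm_s_pytorch.py run, where --mini-batch-size is already global
# (2048 is an assumed example value, not a documented default):
reference_global = 2048
print(reference_global // world_size)  # --batch_size 128 per rank
```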

Besides the two points above, your command to run dlrm_main.py looks good if you're attempting to run the same settings on both dlrm_s_pytorch.py and torchrec_dlrm/dlrm_main.py. The command uses the optimizer (SGD) and the interaction type (dot interaction) that both scripts run by default.