wg-li opened this issue 1 year ago
Hi,
the sign of the loss depends on the specific loss function you used.
As for the loading error, it seems there is a mismatch between the input data and the model parameters; you could check that the data format used for pretraining matches the one used for finetuning.
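For example, self-supervised objectives based on negative cosine similarity (SimSiam/BYOL-style) are bounded in [-1, 1] and decrease toward -1 as the representations align, so negative loss values are expected. A minimal sketch (the function name and tensor shapes are illustrative, not FedHSSL's actual API):

```python
import torch
import torch.nn.functional as F

def neg_cosine_loss(p, z):
    # Negative cosine similarity, as used in SimSiam-style SSL:
    # the loss is bounded in [-1, 1] and reaches -1 when the
    # predictor output p is perfectly aligned with the target z.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

p = torch.randn(8, 32)
z = torch.randn(8, 32)
loss = neg_cosine_loss(p, z)      # some value in [-1, 1]
perfect = neg_cosine_loss(z, z)   # perfectly aligned -> close to -1.0
```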
Best.
Hello,
I found two places that need to be carefully checked:
Best
Hello,
I ran into another two problems when running experiments on the Avazu dataset:
08/11 02:07:09 PM client generated: 2
08/11 02:07:09 PM Cross-Party Train Epoch 0, training on aligned data, LR: 0.1, sample: 16384
08/11 02:07:10 PM Cross-Party SSL Train Epoch 0, client loss aligned: [-0.16511965772951953, -0.152420010213973]
08/11 02:07:10 PM Local SSL Train Epoch 0, training on local data, sample: 80384
08/11 02:07:22 PM Local SSL Train Epoch 0, client loss local: [-0.5874887084815307, -0.5748279593279881]
08/11 02:07:22 PM Local SSL Train Epoch 0, AGG MODE pma, client loss agg: []
08/11 02:07:24 PM ###### Valid Epoch 0 Start #####
08/11 02:07:24 PM Valid Epoch 0, valid client loss aligned: [-0.3176240861415863, -0.22815129309892654]
08/11 02:07:24 PM Valid Epoch 0, valid client loss local: [-0.22939987406134604, -0.22190943509340286]
08/11 02:07:24 PM Valid Epoch 0, valid client loss regularized: [0.0, 0.0]
08/11 02:07:24 PM Valid Epoch 0, Loss_aligned -0.273 Loss_local -0.226
File "/data/nfs/user/liwg/vfl/fedhssl/FedHSSL/models/model_templates.py", line 206, in load_encoder_cross
    self.encoder_cross.load_state_dict(torch.load(load_path, map_location=device))
File "/data/nfs/miniconda/envs/liwg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1482, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DNNFM:
    size mismatch for embedding_dict.device_ip.weight: copying a param with shape torch.Size([70769, 32]) from checkpoint, the shape in current model is torch.Size([70768, 32]).
    size mismatch for embedding_dict.device_model.weight: copying a param with shape torch.Size([3066, 32]) from checkpoint, the shape in current model is torch.Size([3065, 32]).
    size mismatch for embedding_dict.C14.weight: copying a param with shape torch.Size([1699, 32]) from checkpoint, the shape in current model is torch.Size([1698, 32]).
Each embedding table in the pretrained encoder_cross checkpoint is one row larger than the corresponding table in the current model (e.g. 70769 vs. 70768 for device_ip), i.e. the vocabulary sizes are off by one.
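An off-by-one like this typically means the vocabulary size of each sparse feature was computed differently (or on a different data split) between pretraining and finetuning, e.g. `max() + 1` vs. `nunique()`. If re-running pretraining with consistent preprocessing is not an option, one possible workaround is to load only the shape-compatible tensors. The helper below is a hypothetical sketch, not part of FedHSSL:

```python
import torch

def load_matching_params(model, ckpt_path, device="cpu"):
    # Copy only the checkpoint tensors whose shapes match the current
    # model, leaving mismatched ones (e.g. the off-by-one embedding
    # tables) at their current initialisation. Returns the skipped keys
    # so the mismatches can be inspected.
    state = torch.load(ckpt_path, map_location=device)
    own = model.state_dict()
    compatible = {k: v for k, v in state.items()
                  if k in own and v.shape == own[k].shape}
    own.update(compatible)
    model.load_state_dict(own)
    return [k for k in state if k not in compatible]
```

Note that any skipped embedding rows stay randomly initialised, so the affected features effectively lose their pretrained representations; fixing the preprocessing so both stages agree on the vocabulary size is the cleaner solution.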