jorghyq2016 / FedHSSL

The implementation of the FedHSSL algorithm published in the paper "A Hybrid Self-Supervised Learning Framework for Vertical Federated Learning".

Negative loss and mismatched dimensions when loading pretrained weights #2

Open wg-li opened 1 year ago

wg-li commented 1 year ago

Hello,

I ran into two more problems when carrying out experiments on the Avazu dataset:

  1. When I do the pretraining step, it always shows a negative loss, which seems strange even though it keeps decreasing.

08/11 02:07:09 PM client generated: 2
08/11 02:07:09 PM Cross-Party Train Epoch 0, training on aligned data, LR: 0.1, sample: 16384
08/11 02:07:10 PM Cross-Party SSL Train Epoch 0, client loss aligned: [-0.16511965772951953, -0.152420010213973]
08/11 02:07:10 PM Local SSL Train Epoch 0, training on local data, sample: 80384
08/11 02:07:22 PM Local SSL Train Epoch 0, client loss local: [-0.5874887084815307, -0.5748279593279881]
08/11 02:07:22 PM Local SSL Train Epoch 0, AGG MODE pma, client loss agg: []
08/11 02:07:24 PM ###### Valid Epoch 0 Start #####
08/11 02:07:24 PM Valid Epoch 0, valid client loss aligned: [-0.3176240861415863, -0.22815129309892654]
08/11 02:07:24 PM Valid Epoch 0, valid client loss local: [-0.22939987406134604, -0.22190943509340286]
08/11 02:07:24 PM Valid Epoch 0, valid client loss regularized: [0.0, 0.0]
08/11 02:07:24 PM Valid Epoch 0, Loss_aligned -0.273 Loss_local -0.226

  2. When I do the finetuning step, it shows the error below:

File "/data/nfs/user/liwg/vfl/fedhssl/FedHSSL/models/model_templates.py", line 206, in load_encoder_cross self.encoder_cross.load_state_dict(torch.load(load_path, map_location=device)) File "/data/nfs/miniconda/envs/liwg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1482, in load_state_dict raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format( RuntimeError: Error(s) in loading state_dict for DNNFM: size mismatch for embedding_dict.device_ip.weight: copying a param with shape torch.Size([70769, 32]) from checkpoint, the shape in current model is torch.Size([70768, 32]). size mismatch for embedding_dict.device_model.weight: copying a param with shape torch.Size([3066, 32]) from checkpoint, the shape in current model is torch.Size([3065, 32]). size mismatch for embedding_dict.C14.weight: copying a param with shape torch.Size([1699, 32]) from checkpoint, the shape in current model is torch.Size([1698, 32]).

The pretrained encoder_cross embedding weights are one row larger (in the first dimension) than the finetuning model expects.
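For context, this kind of error is the typical symptom of an embedding table whose vocabulary size differs by one between the pretrained checkpoint and the finetuning model. The following minimal sketch reproduces it with a hypothetical TinyModel (an illustration, not the actual DNNFM class):

```python
# Minimal reproduction of the size-mismatch error: an embedding built with
# vocab_size + 1 rows at pretraining time cannot be loaded into a model that
# was built with vocab_size rows at finetuning time.
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)

pretrain_model = TinyModel(vocab_size=70769)   # e.g. data[feat].nunique() + 1
finetune_model = TinyModel(vocab_size=70768)   # e.g. data[feat].nunique()

try:
    finetune_model.load_state_dict(pretrain_model.state_dict())
except RuntimeError as e:
    print(e)  # size mismatch for embedding.weight: [70769, 32] vs [70768, 32]
```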

jorghyq2016 commented 1 year ago

Hi,

  1. The sign of the loss depends on the specific loss function you use (see the sketch after this list).

  2. There seems to be a mismatch between the input data and the model parameters; you could check the format of the data you used for pretraining and finetuning.
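As an illustration of point 1: several self-supervised objectives are negative by construction, so a negative, decreasing pretraining loss can be expected behaviour rather than a bug. Below is a minimal sketch of a SimSiam-style negative cosine similarity loss; it is an assumption for illustration, not necessarily the exact loss FedHSSL configures, and all names in it are hypothetical.

```python
# Sketch only: a SimSiam-style negative cosine similarity loss lies in [-1, 1]
# and is typically negative once the two views start to align, so a decreasing
# negative value during SSL pretraining is normal.
import torch
import torch.nn.functional as F

def negative_cosine_similarity(p, z):
    # p: predictor output for one view, z: projection of the other view
    z = z.detach()  # stop-gradient on the target branch
    return -F.cosine_similarity(p, z, dim=-1).mean()

p = torch.randn(16, 32)
z = torch.randn(16, 32)
loss = negative_cosine_similarity(p, z)
print(loss)  # near 0 for random inputs; approaches -1 as the views align
```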

Best.

wg-li commented 1 year ago

Hello,

I found two places that need to be checked carefully:

  1. In line 214 of prepare_experiments.py, "train_dataset_aug" is not defined when exp_type=='cls', which should be the case for the vanilla classification or finetuning step.
  2. In line 197 of ctr_dataset.py, why is it "data[feat].nunique()+1" for feature_columns in AvazuAug2party, compared to "data[feat].nunique()" in Avazu2party? This off-by-one, rather than the data format, is what actually causes the aforementioned dimension mismatch (see the sketch after this list).
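To make point 2 concrete, here is a small sketch of the off-by-one (illustrative pandas code, not the repo's actual ctr_dataset.py):

```python
# Sketch of the discrepancy: sizing a feature with nunique() + 1 in one dataset
# class and nunique() in the other yields embedding tables that differ by one
# row, which is exactly the load_state_dict size mismatch reported above.
import pandas as pd

data = pd.DataFrame({"device_ip": ["a", "b", "c", "a"]})

vocab_pretrain = data["device_ip"].nunique() + 1   # AvazuAug2party-style sizing
vocab_finetune = data["device_ip"].nunique()       # Avazu2party-style sizing

print(vocab_pretrain, vocab_finetune)  # 4 vs 3 -> embeddings of (4, d) vs (3, d)

# Using the same convention in both dataset classes (or reserving index 0 for
# padding/unknown in both) keeps the checkpoint and the finetuning model
# compatible.
```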

Best.