marrlab / DomainLab

modular domain generalization: https://pypi.org/project/domainlab/
https://marrlab.github.io/DomainLab/
MIT License

potential model name collision on Slurm cluster: after_all: domainlab/compos/exp/exp_utils.py, line 99, in load #227

Closed smilesun closed 11 months ago

smilesun commented 1 year ago

Currently, the code looks like this:

    def after_all(self):
        """
        After training is done: load the persisted (best) model and evaluate it.
        """
        model_ld = None
        try:
            # load the model that was persisted to disk during training
            model_ld = self.exp.visitor.load()
        except FileNotFoundError:
            # this can happen if the loss keeps increasing and the model never gets selected
            return

        model_ld = model_ld.to(self.device)
        model_ld.eval()
        print("persisted model performance metric: \n")
        metric_te = model_ld.cal_perf_metric(self.loader_te, self.device)
        self.dump_prediction(model_ld, metric_te)
        self.exp.visitor(metric_te)
        # the prediction dump of the test domain is essential to verify the prediction results

The error reported here is that, in after_all, the model loaded from disk does not match the in-memory model in size for DIVA and HDUVA; see below.
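
For reference, here is a minimal sketch (an assumed reconstruction, not DomainLab's actual naming code in exp_utils.py) of why the persisted file name can collide between Slurm jobs: the name seen in the logs encodes task, test domain, model, git hash, timestamp and seed, but none of the sampled hyperparameters, so two benchmark jobs that differ only in, e.g., zy_dim/zd_dim can end up writing and loading the same checkpoint file.

    # Hypothetical reconstruction of the naming scheme matching the pattern in the logs;
    # the hyperparameter configuration never enters the name, so jobs that differ only in
    # hyperparameters map to the same file.
    def make_model_name(task, te_domain, model, git_hash, stamp, seed):
        return f"{task}_te_{te_domain}_{model}_{git_hash}_{stamp}_seed_{seed}"

    cfg_job_a = {"zy_dim": 32, "zd_dim": 32}   # one Slurm job
    cfg_job_b = {"zy_dim": 64, "zd_dim": 64}   # another job, same seed, started in the same second

    name_a = make_model_name("pathlist", "sketch", "hduva", "ed330e4", "2023_06_06_14_55_52", 0)
    name_b = make_model_name("pathlist", "sketch", "hduva", "ed330e4", "2023_06_06_14_55_52", 0)
    assert name_a == name_b  # collision: cfg_job_a / cfg_job_b never enter the name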

In the Slurm log file

run_experiment-index=13-11936875.out

the output shows no error; training simply finished via early stopping.

processing dataset for  hduva
b'ed330e4'
model name: pathlist_te_sketch_hduva_bed330e4_2023md_06md_06_14_55_52_seed_0

with, towards the end of training:


epoch: 321  now:  2023-06-06 21:31:53.171784 epoch time:  0:01:13.399267 used:  6:35:56.573210 model:  pathlist_te_sketch_hduva_bed330e4_2023md_06md_06_14_55_52_seed_0
epoch: 322
pooled train domains performance:
{
    'acc': 1.0,
    'precision': 1.0,
    'recall': 1.0,
    'specificity': 1.0,
    'f1': 1.0,
    'auroc': 1.0
}
confusion matrix:
     0    1    2    3    4    5     6
0  860    0    0    0    0    0     0
1    0  821    0    0    0    0     0
2    0    0  730    0    0    0     0
3    0    0    0  453    0    0     0
4    0    0    0    0  649    0     0
5    0    0    0    0    0  773     0
6    0    0    0    0    0    0  1154
out of domain test performance:
{
    'acc': 0.5907675,
    'precision': 0.576159,
    'recall': 0.53352654,
    'specificity': 0.92936075,
    'f1': 0.523332,
    'auroc': 0.8630302
}
confusion matrix:
     0    1    2    3    4   5    6
0  180  231   17    1  197   2   66
1   30  585    2    0   22   0   26
2   45   75  237   15  188  13  104
3    2   22   14  501    1   3    4
4   20  123   28    2  532  10   19
5    0   35    1    0    0  35    0
6    0  101   26    0    0   0   16
early stop counter:  6
loss:3847192432.0, best loss: 3846092536.0
epoch: 322  now:  2023-06-06 21:33:05.517891 epoch time:  0:01:12.346107 used:  6:37:08.919317 model:  pathlist_te_sketch_hduva_bed330e4_2023md_06md_06_14_55_52_seed_0
early stop trigger
Experiment finished at epoch: 322 with time: 6:37:08.919317 at 2023-06-06 21:33:05.517891

But the error file run_experiment-index=13-11936875.err shows:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=100000, mem_mib=95368, disk_mb=1000, disk_mib=954, nvidia_gpu=1
Select jobs to execute...

[Tue Jun  6 14:52:26 2023]
rule run_experiment:
    input: zoutput/benchmarks/pacs_benchmark/hyperparameters.csv
    output: zoutput/benchmarks/pacs_benchmark/rule_results/13.csv
    jobid: 0
    reason: Missing output files: zoutput/benchmarks/pacs_benchmark/rule_results/13.csv
    wildcards: index=13
    resources: mem_mb=100000, mem_mib=95368, disk_mb=1000, disk_mib=954, tmpdir=/tmp, partition=gpu_p, qos=gpu, nvidia_gpu=1

/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric `AUROC` will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
  warnings.warn(*args, **kwargs)
/home/aih/xudong.sun/domainlab4matchdg/domainlab/algos/trainers/train_visitor.py:31: UserWarning: hyper-parameter scheduler not set,                           going to use default Warmpup and epoch update
  warnings.warn("hyper-parameter scheduler not set, \
[Tue Jun  6 21:33:06 2023]
Error in rule run_experiment:
    jobid: 0
    input: zoutput/benchmarks/pacs_benchmark/hyperparameters.csv
    output: zoutput/benchmarks/pacs_benchmark/rule_results/13.csv

RuleException:
RuntimeError in file /home/aih/xudong.sun/domainlab4matchdg/domainlab/exp_protocol/benchmark.smk, line 121:
Error(s) in loading state_dict for ModelHDUVA:
    size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_loc.0.weight: copying a param with shape torch.Size([32, 4096]) from checkpoint, the shape in current model is torch.Size([64, 4096]).
    size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_loc.0.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_scale.0.weight: copying a param with shape torch.Size([32, 4096]) from checkpoint, the shape in current model is torch.Size([64, 4096]).
    size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_scale.0.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for encoder.net_infer_zx.fc_loc.0.weight: copying a param with shape torch.Size([64, 179776]) from checkpoint, the shape in current model is torch.Size([32, 179776]).
    size mismatch for encoder.net_infer_zx.fc_loc.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for encoder.net_infer_zx.fc_scale.0.weight: copying a param with shape torch.Size([64, 179776]) from checkpoint, the shape in current model is torch.Size([32, 179776]).
    size mismatch for encoder.net_infer_zx.fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for encoder.net_infer_zy.net_fc_mean.0.weight: copying a param with shape torch.Size([32, 4096]) from checkpoint, the shape in current model is torch.Size([64, 4096]).
    size mismatch for encoder.net_infer_zy.net_fc_mean.0.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for encoder.net_infer_zy.net_fc_scale.0.weight: copying a param with shape torch.Size([32, 4096]) from checkpoint, the shape in current model is torch.Size([64, 4096]).
    size mismatch for encoder.net_infer_zy.net_fc_scale.0.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for decoder.net_fc_z2flat_img.0.h.weight: copying a param with shape torch.Size([150528, 131]) from checkpoint, the shape in current model is torch.Size([150528, 163]).
    size mismatch for decoder.net_fc_z2flat_img.0.g.weight: copying a param with shape torch.Size([150528, 131]) from checkpoint, the shape in current model is torch.Size([150528, 163]).
    size mismatch for net_p_zy.net_linear_bn_relu.0.weight: copying a param with shape torch.Size([32, 7]) from checkpoint, the shape in current model is torch.Size([64, 7]).
    size mismatch for net_p_zy.net_linear_bn_relu.1.weight: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for net_p_zy.net_linear_bn_relu.1.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for net_p_zy.net_linear_bn_relu.1.running_mean: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for net_p_zy.net_linear_bn_relu.1.running_var: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for net_p_zy.fc_loc.0.weight: copying a param with shape torch.Size([32, 32]) from checkpoint, the shape in current model is torch.Size([64, 64]).
    size mismatch for net_p_zy.fc_loc.0.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for net_p_zy.fc_scale.0.weight: copying a param with shape torch.Size([32, 32]) from checkpoint, the shape in current model is torch.Size([64, 64]).
    size mismatch for net_p_zy.fc_scale.0.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for net_classif_y.op_linear.weight: copying a param with shape torch.Size([7, 32]) from checkpoint, the shape in current model is torch.Size([7, 64]).
    size mismatch for net_p_zd.net_linear_bn_relu.0.weight: copying a param with shape torch.Size([32, 3]) from checkpoint, the shape in current model is torch.Size([64, 3]).
    size mismatch for net_p_zd.net_linear_bn_relu.1.weight: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for net_p_zd.net_linear_bn_relu.1.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for net_p_zd.net_linear_bn_relu.1.running_mean: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for net_p_zd.net_linear_bn_relu.1.running_var: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for net_p_zd.fc_loc.0.weight: copying a param with shape torch.Size([32, 32]) from checkpoint, the shape in current model is torch.Size([64, 64]).
    size mismatch for net_p_zd.fc_loc.0.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for net_p_zd.fc_scale.0.weight: copying a param with shape torch.Size([32, 32]) from checkpoint, the shape in current model is torch.Size([64, 64]).
    size mismatch for net_p_zd.fc_scale.0.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
  File "/home/aih/xudong.sun/domainlab4matchdg/domainlab/exp_protocol/benchmark.smk", line 121, in __rule_run_experiment
  File "/home/aih/xudong.sun/domainlab4matchdg/domainlab/exp_protocol/run_experiment.py", line 109, in run_experiment
  File "/home/aih/xudong.sun/domainlab4matchdg/domainlab/compos/exp/exp_main.py", line 70, in execute
  File "/home/aih/xudong.sun/domainlab4matchdg/domainlab/algos/trainers/a_trainer.py", line 106, in post_tr
  File "/home/aih/xudong.sun/domainlab4matchdg/domainlab/algos/observers/c_obvisitor_cleanup.py", line 12, in after_all
  File "/home/aih/xudong.sun/domainlab4matchdg/domainlab/algos/observers/b_obvisitor.py", line 61, in after_all
  File "/home/aih/xudong.sun/domainlab4matchdg/domainlab/compos/exp/exp_utils.py", line 99, in load
  File "/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
  File "/home/aih/xudong.sun/anaconda3/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
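
For context, the traceback bottoms out in torch.nn.Module.load_state_dict, which raises exactly this kind of RuntimeError when the checkpoint on disk was written by a model built with different layer sizes. A toy stand-in (not the DomainLab HDUVA model) reproduces the failure mode:

    # Toy reproduction of the "size mismatch" failure: a checkpoint written with
    # zy_dim=32 cannot be loaded into a model rebuilt with zy_dim=64. The layer
    # below is a stand-in for the fc_loc / fc_scale heads named in the trace.
    import torch
    from torch import nn

    def build_head(zy_dim):
        return nn.Sequential(nn.Linear(4096, zy_dim))

    torch.save(build_head(zy_dim=32).state_dict(), "head.pt")  # written by one job

    model_current = build_head(zy_dim=64)                      # rebuilt by another job
    try:
        model_current.load_state_dict(torch.load("head.pt"))
    except RuntimeError as err:
        # "size mismatch for 0.weight: copying a param with shape torch.Size([32, 4096]) ..."
        print(err)
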
smilesun commented 1 year ago

Meanwhile, on the Snakemake console, Snakemake stops making further submissions:

Select jobs to execute...

10, Task2, hduva, 333, sketch, 2, "{'gamma_y': 192400, 'zx_dim': 64, 'zy_dim': 64, 'zd_dim': 64}", 0.59274995, 0.60296017, 0.52567023, 0.929962, 0.52843946, 0.8426663                                                                                                                   
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=100000, mem_mib=95368, disk_mb=1000, disk_mib=954, nvidia_gpu=1
Select jobs to execute...

    reason: Missing output files: zoutput/benchmarks/pacs_benchmark/rule_results/10.csv; Input files updated by another job: zoutput/benchmarks/pacs_benchmark/hyperparameters.csv
    wildcards: index=10
    resources: mem_mb=100000, mem_mib=95368, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, partition=gpu_p, qos=gpu, nvidia_gpu=1

Submitted job 14 with external jobid 'Submitted batch job 11947037'.
[Tue Jun  6 21:33:15 2023]
Error in rule run_experiment:
    jobid: 17
    input: zoutput/benchmarks/pacs_benchmark/hyperparameters.csv
    output: zoutput/benchmarks/pacs_benchmark/rule_results/13.csv
    cluster_jobid: Submitted batch job 11936875

Error executing rule run_experiment on cluster (jobid: 17, external: Submitted batch job 11936875, jobscript: /home/aih/xudong.sun/domainlab4matchdg/.snakemake/tmp.q1xxt1ea/snakejob.run_experiment.17.sh). For error details see the cluster log and the log files of the involved rule(s).
Trying to restart job 17.
Select jobs to execute...

[Tue Jun  6 21:33:15 2023]
rule run_experiment:
    input: zoutput/benchmarks/pacs_benchmark/hyperparameters.csv
    output: zoutput/benchmarks/pacs_benchmark/rule_results/13.csv
    jobid: 17
    reason: Missing output files: zoutput/benchmarks/pacs_benchmark/rule_results/13.csv; Input files updated by another job: zoutput/benchmarks/pacs_benchmark/hyperparameters.csv
    wildcards: index=13
    resources: mem_mb=100000, mem_mib=95368, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, partition=gpu_p, qos=gpu, nvidia_gpu=1

Submitted job 17 with external jobid 'Submitted batch job 11949921'.
[Wed Jun  7 14:24:10 2023]
Finished job 7.
2 of 20 steps (10%) done
[Wed Jun  7 19:21:47 2023]
Finished job 6.
3 of 20 steps (15%) done
[Wed Jun  7 23:53:59 2023]
Finished job 5.
4 of 20 steps (20%) done
[Thu Jun  8 00:31:29 2023]
Finished job 12.
5 of 20 steps (25%) done
[Thu Jun  8 00:55:59 2023]
Finished job 15.
6 of 20 steps (30%) done
[Thu Jun  8 01:22:02 2023]
Finished job 19.
7 of 20 steps (35%) done
[Thu Jun  8 02:01:41 2023]
Finished job 10.
8 of 20 steps (40%) done
[Thu Jun  8 04:01:49 2023]
Finished job 11.
9 of 20 steps (45%) done
[Thu Jun  8 05:08:29 2023]
Finished job 9.
10 of 20 steps (50%) done
[Thu Jun  8 08:49:49 2023]
Finished job 8.
11 of 20 steps (55%) done
[Thu Jun  8 08:55:07 2023]
Finished job 18.
12 of 20 steps (60%) done
[Thu Jun  8 09:55:37 2023]
Finished job 13.
13 of 20 steps (65%) done
[Thu Jun  8 12:23:34 2023]
Finished job 3.
14 of 20 steps (70%) done
[Thu Jun  8 13:16:02 2023]
Finished job 14.
15 of 20 steps (75%) done
[Thu Jun  8 18:39:06 2023]
Finished job 17.
16 of 20 steps (80%) done
smilesun commented 1 year ago

In another run (from a different folder), a similar error occurs, while the last line of the .out file reports that early stopping was encountered.


Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=100000, mem_mib=95368, disk_mb=1000, disk_mib=954, nvidia_gpu=1
Select jobs to execute...

[Mon Jun  5 14:17:56 2023]
rule run_experiment:
    input: zoutput/benchmarks/pacs_benchmark_big_gamma/hyperparameters.csv
    output: zoutput/benchmarks/pacs_benchmark_big_gamma/rule_results/15.csv
    jobid: 0
    reason: Missing output files: zoutput/benchmarks/pacs_benchmark_big_gamma/rule_results/15.csv
    wildcards: index=15
    resources: mem_mb=100000, mem_mib=95368, disk_mb=1000, disk_mib=954, tmpdir=/tmp, partition=gpu_p, qos=gpu, nvidia_gpu=1

/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric `AUROC` will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
  warnings.warn(*args, **kwargs)
/home/aih/xudong.sun/DomainLab/domainlab/algos/trainers/train_visitor.py:31: UserWarning: hyper-parameter scheduler not set,                           going to use default Warmpup and epoch update
  warnings.warn("hyper-parameter scheduler not set, \
[Mon Jun  5 21:42:32 2023]
Error in rule run_experiment:
    jobid: 0
    input: zoutput/benchmarks/pacs_benchmark_big_gamma/hyperparameters.csv
    output: zoutput/benchmarks/pacs_benchmark_big_gamma/rule_results/15.csv

RuleException:
RuntimeError in file /home/aih/xudong.sun/DomainLab/domainlab/exp_protocol/benchmark.smk, line 121:
Error(s) in loading state_dict for ModelHDUVA:
    size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_loc.0.weight: copying a param with shape torch.Size([64, 4096]) from checkpoint, the shape in current model is torch.Size([32, 4096]).
    size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_loc.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_scale.0.weight: copying a param with shape torch.Size([64, 4096]) from checkpoint, the shape in current model is torch.Size([32, 4096]).
    size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for encoder.net_infer_zy.net_fc_mean.0.weight: copying a param with shape torch.Size([64, 4096]) from checkpoint, the shape in current model is torch.Size([32, 4096]).
    size mismatch for encoder.net_infer_zy.net_fc_mean.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for encoder.net_infer_zy.net_fc_scale.0.weight: copying a param with shape torch.Size([64, 4096]) from checkpoint, the shape in current model is torch.Size([32, 4096]).
    size mismatch for encoder.net_infer_zy.net_fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for decoder.net_fc_z2flat_img.0.h.weight: copying a param with shape torch.Size([150528, 163]) from checkpoint, the shape in current model is torch.Size([150528, 99]).
    size mismatch for decoder.net_fc_z2flat_img.0.g.weight: copying a param with shape torch.Size([150528, 163]) from checkpoint, the shape in current model is torch.Size([150528, 99]).
    size mismatch for net_p_zy.net_linear_bn_relu.0.weight: copying a param with shape torch.Size([64, 7]) from checkpoint, the shape in current model is torch.Size([32, 7]).
    size mismatch for net_p_zy.net_linear_bn_relu.1.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zy.net_linear_bn_relu.1.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zy.net_linear_bn_relu.1.running_mean: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zy.net_linear_bn_relu.1.running_var: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zy.fc_loc.0.weight: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 32]).
    size mismatch for net_p_zy.fc_loc.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zy.fc_scale.0.weight: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 32]).
    size mismatch for net_p_zy.fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_classif_y.op_linear.weight: copying a param with shape torch.Size([7, 64]) from checkpoint, the shape in current model is torch.Size([7, 32]).
    size mismatch for net_p_zd.net_linear_bn_relu.0.weight: copying a param with shape torch.Size([64, 3]) from checkpoint, the shape in current model is torch.Size([32, 3]).
    size mismatch for net_p_zd.net_linear_bn_relu.1.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zd.net_linear_bn_relu.1.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zd.net_linear_bn_relu.1.running_mean: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zd.net_linear_bn_relu.1.running_var: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zd.fc_loc.0.weight: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 32]).
    size mismatch for net_p_zd.fc_loc.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zd.fc_scale.0.weight: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 32]).
    size mismatch for net_p_zd.fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
  File "/home/aih/xudong.sun/DomainLab/domainlab/exp_protocol/benchmark.smk", line 121, in __rule_run_experiment
  File "/home/aih/xudong.sun/DomainLab/domainlab/exp_protocol/run_experiment.py", line 109, in run_experiment
  File "/home/aih/xudong.sun/DomainLab/domainlab/compos/exp/exp_main.py", line 70, in execute
  File "/home/aih/xudong.sun/DomainLab/domainlab/algos/trainers/a_trainer.py", line 106, in post_tr
  File "/home/aih/xudong.sun/DomainLab/domainlab/algos/observers/c_obvisitor_cleanup.py", line 12, in after_all
  File "/home/aih/xudong.sun/DomainLab/domainlab/algos/observers/b_obvisitor.py", line 61, in after_all
  File "/home/aih/xudong.sun/DomainLab/domainlab/compos/exp/exp_utils.py", line 99, in load
  File "/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
  File "/home/aih/xudong.sun/anaconda3/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

while the Snakemake log reads:

[Mon Jun  5 21:42:52 2023]
Error in rule run_experiment:
    jobid: 19
    input: zoutput/benchmarks/pacs_benchmark_big_gamma/hyperparameters.csv
    output: zoutput/benchmarks/pacs_benchmark_big_gamma/rule_results/15.csv
    cluster_jobid: Submitted batch job 11886013

Error executing rule run_experiment on cluster (jobid: 19, external: Submitted batch job 11886013, jobscript: /home/aih/xudong.sun/DomainLab/.snakemake/tmp.zrmtxjva/snakejob.run_experiment.19.sh). For error details see the cluster log and the log files of the involved rule(s).    
Trying to restart job 19.
Select jobs to execute...

[Mon Jun  5 21:42:52 2023]
rule run_experiment:
    input: zoutput/benchmarks/pacs_benchmark_big_gamma/hyperparameters.csv
    output: zoutput/benchmarks/pacs_benchmark_big_gamma/rule_results/15.csv
    jobid: 19
    reason: Missing output files: zoutput/benchmarks/pacs_benchmark_big_gamma/rule_results/15.csv; Input files updated by another job: zoutput/benchmarks/pacs_benchmark_big_gamma/hyperparameters.csv                                                                                   
    wildcards: index=15
    resources: mem_mb=100000, mem_mib=95368, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, partition=gpu_p, qos=gpu, nvidia_gpu=1

Submitted job 19 with external jobid 'Submitted batch job 11896751'.
[Tue Jun  6 03:07:28 2023]
Finished job 9.
2 of 20 steps (10%) done
[Tue Jun  6 03:17:57 2023]
Finished job 3.
3 of 20 steps (15%) done
[Tue Jun  6 03:40:02 2023]
Finished job 14.
4 of 20 steps (20%) done
[Tue Jun  6 04:04:40 2023]
Finished job 6.
5 of 20 steps (25%) done
[Tue Jun  6 04:42:33 2023]
Finished job 10.
6 of 20 steps (30%) done
[Tue Jun  6 05:02:58 2023]
Finished job 16.
7 of 20 steps (35%) done
[Tue Jun  6 05:06:24 2023]
Finished job 8.
8 of 20 steps (40%) done
[Tue Jun  6 08:48:47 2023]
Finished job 5.
9 of 20 steps (45%) done
[Tue Jun  6 09:00:37 2023]
Finished job 7.
10 of 20 steps (50%) done
[Tue Jun  6 10:37:21 2023]
Finished job 13.
11 of 20 steps (55%) done
[Tue Jun  6 16:31:37 2023]
Finished job 12.
12 of 20 steps (60%) done
[Tue Jun  6 19:04:41 2023]
Finished job 15.
13 of 20 steps (65%) done
[Tue Jun  6 20:26:16 2023]
Finished job 11.
14 of 20 steps (70%) done
[Tue Jun  6 21:24:43 2023]
Finished job 18.
15 of 20 steps (75%) done
[Wed Jun  7 19:36:35 2023]
Finished job 19.
16 of 20 steps (80%) done
smilesun commented 1 year ago

[Tue Jun 13 20:31:15 2023]
Error in rule run_experiment:
    jobid: 0
    input: zoutput/benchmarks/pacs_benchmark/hyperparameters.csv
    output: zoutput/benchmarks/pacs_benchmark/rule_results/12.csv

RuleException:
RuntimeError in file /home/aih/xudong.sun/domainlab_correct_pacs_split/domainlab/exp_protocol/benchmark.smk, line 121:
Error(s) in loading state_dict for ModelHDUVA:
    size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_loc.0.weight: copying a param with shape torch.Size([64, 4096]) from checkpoint, the shape in current model is torch.Size([32, 4096]).
    size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_loc.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_scale.0.weight: copying a param with shape torch.Size([64, 4096]) from checkpoint, the shape in current model is torch.Size([32, 4096]).
    size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for encoder.net_infer_zx.fc_loc.0.weight: copying a param with shape torch.Size([32, 179776]) from checkpoint, the shape in current model is torch.Size([64, 179776]).
    size mismatch for encoder.net_infer_zx.fc_loc.0.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for encoder.net_infer_zx.fc_scale.0.weight: copying a param with shape torch.Size([32, 179776]) from checkpoint, the shape in current model is torch.Size([64, 179776]).
    size mismatch for encoder.net_infer_zx.fc_scale.0.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
    size mismatch for encoder.net_infer_zy.net_fc_mean.0.weight: copying a param with shape torch.Size([64, 2048]) from checkpoint, the shape in current model is torch.Size([32, 2048]).
    size mismatch for encoder.net_infer_zy.net_fc_mean.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for encoder.net_infer_zy.net_fc_scale.0.weight: copying a param with shape torch.Size([64, 2048]) from checkpoint, the shape in current model is torch.Size([32, 2048]).
    size mismatch for encoder.net_infer_zy.net_fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for decoder.net_fc_z2flat_img.0.h.weight: copying a param with shape torch.Size([150528, 163]) from checkpoint, the shape in current model is torch.Size([150528, 131]).
    size mismatch for decoder.net_fc_z2flat_img.0.g.weight: copying a param with shape torch.Size([150528, 163]) from checkpoint, the shape in current model is torch.Size([150528, 131]).
    size mismatch for net_p_zy.net_linear_bn_relu.0.weight: copying a param with shape torch.Size([64, 7]) from checkpoint, the shape in current model is torch.Size([32, 7]).
    size mismatch for net_p_zy.net_linear_bn_relu.1.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zy.net_linear_bn_relu.1.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zy.net_linear_bn_relu.1.running_mean: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zy.net_linear_bn_relu.1.running_var: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zy.fc_loc.0.weight: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 32]).
    size mismatch for net_p_zy.fc_loc.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zy.fc_scale.0.weight: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 32]).
    size mismatch for net_p_zy.fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_classif_y.op_linear.weight: copying a param with shape torch.Size([7, 64]) from checkpoint, the shape in current model is torch.Size([7, 32]).
    size mismatch for net_p_zd.net_linear_bn_relu.0.weight: copying a param with shape torch.Size([64, 3]) from checkpoint, the shape in current model is torch.Size([32, 3]).
    size mismatch for net_p_zd.net_linear_bn_relu.1.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zd.net_linear_bn_relu.1.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zd.net_linear_bn_relu.1.running_mean: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zd.net_linear_bn_relu.1.running_var: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zd.fc_loc.0.weight: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 32]).
    size mismatch for net_p_zd.fc_loc.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
    size mismatch for net_p_zd.fc_scale.0.weight: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 32]).
    size mismatch for net_p_zd.fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
  File "/home/aih/xudong.sun/domainlab_correct_pacs_split/domainlab/exp_protocol/benchmark.smk", line 121, in __rule_run_experiment
  File "/home/aih/xudong.sun/domainlab_correct_pacs_split/domainlab/exp_protocol/run_experiment.py", line 109, in run_experiment
  File "/home/aih/xudong.sun/domainlab_correct_pacs_split/domainlab/compos/exp/exp_main.py", line 70, in execute
  File "/home/aih/xudong.sun/domainlab_correct_pacs_split/domainlab/algos/trainers/a_trainer.py", line 106, in post_tr
  File "/home/aih/xudong.sun/domainlab_correct_pacs_split/domainlab/algos/observers/c_obvisitor_cleanup.py", line 12, in after_all
  File "/home/aih/xudong.sun/domainlab_correct_pacs_split/domainlab/algos/observers/b_obvisitor.py", line 71, in after_all
  File "/home/aih/xudong.sun/domainlab_correct_pacs_split/domainlab/compos/exp/exp_utils.py", line 99, in load
  File "/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
  File "/home/aih/xudong.sun/anaconda3/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

smilesun commented 12 months ago

Maybe this will fix it: https://github.com/marrlab/DomainLab/pull/468
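
For illustration, one possible direction for such a fix (a sketch under assumptions only, not necessarily what the linked PR implements) is to make the persisted model name depend on the hyperparameter configuration, e.g. by appending a short hash of it:

    # Sketch: append a deterministic short hash of the hyperparameter dict to the
    # file name so that jobs differing only in hyperparameters cannot collide.
    import hashlib
    import json

    def config_suffix(hyperparams: dict) -> str:
        blob = json.dumps(hyperparams, sort_keys=True).encode("utf-8")
        return hashlib.sha1(blob).hexdigest()[:8]

    base = "pathlist_te_sketch_hduva_bed330e4_2023md_06md_06_14_55_52_seed_0"
    print(base + "_" + config_suffix({"gamma_y": 192400, "zx_dim": 64, "zy_dim": 64, "zd_dim": 64}))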

smilesun commented 11 months ago

Merged into master:

Merge pull request https://github.com/marrlab/DomainLab/pull/468 from marrlab/xd_benchmark_fix_model_name3479847

@Car-la-F, could you test whether it still occurs?

smilesun commented 11 months ago

Fixed.