smilesun closed this issue 11 months ago.
While still in the Snakemake console, Snakemake stops making further job submissions.
Select jobs to execute...
10, Task2, hduva, 333, sketch, 2, "{'gamma_y': 192400, 'zx_dim': 64, 'zy_dim': 64, 'zd_dim': 64}", 0.59274995, 0.60296017, 0.52567023, 0.929962, 0.52843946, 0.8426663
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=100000, mem_mib=95368, disk_mb=1000, disk_mib=954, nvidia_gpu=1
Select jobs to execute...
reason: Missing output files: zoutput/benchmarks/pacs_benchmark/rule_results/10.csv; Input files updated by another job: zoutput/benchmarks/pacs_benchmark/hyperparameters.csv
wildcards: index=10
resources: mem_mb=100000, mem_mib=95368, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, partition=gpu_p, qos=gpu, nvidia_gpu=1
Submitted job 14 with external jobid 'Submitted batch job 11947037'.
[Tue Jun 6 21:33:15 2023]
Error in rule run_experiment:
jobid: 17
input: zoutput/benchmarks/pacs_benchmark/hyperparameters.csv
output: zoutput/benchmarks/pacs_benchmark/rule_results/13.csv
cluster_jobid: Submitted batch job 11936875
Error executing rule run_experiment on cluster (jobid: 17, external: Submitted batch job 11936875, jobscript: /home/aih/xudong.sun/domainlab4matchdg/.snakemake/tmp.q1xxt1ea/snakejob.run_experiment.17.sh). For error details see the cluster log and the log files of the involved rule(s).
Trying to restart job 17.
Select jobs to execute...
[Tue Jun 6 21:33:15 2023]
rule run_experiment:
input: zoutput/benchmarks/pacs_benchmark/hyperparameters.csv
output: zoutput/benchmarks/pacs_benchmark/rule_results/13.csv
jobid: 17
reason: Missing output files: zoutput/benchmarks/pacs_benchmark/rule_results/13.csv; Input files updated by another job: zoutput/benchmarks/pacs_benchmark/hyperparameters.csv
wildcards: index=13
resources: mem_mb=100000, mem_mib=95368, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, partition=gpu_p, qos=gpu, nvidia_gpu=1
Submitted job 17 with external jobid 'Submitted batch job 11949921'.
[Wed Jun 7 14:24:10 2023]
Finished job 7.
2 of 20 steps (10%) done
[Wed Jun 7 19:21:47 2023]
Finished job 6.
3 of 20 steps (15%) done
[Wed Jun 7 23:53:59 2023]
Finished job 5.
4 of 20 steps (20%) done
[Thu Jun 8 00:31:29 2023]
Finished job 12.
5 of 20 steps (25%) done
[Thu Jun 8 00:55:59 2023]
Finished job 15.
6 of 20 steps (30%) done
[Thu Jun 8 01:22:02 2023]
Finished job 19.
7 of 20 steps (35%) done
[Thu Jun 8 02:01:41 2023]
Finished job 10.
8 of 20 steps (40%) done
[Thu Jun 8 04:01:49 2023]
Finished job 11.
9 of 20 steps (45%) done
[Thu Jun 8 05:08:29 2023]
Finished job 9.
10 of 20 steps (50%) done
[Thu Jun 8 08:49:49 2023]
Finished job 8.
11 of 20 steps (55%) done
[Thu Jun 8 08:55:07 2023]
Finished job 18.
12 of 20 steps (60%) done
[Thu Jun 8 09:55:37 2023]
Finished job 13.
13 of 20 steps (65%) done
[Thu Jun 8 12:23:34 2023]
Finished job 3.
14 of 20 steps (70%) done
[Thu Jun 8 13:16:02 2023]
Finished job 14.
15 of 20 steps (75%) done
[Thu Jun 8 18:39:06 2023]
Finished job 17.
16 of 20 steps (80%) done
From another run (in a different folder), a similar error occurs, while the last line of the .out file reads that early stopping was encountered:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=100000, mem_mib=95368, disk_mb=1000, disk_mib=954, nvidia_gpu=1
Select jobs to execute...
[Mon Jun 5 14:17:56 2023]
rule run_experiment:
input: zoutput/benchmarks/pacs_benchmark_big_gamma/hyperparameters.csv
output: zoutput/benchmarks/pacs_benchmark_big_gamma/rule_results/15.csv
jobid: 0
reason: Missing output files: zoutput/benchmarks/pacs_benchmark_big_gamma/rule_results/15.csv
wildcards: index=15
resources: mem_mb=100000, mem_mib=95368, disk_mb=1000, disk_mib=954, tmpdir=/tmp, partition=gpu_p, qos=gpu, nvidia_gpu=1
/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric `AUROC` will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
warnings.warn(*args, **kwargs)
/home/aih/xudong.sun/DomainLab/domainlab/algos/trainers/train_visitor.py:31: UserWarning: hyper-parameter scheduler not set, going to use default Warmpup and epoch update
warnings.warn("hyper-parameter scheduler not set, \
[Mon Jun 5 21:42:32 2023]
Error in rule run_experiment:
jobid: 0
input: zoutput/benchmarks/pacs_benchmark_big_gamma/hyperparameters.csv
output: zoutput/benchmarks/pacs_benchmark_big_gamma/rule_results/15.csv
RuleException:
RuntimeError in file /home/aih/xudong.sun/DomainLab/domainlab/exp_protocol/benchmark.smk, line 121:
Error(s) in loading state_dict for ModelHDUVA:
size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_loc.0.weight: copying a param with shape torch.Size([64, 4096]) from checkpoint, the shape in current model is torch.Size([32, 4096]).
size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_loc.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_scale.0.weight: copying a param with shape torch.Size([64, 4096]) from checkpoint, the shape in current model is torch.Size([32, 4096]).
size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for encoder.net_infer_zy.net_fc_mean.0.weight: copying a param with shape torch.Size([64, 4096]) from checkpoint, the shape in current model is torch.Size([32, 4096]).
size mismatch for encoder.net_infer_zy.net_fc_mean.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for encoder.net_infer_zy.net_fc_scale.0.weight: copying a param with shape torch.Size([64, 4096]) from checkpoint, the shape in current model is torch.Size([32, 4096]).
size mismatch for encoder.net_infer_zy.net_fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for decoder.net_fc_z2flat_img.0.h.weight: copying a param with shape torch.Size([150528, 163]) from checkpoint, the shape in current model is torch.Size([150528, 99]).
size mismatch for decoder.net_fc_z2flat_img.0.g.weight: copying a param with shape torch.Size([150528, 163]) from checkpoint, the shape in current model is torch.Size([150528, 99]).
size mismatch for net_p_zy.net_linear_bn_relu.0.weight: copying a param with shape torch.Size([64, 7]) from checkpoint, the shape in current model is torch.Size([32, 7]).
size mismatch for net_p_zy.net_linear_bn_relu.1.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zy.net_linear_bn_relu.1.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zy.net_linear_bn_relu.1.running_mean: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zy.net_linear_bn_relu.1.running_var: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zy.fc_loc.0.weight: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 32]).
size mismatch for net_p_zy.fc_loc.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zy.fc_scale.0.weight: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 32]).
size mismatch for net_p_zy.fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_classif_y.op_linear.weight: copying a param with shape torch.Size([7, 64]) from checkpoint, the shape in current model is torch.Size([7, 32]).
size mismatch for net_p_zd.net_linear_bn_relu.0.weight: copying a param with shape torch.Size([64, 3]) from checkpoint, the shape in current model is torch.Size([32, 3]).
size mismatch for net_p_zd.net_linear_bn_relu.1.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zd.net_linear_bn_relu.1.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zd.net_linear_bn_relu.1.running_mean: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zd.net_linear_bn_relu.1.running_var: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zd.fc_loc.0.weight: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 32]).
size mismatch for net_p_zd.fc_loc.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zd.fc_scale.0.weight: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 32]).
size mismatch for net_p_zd.fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
File "/home/aih/xudong.sun/DomainLab/domainlab/exp_protocol/benchmark.smk", line 121, in __rule_run_experiment
File "/home/aih/xudong.sun/DomainLab/domainlab/exp_protocol/run_experiment.py", line 109, in run_experiment
File "/home/aih/xudong.sun/DomainLab/domainlab/compos/exp/exp_main.py", line 70, in execute
File "/home/aih/xudong.sun/DomainLab/domainlab/algos/trainers/a_trainer.py", line 106, in post_tr
File "/home/aih/xudong.sun/DomainLab/domainlab/algos/observers/c_obvisitor_cleanup.py", line 12, in after_all
File "/home/aih/xudong.sun/DomainLab/domainlab/algos/observers/b_obvisitor.py", line 61, in after_all
File "/home/aih/xudong.sun/DomainLab/domainlab/compos/exp/exp_utils.py", line 99, in load
File "/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
File "/home/aih/xudong.sun/anaconda3/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
while the Snakemake log reads:
[Mon Jun 5 21:42:52 2023]
Error in rule run_experiment:
jobid: 19
input: zoutput/benchmarks/pacs_benchmark_big_gamma/hyperparameters.csv
output: zoutput/benchmarks/pacs_benchmark_big_gamma/rule_results/15.csv
cluster_jobid: Submitted batch job 11886013
Error executing rule run_experiment on cluster (jobid: 19, external: Submitted batch job 11886013, jobscript: /home/aih/xudong.sun/DomainLab/.snakemake/tmp.zrmtxjva/snakejob.run_experiment.19.sh). For error details see the cluster log and the log files of the involved rule(s).
Trying to restart job 19.
Select jobs to execute...
[Mon Jun 5 21:42:52 2023]
rule run_experiment:
input: zoutput/benchmarks/pacs_benchmark_big_gamma/hyperparameters.csv
output: zoutput/benchmarks/pacs_benchmark_big_gamma/rule_results/15.csv
jobid: 19
reason: Missing output files: zoutput/benchmarks/pacs_benchmark_big_gamma/rule_results/15.csv; Input files updated by another job: zoutput/benchmarks/pacs_benchmark_big_gamma/hyperparameters.csv
wildcards: index=15
resources: mem_mb=100000, mem_mib=95368, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, partition=gpu_p, qos=gpu, nvidia_gpu=1
Submitted job 19 with external jobid 'Submitted batch job 11896751'.
[Tue Jun 6 03:07:28 2023]
Finished job 9.
2 of 20 steps (10%) done
[Tue Jun 6 03:17:57 2023]
Finished job 3.
3 of 20 steps (15%) done
[Tue Jun 6 03:40:02 2023]
Finished job 14.
4 of 20 steps (20%) done
[Tue Jun 6 04:04:40 2023]
Finished job 6.
5 of 20 steps (25%) done
[Tue Jun 6 04:42:33 2023]
Finished job 10.
6 of 20 steps (30%) done
[Tue Jun 6 05:02:58 2023]
Finished job 16.
7 of 20 steps (35%) done
[Tue Jun 6 05:06:24 2023]
Finished job 8.
8 of 20 steps (40%) done
[Tue Jun 6 08:48:47 2023]
Finished job 5.
9 of 20 steps (45%) done
[Tue Jun 6 09:00:37 2023]
Finished job 7.
10 of 20 steps (50%) done
[Tue Jun 6 10:37:21 2023]
Finished job 13.
11 of 20 steps (55%) done
[Tue Jun 6 16:31:37 2023]
Finished job 12.
12 of 20 steps (60%) done
[Tue Jun 6 19:04:41 2023]
Finished job 15.
13 of 20 steps (65%) done
[Tue Jun 6 20:26:16 2023]
Finished job 11.
14 of 20 steps (70%) done
[Tue Jun 6 21:24:43 2023]
Finished job 18.
15 of 20 steps (75%) done
[Wed Jun 7 19:36:35 2023]
Finished job 19.
16 of 20 steps (80%) done
[Tue Jun 13 20:31:15 2023]
Error in rule run_experiment:
jobid: 0
input: zoutput/benchmarks/pacs_benchmark/hyperparameters.csv
output: zoutput/benchmarks/pacs_benchmark/rule_results/12.csv
RuleException:
RuntimeError in file /home/aih/xudong.sun/domainlab_correct_pacs_split/domainlab/exp_protocol/benchmark.smk, line 121:
Error(s) in loading state_dict for ModelHDUVA:
size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_loc.0.weight: copying a param with shape torch.Size([64, 4096]) from checkpoint, the shape in current model is torch.Size([32, 4096]).
size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_loc.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_scale.0.weight: copying a param with shape torch.Size([64, 4096]) from checkpoint, the shape in current model is torch.Size([32, 4096]).
size mismatch for encoder.net_infer_zd_topic.imgtopic2zd.encoder_cat_topic_img_h2zd.fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for encoder.net_infer_zx.fc_loc.0.weight: copying a param with shape torch.Size([32, 179776]) from checkpoint, the shape in current model is torch.Size([64, 179776]).
size mismatch for encoder.net_infer_zx.fc_loc.0.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
size mismatch for encoder.net_infer_zx.fc_scale.0.weight: copying a param with shape torch.Size([32, 179776]) from checkpoint, the shape in current model is torch.Size([64, 179776]).
size mismatch for encoder.net_infer_zx.fc_scale.0.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([64]).
size mismatch for encoder.net_infer_zy.net_fc_mean.0.weight: copying a param with shape torch.Size([64, 2048]) from checkpoint, the shape in current model is torch.Size([32, 2048]).
size mismatch for encoder.net_infer_zy.net_fc_mean.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for encoder.net_infer_zy.net_fc_scale.0.weight: copying a param with shape torch.Size([64, 2048]) from checkpoint, the shape in current model is torch.Size([32, 2048]).
size mismatch for encoder.net_infer_zy.net_fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for decoder.net_fc_z2flat_img.0.h.weight: copying a param with shape torch.Size([150528, 163]) from checkpoint, the shape in current model is torch.Size([150528, 131]).
size mismatch for decoder.net_fc_z2flat_img.0.g.weight: copying a param with shape torch.Size([150528, 163]) from checkpoint, the shape in current model is torch.Size([150528, 131]).
size mismatch for net_p_zy.net_linear_bn_relu.0.weight: copying a param with shape torch.Size([64, 7]) from checkpoint, the shape in current model is torch.Size([32, 7]).
size mismatch for net_p_zy.net_linear_bn_relu.1.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zy.net_linear_bn_relu.1.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zy.net_linear_bn_relu.1.running_mean: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zy.net_linear_bn_relu.1.running_var: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zy.fc_loc.0.weight: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 32]).
size mismatch for net_p_zy.fc_loc.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zy.fc_scale.0.weight: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 32]).
size mismatch for net_p_zy.fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_classif_y.op_linear.weight: copying a param with shape torch.Size([7, 64]) from checkpoint, the shape in current model is torch.Size([7, 32]).
size mismatch for net_p_zd.net_linear_bn_relu.0.weight: copying a param with shape torch.Size([64, 3]) from checkpoint, the shape in current model is torch.Size([32, 3]).
size mismatch for net_p_zd.net_linear_bn_relu.1.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zd.net_linear_bn_relu.1.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zd.net_linear_bn_relu.1.running_mean: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zd.net_linear_bn_relu.1.running_var: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zd.fc_loc.0.weight: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 32]).
size mismatch for net_p_zd.fc_loc.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for net_p_zd.fc_scale.0.weight: copying a param with shape torch.Size([64, 64]) from checkpoint, the shape in current model is torch.Size([32, 32]).
size mismatch for net_p_zd.fc_scale.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
File "/home/aih/xudong.sun/domainlab_correct_pacs_split/domainlab/exp_protocol/benchmark.smk", line 121, in __rule_run_experiment
File "/home/aih/xudong.sun/domainlab_correct_pacs_split/domainlab/exp_protocol/run_experiment.py", line 109, in run_experiment
File "/home/aih/xudong.sun/domainlab_correct_pacs_split/domainlab/compos/exp/exp_main.py", line 70, in execute
File "/home/aih/xudong.sun/domainlab_correct_pacs_split/domainlab/algos/trainers/a_trainer.py", line 106, in post_tr
File "/home/aih/xudong.sun/domainlab_correct_pacs_split/domainlab/algos/observers/c_obvisitor_cleanup.py", line 12, in after_all
File "/home/aih/xudong.sun/domainlab_correct_pacs_split/domainlab/algos/observers/b_obvisitor.py", line 71, in after_all
File "/home/aih/xudong.sun/domainlab_correct_pacs_split/domainlab/compos/exp/exp_utils.py", line 99, in load
File "/home/aih/xudong.sun/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
File "/home/aih/xudong.sun/anaconda3/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Maybe this will fix it: https://github.com/marrlab/DomainLab/pull/468
merged into master
Merge pull request https://github.com/marrlab/DomainLab/pull/468 from marrlab/xd_benchmark_fix_model_name … 3479847
@Car-la-F, could you test if it still occurs?
fixed
Currently, the code is like this:
The error to report here is that in `after_all`, the model loaded from disk does not match the model in memory in size, for both diva and hduva; see below.

In the Slurm log file
run_experiment-index=13-11936875.out
the output displays no error and ends with early stopping, but the error file
run_experiment-index=13-11936875.err
contains the size-mismatch traceback shown above.
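The mismatch in `after_all` could be surfaced before `load_state_dict` raises by diffing the tensor shapes of the checkpoint against the live model. Below is a minimal diagnostic sketch, not part of DomainLab: `shape_mismatches` and the toy `FakeTensor` are hypothetical names, and the function only assumes each value exposes a `.shape` attribute, so it works equally on real `torch.Tensor` objects.

```python
def shape_mismatches(checkpoint_state, model_state):
    """Return (key, checkpoint_shape, model_shape) triples for every
    parameter whose size differs between the checkpoint and the live model."""
    diffs = []
    for key, tensor in model_state.items():
        if key in checkpoint_state:
            ckpt_shape = tuple(checkpoint_state[key].shape)
            live_shape = tuple(tensor.shape)
            if ckpt_shape != live_shape:
                diffs.append((key, ckpt_shape, live_shape))
    return diffs


class FakeTensor:
    """Stand-in for torch.Tensor; only the .shape attribute is used."""

    def __init__(self, *shape):
        self.shape = shape


if __name__ == "__main__":
    # Shapes taken from the traceback above: a checkpoint persisted by a job
    # with zy_dim=64 vs. a freshly constructed model with zy_dim=32.
    checkpoint = {"net_p_zy.fc_loc.0.weight": FakeTensor(64, 64)}
    model = {"net_p_zy.fc_loc.0.weight": FakeTensor(32, 32)}
    for key, ckpt, live in shape_mismatches(checkpoint, model):
        print(f"size mismatch for {key}: checkpoint {ckpt} vs model {live}")
```

With real modules one would pass `torch.load(path, map_location="cpu")` and `model.state_dict()`. A non-empty result here means the persisted file was written by a job with different hyperparameters (here `zy_dim`/`zd_dim`), i.e. two benchmark indices collided on the same model path, which is what PR 468 addresses by disambiguating the persisted model name.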