Error during executing evaluate.py

Nireil commented 3 years ago

When I run torchpack dist-run -np 1 python evaluate.py configs/semantic_kitti/default.yaml --name SemanticKITTI_val_SPVNAS@65GMACs, I get the following error messages:

[2021-08-30 17:23:50.180] /usr/local/anaconda3/envs/torch/bin/python evaluate.py configs/semantic_kitti/default.yaml --name SemanticKITTI_val_SPVNAS@65GMACs
[2021-08-30 17:23:50.181] Experiment started: "runs/run-98ebafa2-a0dc3bdc".
workers_per_gpu: 8
data:
  num_classes: 19
  ignore_label: 255
  training_size: 19132
train:
  seed: 1588147245
  deterministic: False
dataset:
  name: semantic_kitti
  root: /dataset/semantic-kitti
  num_points: 80000
  voxel_size: 0.05
num_epochs: 15
batch_size: 2
criterion:
  name: cross_entropy
  ignore_index: 255
optimizer:
  name: sgd
  lr: 0.24
  weight_decay: 0.0001
  momentum: 0.9
  nesterov: True
scheduler:
  name: cosine_warmup
Traceback (most recent call last):
  File "evaluate.py", line 130, in <module>
    main()
  File "evaluate.py", line 62, in main
    model = spvnas_specialized(args.name)
  File "/home/pjy/spvnas/model_zoo.py", line 51, in spvnas_specialized
    if torch.cuda.is_available() else 'cpu')['model']
  File "/usr/local/anaconda3/envs/torch/lib/python3.6/site-packages/torch/serialization.py", line 587, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/usr/local/anaconda3/envs/torch/lib/python3.6/site-packages/torch/serialization.py", line 242, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading zip archive: failed finding central directory
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[59182,1],0]
  Exit code:    1
--------------------------------------------------------------------------

However, I can run torchpack dist-run -np [num_of_gpus] python train.py configs/semantic_kitti/spvcnn/cr0p5.yaml successfully, and I got a best mIoU of 59.466 on one GTX 1080Ti GPU.

zhijian-liu commented 3 years ago

Based on https://github.com/pytorch/pytorch/issues/31620, it seems that the checkpoint file is corrupted. Could you please try loading the checkpoint directly with torch.load to see whether it works? Thanks!
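
A minimal sketch of that check, in case it helps. The "failed finding central directory" message usually means the zip container that torch.save writes is incomplete, for example after an interrupted download. The helper names (checkpoint_looks_intact, try_load) and the path argument are hypothetical and not part of spvnas; pass whatever path your downloaded checkpoint actually lives at. The first function only needs the standard library, so it also works outside the torch environment.

```python
import os
import zipfile


def checkpoint_looks_intact(path):
    """Return True if `path` is a complete, readable zip archive.

    Assumption: the checkpoint uses PyTorch's zip-based format. The traceback
    above goes through _open_zipfile_reader, which suggests it does. A legacy
    (non-zip) checkpoint would return False here even if it is loadable.
    """
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    # is_zipfile looks for the end-of-central-directory record, which is
    # missing when the file has been truncated.
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        return archive.testzip() is None


def try_load(path):
    """Load the checkpoint on the CPU, bypassing model_zoo.py entirely."""
    import torch  # imported lazily so the zip check runs without PyTorch

    return torch.load(path, map_location="cpu")
```

If checkpoint_looks_intact returns False, deleting the file and downloading it again is the first thing to try. If it returns True but try_load still fails, the problem is more likely in the file contents or in the installed PyTorch version than in a truncated download.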

zhijian-liu commented 3 years ago

I'm closing this issue due to inactivity. Please feel free to reopen it if the problem has not been resolved.

mit-han-lab / spvnas

Error during executing evaluate.py #74