modestyachts / imagenet-testbed

ImageNet Testbed, associated with the paper "Measuring Robustness to Natural Distribution Shifts in Image Classification."
https://modestyachts.github.io/imagenet-testbed/
MIT License

Some model checkpoints are expired or not available #14

Open vishaal27 opened 9 months ago

vishaal27 commented 9 months ago

Hey, I was running model evaluations on my own custom data-split for all models in the registry using:

python eval.py --gpus 0 --models <model> --eval-settings custom_dataset

where <model> ranges over all the models in the registry (python db.py --list-models-registry); I've sketched the full sweep loop below. However, for many of the models I see a pickling error because the checkpoint is not loaded correctly; see the stack trace after the sketch.
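The sweep loop, for reference (a rough sketch, assuming db.py and eval.py are invoked from the same directory and that db.py prints one model name per line):

import subprocess

# Sketch of the sweep: list every model in the registry, then evaluate
# each one on the custom split. Assumes `python db.py --list-models-registry`
# prints one model name per line.
models = subprocess.run(
    ['python', 'db.py', '--list-models-registry'],
    capture_output=True, text=True, check=True,
).stdout.split()

for model in models:
    # No check=True here, so one failing model doesn't stop the sweep.
    subprocess.run(
        ['python', 'eval.py', '--gpus', '0',
         '--models', model, '--eval-settings', 'custom_dataset'],
    )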

Traceback (most recent call last):
  File "/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "imagenet-testbed/src/inference.py", line 64, in main_worker
    model = py_model.generate_classifier(py_eval_setting)
  File "imagenet-testbed/src/models/model_base.py", line 76, in generate_classifier
    self.classifier = self.classifier_loader()
  File "imagenet-testbed/src/models/low_accuracy.py", line 100, in load_resnet
    load_model_state_dict(net, model_name)
  File "imagenet-testbed/src/mldb/utils.py", line 98, in load_model_state_dict
    state_dict = torch.load(bio, map_location=f'cpu')
  File "/lib/python3.8/site-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/lib/python3.8/site-packages/torch/serialization.py", line 1033, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
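For reference, this EOFError is exactly what torch.load raises when the downloaded checkpoint blob is empty, which suggests the server is returning zero bytes rather than a corrupted file. A minimal reproduction (at least with the torch version in the trace above):

import io

import torch

# Loading an empty byte stream reproduces the failure mode above:
# pickle finds nothing to read and raises "EOFError: Ran out of input".
bio = io.BytesIO(b'')
try:
    torch.load(bio, map_location='cpu')
except EOFError as e:
    print(e)  # -> Ran out of input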

This error happens for all of the low-resource models (resnet18_100k_x_epochs, resnet18_50k_x_epochs, etc.). To rule out an artefact of my own custom data split, I also tested on the imagenet-val setting and hit the same error. Are checkpoints for the low-resource models no longer available from the server?

Separately, another set of errors comes from checkpoints that still appear to be stored on the vasa endpoint, see:

botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://vasa.millennium.berkeley.edu:9000/robustness-eval/checkpoints/3NL5sQy84F9nefxVCVDzew_data.bytes"
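The connect timeout suggests the host itself is down, not just that this one object is missing; a quick probe (a sketch using requests, with the URL copied verbatim from the error above) times out the same way:

import requests

# Probe the old vasa endpoint with a short timeout; the URL is taken
# verbatim from the botocore error above.
url = ('https://vasa.millennium.berkeley.edu:9000/robustness-eval/'
       'checkpoints/3NL5sQy84F9nefxVCVDzew_data.bytes')
try:
    resp = requests.head(url, timeout=10)
    print(resp.status_code)
except requests.exceptions.RequestException as e:
    print(f'endpoint unreachable: {e}')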

Are some of the checkpoints not migrated fully yet?

Sorry for the verbose issue, but I hope we can get this resolved :)

rtaori commented 9 months ago

Hi @vishaal27, unfortunately some checkpoints are not online: they are on vasa and have not been migrated to the gcloud bucket yet. I'm not sure if/when they'll come online, since migrating them is not straightforward now that I've lost my Berkeley access :)

vishaal27 commented 9 months ago

Hey @rtaori, thanks for your blazingly fast response! Is there anyone else with access who would be able to check this?

rtaori commented 9 months ago

Potentially, let me check. But if you don't hear back within the next week, then there's probably no way to get these checkpoints :(

vishaal27 commented 9 months ago

Sure thanks for checking, really appreciate this :)