google-research / meta-dataset

A dataset of datasets for learning to learn from few examples
Apache License 2.0

Weird TypeError when running meta_dataset.train with the meta_dataset/learn/gin/default/baseline_cosine_imagenet.gin configuration file #66

Open brenowca opened 3 years ago

brenowca commented 3 years ago

Hi, I am trying to run the baseline cosine method on ImageNet, but I got the following mysterious error:

`TypeError: 'int' object is not subscriptable
  In call to configurable 'four_layer_convnet' (<function four_layer_convnet at 0x7f55794b99d0>)
  In call to configurable 'Trainer' (<class 'meta_dataset.trainer.Trainer'>)`

I downloaded the ImageNet dataset and converted it to records as described in this instruction file.

Could you please help me find what is going wrong?

My script call:

python -m meta_dataset.train  \
    --train_checkpoint_dir=brenow/bench \
    --summary_dir=brenow/bench \
    --records_root_dir=brenow/multipletasklearning/datasets/meta_dataset/records/ \
    --alsologtostderr --gin_config=meta_dataset/learn/gin/default/baseline_cosine_imagenet.gin \
    --gin_bindings="Trainer.experiment_name='baseline_cosine_imagenet'"

The entire error log:

2021-05-11 18:14:11.298088: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:From brenow/multipletasklearning/meta-dataset/meta_dataset/models/experimental/reparameterizable_backbones.py:39: The name tf.keras.initializers.he_normal is deprecated. Please use tf.compat.v1.keras.initializers.he_normal instead.

I0511 18:14:24.676703 140008958826304 trainer.py:893] Adding dataset ilsvrc_2012
I0511 18:14:24.677727 140008958826304 trainer.py:918] Episodes for split valid will be created from ['ilsvrc_2012']
I0511 18:14:24.677808 140008958826304 trainer.py:918] Episodes for split train will be created from ['ilsvrc_2012']
I0511 18:14:33.676376 140008958826304 api.py:461] batch augmentations:
I0511 18:14:34.871964 140008958826304 api.py:461] enable_jitter: True
I0511 18:14:34.879687 140008958826304 api.py:461] jitter_amount: 0
I0511 18:14:34.887233 140008958826304 api.py:461] enable_gaussian_noise: True
I0511 18:14:34.895139 140008958826304 api.py:461] gaussian_noise_std: 0.0
WARNING:tensorflow:From brenow/multipletasklearning/meta-dataset/meta_dataset/data/pipeline.py:355: calling map_fn (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
W0511 18:14:34.901691 140008958826304 deprecation.py:531] From brenow/multipletasklearning/meta-dataset/meta_dataset/data/pipeline.py:355: calling map_fn (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
2021-05-11 18:14:35.180736: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-11 18:14:35.184715: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-11 18:14:35.236975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:1b:00.0 name: Tesla K40m computeCapability: 3.5
coreClock: 0.745GHz coreCount: 15 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 268.58GiB/s
2021-05-11 18:14:35.237754: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: 
pciBusID: 0000:86:00.0 name: Tesla K40m computeCapability: 3.5
coreClock: 0.745GHz coreCount: 15 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 268.58GiB/s
2021-05-11 18:14:35.237800: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-11 18:14:35.242303: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-05-11 18:14:35.242358: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-05-11 18:14:35.243427: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-05-11 18:14:35.244122: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-05-11 18:14:35.244313: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ibm/lsfsuite/ext/ppm/10.2/linux2.6-glibc2.3-x86_64/lib:/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/lib
2021-05-11 18:14:35.245100: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-05-11 18:14:35.245258: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ibm/lsfsuite/ext/ppm/10.2/linux2.6-glibc2.3-x86_64/lib:/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/lib
2021-05-11 18:14:35.245285: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-05-11 18:14:35.246996: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-11 18:14:35.247044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-11 18:14:35.247056: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      
WARNING:tensorflow:From brenow/miniconda3/envs/multipletasklearning/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py:2560: calling DatasetV2.from_generator (from tensorflow.python.data.ops.dataset_ops) with output_types is deprecated and will be removed in a future version.
Instructions for updating:
Use output_signature instead
W0511 18:14:37.684212 140008958826304 deprecation.py:531] From brenow/miniconda3/envs/multipletasklearning/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py:2560: calling DatasetV2.from_generator (from tensorflow.python.data.ops.dataset_ops) with output_types is deprecated and will be removed in a future version.
Instructions for updating:
Use output_signature instead
WARNING:tensorflow:From brenow/miniconda3/envs/multipletasklearning/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py:2560: calling DatasetV2.from_generator (from tensorflow.python.data.ops.dataset_ops) with output_shapes is deprecated and will be removed in a future version.
Instructions for updating:
Use output_signature instead
W0511 18:14:37.684439 140008958826304 deprecation.py:531] From brenow/miniconda3/envs/multipletasklearning/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py:2560: calling DatasetV2.from_generator (from tensorflow.python.data.ops.dataset_ops) with output_shapes is deprecated and will be removed in a future version.
Instructions for updating:
Use output_signature instead
I0511 18:14:37.938055 140008958826304 api.py:461] support augmentations:
I0511 18:14:37.945658 140008958826304 api.py:461] enable_jitter: True
I0511 18:14:37.953278 140008958826304 api.py:461] jitter_amount: 0
I0511 18:14:37.961421 140008958826304 api.py:461] enable_gaussian_noise: True
I0511 18:14:37.969164 140008958826304 api.py:461] gaussian_noise_std: 0.0
I0511 18:14:37.977154 140008958826304 api.py:461] query augmentations:
I0511 18:14:37.984944 140008958826304 api.py:461] enable_jitter: False
I0511 18:14:37.992632 140008958826304 api.py:461] jitter_amount: 0
I0511 18:14:38.000089 140008958826304 api.py:461] enable_gaussian_noise: False
I0511 18:14:38.007490 140008958826304 api.py:461] gaussian_noise_std: 0.0
Traceback (most recent call last):
  File "brenow/miniconda3/envs/multipletasklearning/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "brenow/miniconda3/envs/multipletasklearning/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "brenow/multipletasklearning/meta-dataset/meta_dataset/train.py", line 273, in <module>
    app.run(program)
  File "brenow/miniconda3/envs/multipletasklearning/lib/python3.8/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "brenow/miniconda3/envs/multipletasklearning/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "brenow/multipletasklearning/meta-dataset/meta_dataset/train.py", line 210, in main
    trainer_instance = trainer.Trainer(
  File "brenow/miniconda3/envs/multipletasklearning/lib/python3.8/site-packages/gin/config.py", line 1069, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "brenow/miniconda3/envs/multipletasklearning/lib/python3.8/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "brenow/miniconda3/envs/multipletasklearning/lib/python3.8/site-packages/gin/config.py", line 1046, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "brenow/multipletasklearning/meta-dataset/meta_dataset/trainer.py", line 562, in __init__
    output = self.run_fns[split](data_tensors)
  File "brenow/multipletasklearning/meta-dataset/meta_dataset/trainer.py", line 1325, in run_fn_with_train_op
    res = run_fn(data)
  File "brenow/multipletasklearning/meta-dataset/meta_dataset/trainer.py", line 695, in run
    predictions_dist = self.learners[split].forward_pass(data_local)
  File "brenow/multipletasklearning/meta-dataset/meta_dataset/learners/baseline_learners.py", line 73, in forward_pass
    embeddings_params_moments = self.embedding_fn(images, self.is_training)
  File "brenow/miniconda3/envs/multipletasklearning/lib/python3.8/site-packages/gin/config.py", line 1069, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "brenow/miniconda3/envs/multipletasklearning/lib/python3.8/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "brenow/miniconda3/envs/multipletasklearning/lib/python3.8/site-packages/gin/config.py", line 1046, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "brenow/multipletasklearning/meta-dataset/meta_dataset/models/functional_backbones.py", line 953, in four_layer_convnet
    return _four_layer_convnet(
  File "brenow/multipletasklearning/meta-dataset/meta_dataset/models/functional_backbones.py", line 880, in _four_layer_convnet
    layer, conv_bn_params, conv_bn_moments = conv_bn(
  File "brenow/multipletasklearning/meta-dataset/meta_dataset/models/functional_backbones.py", line 369, in conv_bn
    depth[0],
TypeError: 'int' object is not subscriptable
  In call to configurable 'four_layer_convnet' (<function four_layer_convnet at 0x7f55794b99d0>)
  In call to configurable 'Trainer' (<class 'meta_dataset.trainer.Trainer'>)
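
The last frame of the traceback fails on `depth[0]`, which suggests that an integer reached a place where a sequence was expected. The snippet below is only a minimal sketch of that failure mode, not the code from `functional_backbones.py`: `first_depth` and `as_sequence` are hypothetical helpers, and the value 64 is an arbitrary illustration.

```python
# Hypothetical stand-ins for illustration only; these are NOT the real
# conv_bn / four_layer_convnet functions from meta_dataset.


def first_depth(depth):
    """Mirror the failing `depth[0]` access from the traceback."""
    return depth[0]


def as_sequence(depth):
    """Wrap a bare int in a list so that indexing works either way."""
    return [depth] if isinstance(depth, int) else list(depth)


try:
    first_depth(64)  # an int where a sequence is expected
except TypeError as err:
    message = str(err)
    print(message)

# After normalising the argument, both call styles are accepted.
print(first_depth(as_sequence(64)))
print(first_depth(as_sequence([64, 128])))
```

If the cause is of this kind, checking what type the caller passes as `depth`, and whether the gin configuration and the code come from the same revision, may be a reasonable first step.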
brenowca commented 3 years ago

Note: I get the same error message whether I run on GPU or CPU.

brenowca commented 3 years ago

My $DATASRC/records/ilsvrc_2012/dataset_spec.json file matches most of the original ilsvrc_2012_dataset_spec.json file in the repository.

The only difference is the training split, which seems to have been generated at random:

image

brenowca commented 3 years ago

ilsvrc_json_diff_ilsvrc_original.txt This is the complete diff file, just for reference.
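
A comparison like this can also be reproduced in a few lines of Python. This is a rough sketch that only reports which top-level keys differ between two JSON documents. The helper names `load_spec` and `differing_keys` are made up for this example, and the keys in the toy dictionaries are placeholders, not the real `dataset_spec.json` schema.

```python
import json

_MISSING = object()  # sentinel for keys present in only one document


def load_spec(path):
    """Load a JSON file into a Python dict."""
    with open(path) as f:
        return json.load(f)


def differing_keys(spec_a, spec_b):
    """Return the sorted top-level keys whose values differ."""
    keys = set(spec_a) | set(spec_b)
    return sorted(
        k for k in keys
        if spec_a.get(k, _MISSING) != spec_b.get(k, _MISSING)
    )


# Toy data with placeholder keys (not the real spec layout).
generated = {"name": "ilsvrc_2012", "train_split": [1, 2, 3]}
original = {"name": "ilsvrc_2012", "train_split": [3, 1, 2]}

print(differing_keys(generated, original))
```

With real files, the two dictionaries would come from `load_spec(...)` called on the generated `dataset_spec.json` and on the copy shipped in the repository.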

Oh, I had the same issue when using the crosstransformer_simclreps_imagenet.gin configuration file.

brenowca commented 3 years ago

Hi team, just a quick update on this issue:

I checked out an older commit, 0c8a9bb, and this error disappeared.

This is the git checkout command that I ran: git checkout 0c8a9bb

I chose this commit because it was the last one that I knew for sure was able to run the CTX code, since another person was able to create a checkpoint for CTX using it. See issue #58 for the mentioned checkpoint.