google-deepmind / tapnet

Tracking Any Point (TAP)
https://deepmind-tapir.github.io/blogpost.html
Apache License 2.0
1.31k stars 124 forks

Error when evaluating on Kubric dataset #14

Closed (arjunvb closed 1 year ago)

arjunvb commented 1 year ago

I'm trying to evaluate TAP-Net on the Kubric dataset, and I'm getting the error shown below. I am running the following command:

python3 ./tapnet/experiment.py --config=./tapnet/configs/tapnet_config.py --jaxline_mode=eval_kubric --config.checkpoint_dir=/data3/tap/tap/tapnet_checkpoint/

Any idea how to fix this? Thanks!

I0228 19:56:23.706737 140398657533120 train.py:152] Evaluating with config:
best_model_eval_metric: ''
best_model_eval_metric_higher_is_better: true
checkpoint_dir: /data3/tap//tap/tapnet_checkpoint/
checkpoint_interval_type: null
dataset_names: &id001 !!python/tuple
- kubric
eval_initial_weights: true
eval_modes: &id002 !!python/tuple
- eval_davis_points
- eval_jhmdb
- eval_robotics_points
- eval_kinetics_points
evaluate_every: 10000
experiment_kwargs:
  config:
    checkpoint_dir: /data3/tap//tap/tapnet_checkpoint/
    datasets:
      dataset_names: *id001
      kubric_kwargs:
        batch_dims: 8
        shuffle_buffer_size: 128
        train_size: !!python/tuple
        - 256
        - 256
    davis_points_path: ''
    eval_modes: *id002
    evaluate_every: 10000
    fast_variables: !!python/tuple []
    inference:
      input_video_path: ''
      num_points: 20
      output_video_path: ''
      resize_height: 256
      resize_width: 256
    jhmdb_path: ''
    optimizer:
      adam_kwargs:
        b1: 0.9
        b2: 0.95
        eps: 1.0e-08
      base_lr: 0.002
      cosine_decay_kwargs:
        end_value: 0.0
        init_value: 0.0
        warmup_steps: 5000
      max_norm: -1
      optimizer: adam
      schedule_type: cosine
      weight_decay: 0.01
    robotics_points_path: ''
    save_final_checkpoint_as_npy: true
    shared_modules:
      shared_module_names: &id003 !!python/tuple
      - tapnet_model
      tapnet_model_kwargs: {}
    supervised_point_prediction_kwargs:
      prediction_algo: cost_volume_regressor
    sweep_name: default_sweep
    training:
      n_training_steps: 100000
interval_type: secs
log_all_train_data: false
log_tensors_interval: 60
log_train_data_interval: 120.0
logging_interval_type: null
max_checkpoints_to_keep: 5
one_off_evaluate: false
random_mode_eval: same_host_same_device
random_mode_train: unique_host_unique_device
random_seed: 42
save_checkpoint_interval: 10
shared_module_names: *id003
train_checkpoint_all_hosts: false
training_steps: 100000

I0228 19:56:23.755014 140398657533120 xla_bridge.py:173] Remote TPU is not linked into jax; skipping remote TPU.
I0228 19:56:23.755347 140398657533120 xla_bridge.py:357] Unable to initialize backend 'tpu_driver': Could not initialize backend 'tpu_driver'
I0228 19:56:24.424445 140398657533120 xla_bridge.py:357] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Interpreter Host CUDA
I0228 19:56:24.425570 140398657533120 xla_bridge.py:357] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
I0228 19:56:28.558446 140398657533120 supervised_point_prediction.py:979] Saving videos to /data3/tap//tap/tapnet_checkpoint/eval_kubric/0
I0228 19:56:28.567507 140398657533120 dataset_info.py:565] Load dataset info from /data3/tap/kubric/movi_e/256x256/1.0.0
W0228 19:56:28.572742 140398657533120 dtype_utils.py:43] You use TensorFlow DType <dtype: 'uint8'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to uint8.
W0228 19:56:28.574135 140398657533120 dtype_utils.py:43] You use TensorFlow DType <dtype: 'uint16'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to uint16.
I0228 19:56:28.620231 140398657533120 dataset_info.py:654] Fields info.[splits] from disk and from code do not match. Keeping the one from code.
I0228 19:56:28.620935 140398657533120 dataset_builder.py:522] Reusing dataset movi_e (/data3/tap/kubric/movi_e/256x256/1.0.0)
W0228 19:56:28.622349 140398657533120 feature.py:64] `TensorInfo.dtype` is deprecated. Please change your code to use NumPy with the field `TensorInfo.np_dtype` or use TensorFlow with the field `TensorInfo.tf_dtype`.
W0228 19:56:28.622643 140398657533120 dtype_utils.py:43] You use TensorFlow DType <dtype: 'float32'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to float32.
W0228 19:56:28.622788 140398657533120 feature.py:64] `TensorInfo.dtype` is deprecated. Please change your code to use NumPy with the field `TensorInfo.np_dtype` or use TensorFlow with the field `TensorInfo.tf_dtype`.
W0228 19:56:28.622916 140398657533120 feature.py:64] `TensorInfo.dtype` is deprecated. Please change your code to use NumPy with the field `TensorInfo.np_dtype` or use TensorFlow with the field `TensorInfo.tf_dtype`.
W0228 19:56:28.623071 140398657533120 dtype_utils.py:43] You use TensorFlow DType <dtype: 'int32'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to int32.
W0228 19:56:28.623173 140398657533120 feature.py:64] `TensorInfo.dtype` is deprecated. Please change your code to use NumPy with the field `TensorInfo.np_dtype` or use TensorFlow with the field `TensorInfo.tf_dtype`.
W0228 19:56:28.623292 140398657533120 feature.py:64] `TensorInfo.dtype` is deprecated. Please change your code to use NumPy with the field `TensorInfo.np_dtype` or use TensorFlow with the field `TensorInfo.tf_dtype`.
W0228 19:56:28.623408 140398657533120 feature.py:64] `TensorInfo.dtype` is deprecated. Please change your code to use NumPy with the field `TensorInfo.np_dtype` or use TensorFlow with the field `TensorInfo.tf_dtype`.
W0228 19:56:28.623907 140398657533120 feature.py:64] `TensorInfo.dtype` is deprecated. Please change your code to use NumPy with the field `TensorInfo.np_dtype` or use TensorFlow with the field `TensorInfo.tf_dtype`.
W0228 19:56:28.624111 140398657533120 dtype_utils.py:43] You use TensorFlow DType <dtype: 'string'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to object.
W0228 19:56:28.624262 140398657533120 feature.py:64] `TensorInfo.dtype` is deprecated. Please change your code to use NumPy with the field `TensorInfo.np_dtype` or use TensorFlow with the field `TensorInfo.tf_dtype`.
W0228 19:56:28.624418 140398657533120 feature.py:64] `TensorInfo.dtype` is deprecated. Please change your code to use NumPy with the field `TensorInfo.np_dtype` or use TensorFlow with the field `TensorInfo.tf_dtype`.
W0228 19:56:28.624617 140398657533120 feature.py:64] `TensorInfo.dtype` is deprecated. Please change your code to use NumPy with the field `TensorInfo.np_dtype` or use TensorFlow with the field `TensorInfo.tf_dtype`.
W0228 19:56:28.625051 140398657533120 dtype_utils.py:43] You use TensorFlow DType <dtype: 'int64'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to int64.
W0228 19:56:28.625315 140398657533120 dtype_utils.py:43] You use TensorFlow DType <dtype: 'bool'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to bool.
I0228 19:56:29.712036 140398657533120 logging_logger.py:49] Constructing tf.data.Dataset movi_e for split None, from /data3/tap/kubric/movi_e/256x256/1.0.0
W0228 19:56:32.420510 140398657533120 deprecation.py:337] From /data/anaconda3/envs/tapnet/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
W0228 19:56:39.477390 140398657533120 deprecation.py:541] From /data/anaconda3/envs/tapnet/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: calling crop_and_resize_v1 (from tensorflow.python.ops.image_ops_impl) with box_ind is deprecated and will be removed in a future version.
Instructions for updating:
box_ind is deprecated, use box_indices instead
2023-02-28 19:56:44.292071: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:828] shape_optimizer failed: INVALID_ARGUMENT: Subshape must have computed start >= end since stride is negative, but is 1 and 3 (computed from start 1 and end 9223372036854775807 over shape with rank 3 and stride-1)
2023-02-28 19:56:45.191140: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:828] shape_optimizer failed: INVALID_ARGUMENT: Subshape must have computed start >= end since stride is negative, but is 1 and 3 (computed from start 1 and end 9223372036854775807 over shape with rank 3 and stride-1)
Traceback (most recent call last):
  File "./tapnet/experiment.py", line 429, in <module>
    app.run(main)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "./tapnet/experiment.py", line 421, in main
    platform.main(
  File "/data/anaconda3/envs/tapnet/lib/python3.8/site-packages/jaxline/utils.py", line 484, in inner_wrapper
    return f(*args, **kwargs)
  File "/data/anaconda3/envs/tapnet/lib/python3.8/site-packages/jaxline/platform.py", line 137, in main
    train.evaluate(experiment_class, config, checkpointer, writer,
  File "/data/anaconda3/envs/tapnet/lib/python3.8/site-packages/jaxline/utils.py", line 620, in inner_wrapper
    return fn(*args, **kwargs)
  File "/data/anaconda3/envs/tapnet/lib/python3.8/site-packages/jaxline/train.py", line 225, in evaluate
    scalar_values = utils.evaluate_should_return_dict(experiment.evaluate)(
  File "/data/anaconda3/envs/tapnet/lib/python3.8/site-packages/jaxline/utils.py", line 521, in evaluate_with_warning
    evaluate_out = f(*args, **kwargs)
  File "./tapnet/experiment.py", line 404, in evaluate
    eval_scalars = point_prediction_task.evaluate(
  File "/home/ubuntu/contractive/tapnet/supervised_point_prediction.py", line 514, in evaluate
    self._eval_epoch(
  File "/home/ubuntu/contractive/tapnet/supervised_point_prediction.py", line 1005, in _eval_epoch
    scalars, viz = eval_batch_fn(params, state, inputs, rng)
  File "/home/ubuntu/contractive/tapnet/supervised_point_prediction.py", line 766, in _eval_batch
    occlusion_logits, tracks, loss_scalars = self._infer_batch(
  File "/home/ubuntu/contractive/tapnet/supervised_point_prediction.py", line 577, in _infer_batch
    output, _ = functools.partial(
  File "/data/anaconda3/envs/tapnet/lib/python3.8/site-packages/haiku/_src/transform.py", line 357, in apply_fn
    out = f(*args, **kwargs)
  File "./tapnet/experiment.py", line 122, in forward
    return self.point_prediction.forward_fn(
  File "/home/ubuntu/contractive/tapnet/supervised_point_prediction.py", line 313, in forward_fn
    return shared_modules['tapnet_model'](
  File "/data/anaconda3/envs/tapnet/lib/python3.8/site-packages/haiku/_src/module.py", line 426, in wrapped
    out = f(*args, **kwargs)
  File "/data/anaconda3/envs/tapnet/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/data/anaconda3/envs/tapnet/lib/python3.8/site-packages/haiku/_src/module.py", line 272, in run_interceptors
    return bound_method(*args, **kwargs)
  File "/home/ubuntu/contractive/tapnet/tapnet_model.py", line 341, in __call__
    latent = self.tsm_resnet(
  File "/data/anaconda3/envs/tapnet/lib/python3.8/site-packages/haiku/_src/module.py", line 426, in wrapped
    out = f(*args, **kwargs)
  File "/data/anaconda3/envs/tapnet/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/data/anaconda3/envs/tapnet/lib/python3.8/site-packages/haiku/_src/module.py", line 272, in run_interceptors
    return bound_method(*args, **kwargs)
  File "/home/ubuntu/contractive/tapnet/models/tsm_resnet.py", line 383, in __call__
    net = hk.Conv2D(
  File "/data/anaconda3/envs/tapnet/lib/python3.8/site-packages/haiku/_src/module.py", line 426, in wrapped
    out = f(*args, **kwargs)
  File "/data/anaconda3/envs/tapnet/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/data/anaconda3/envs/tapnet/lib/python3.8/site-packages/haiku/_src/module.py", line 272, in run_interceptors
    return bound_method(*args, **kwargs)
  File "/data/anaconda3/envs/tapnet/lib/python3.8/site-packages/haiku/_src/conv.py", line 200, in __call__
    w = hk.get_parameter("w", w_shape, inputs.dtype, init=w_init)
  File "/data/anaconda3/envs/tapnet/lib/python3.8/site-packages/haiku/_src/base.py", line 448, in wrapped
    return wrapped._current(*args, **kwargs)
  File "/data/anaconda3/envs/tapnet/lib/python3.8/site-packages/haiku/_src/base.py", line 524, in get_parameter
    raise ValueError(
ValueError: Unable to retrieve parameter 'w' for module 'tap_net/~/tsm_resnet_video/tsm_resnet_stem' All parameters must be created as part of `init`.

yangyi02 commented 1 year ago

Could you verify that /data3/tap/tap/tapnet_checkpoint/checkpoint.npy exists? Note that the file name has to be checkpoint.npy.
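
A quick way to run this check before launching the evaluation is sketched below. It is not part of the tapnet codebase: the helper name `check_checkpoint` is hypothetical, and it only assumes what is stated above, namely that the directory passed as `--config.checkpoint_dir` must directly contain a file named `checkpoint.npy`.

```python
import os


def check_checkpoint(checkpoint_dir, filename="checkpoint.npy"):
    """Return the normalized checkpoint path, or raise if the file is missing.

    Hypothetical helper for illustration only. The default file name follows
    the requirement stated in this thread.
    """
    # normpath collapses repeated separators such as "tap//tap".
    path = os.path.normpath(os.path.join(checkpoint_dir, filename))
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Expected checkpoint file not found: {path}")
    return path


# Example call, using the directory from the command in this issue:
# check_checkpoint("/data3/tap/tap/tapnet_checkpoint/")
```

If the helper raises `FileNotFoundError`, the checkpoint directory or file name is the first thing to fix before looking deeper into the Haiku traceback.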

arjunvb commented 1 year ago

Yes, the path was wrong. Thanks for catching that!