google-deepmind / tapnet

Tracking Any Point (TAP)
https://deepmind-tapir.github.io/blogpost.html
Apache License 2.0

Evaluation on Kubric #6

Closed: HarryHsing closed this issue 1 year ago

HarryHsing commented 1 year ago

Thank you very much for your amazing work!

There's an error when evaluating on Kubric:

(tap) xingzhenghao@xingzhenghao-PC:~/PycharmProjects$ python ./tapnet/experiment.py --config ./tapnet/configs/tapnet_config.py --jaxline_mode=eval_kubric --config.checkpoint_dir=./tapnet/checkpoint/
I1218 16:02:29.348184 140603062212416 train.py:152] Evaluating with config:
best_model_eval_metric: ''
best_model_eval_metric_higher_is_better: true
checkpoint_dir: ./tapnet/checkpoint/
checkpoint_interval_type: null
dataset_names: &id001 !!python/tuple
- kubric
eval_initial_weights: true
eval_modes: &id002 !!python/tuple
- eval_davis_points
- eval_jhmdb
- eval_robotics_points
- eval_kinetics_points
evaluate_every: 10000
experiment_kwargs:
  config:
    checkpoint_dir: ./tapnet/checkpoint/
    datasets:
      dataset_names: *id001
      kubric_kwargs:
        batch_dims: 8
        shuffle_buffer_size: 128
        train_size: !!python/tuple
        - 256
        - 256
    davis_points_path: /home/xingzhenghao/PycharmProjects/datasets/tap/tapvid_davis/tapvid_davis.pkl
    eval_modes: *id002
    evaluate_every: 10000
    fast_variables: !!python/tuple []
    jhmdb_path: null
    optimizer:
      adam_kwargs:
        b1: 0.9
        b2: 0.95
        eps: 1.0e-08
      base_lr: 0.002
      cosine_decay_kwargs:
        end_value: 0.0
        init_value: 0.0
        warmup_steps: 5000
      max_norm: -1
      optimizer: adam
      schedule_type: cosine
      weight_decay: 0.01
    robotics_points_path: /home/xingzhenghao/PycharmProjects/datasets/tap/tapvid_rgb_stacking/tapvid_rgb_stacking.pkl
    save_final_checkpoint_as_npy: true
    shared_modules:
      shared_module_names: &id003 !!python/tuple
      - tapnet_model
      tapnet_model_kwargs: {}
    supervised_point_prediction_kwargs:
      prediction_algo: cost_volume_regressor
    sweep_name: default_sweep
    training:
      n_training_steps: 100000
interval_type: secs
log_all_train_data: false
log_tensors_interval: 60
log_train_data_interval: 120.0
logging_interval_type: null
max_checkpoints_to_keep: 5
one_off_evaluate: false
random_mode_eval: same_host_same_device
random_mode_train: unique_host_unique_device
random_seed: 42
save_checkpoint_interval: 10
shared_module_names: *id003
train_checkpoint_all_hosts: false
training_steps: 100000

I1218 16:02:29.355844 140603062212416 xla_bridge.py:353] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker: 
I1218 16:02:29.422299 140603062212416 xla_bridge.py:353] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Host Interpreter CUDA
I1218 16:02:29.422796 140603062212416 xla_bridge.py:353] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
I1218 16:02:29.422965 140603062212416 xla_bridge.py:353] Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.
I1218 16:02:29.972896 140603062212416 supervised_point_prediction.py:944] Saving videos to ./tapnet/checkpoint/eval_kubric/100000
2022-12-18 16:02:29.987263: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
I1218 16:02:32.725086 140603062212416 dataset_info.py:491] Load dataset info from gs://kubric-public/tfds/movi_e/256x256/1.0.0
I1218 16:02:35.881139 140603062212416 dataset_info.py:550] Field info.splits from disk and from code do not match. Keeping the one from code.
I1218 16:02:36.213931 140603062212416 dataset_builder.py:383] Reusing dataset movi_e (gs://kubric-public/tfds/movi_e/256x256/1.0.0)
I1218 16:02:36.214255 140603062212416 logging_logger.py:44] Constructing tf.data.Dataset movi_e for split None, from gs://kubric-public/tfds/movi_e/256x256/1.0.0
W1218 16:02:39.021307 140603062212416 deprecation.py:337] From /home/xingzhenghao/anaconda3/envs/tap/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
W1218 16:02:43.468899 140603062212416 deprecation.py:541] From /home/xingzhenghao/anaconda3/envs/tap/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:1082: calling crop_and_resize_v1 (from tensorflow.python.ops.image_ops_impl) with box_ind is deprecated and will be removed in a future version.
Instructions for updating:
box_ind is deprecated, use box_indices instead
2022-12-18 16:02:46.722060: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:828] shape_optimizer failed: INVALID_ARGUMENT: Subshape must have computed start >= end since stride is negative, but is 1 and 3 (computed from start 1 and end 9223372036854775807 over shape with rank 3 and stride-1)
2022-12-18 16:02:47.417004: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:828] shape_optimizer failed: INVALID_ARGUMENT: Subshape must have computed start >= end since stride is negative, but is 1 and 3 (computed from start 1 and end 9223372036854775807 over shape with rank 3 and stride-1)
Traceback (most recent call last):
  File "./tapnet/experiment.py", line 427, in <module>
    app.run(main)
  File "/home/xingzhenghao/anaconda3/envs/tap/lib/python3.8/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/xingzhenghao/anaconda3/envs/tap/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "./tapnet/experiment.py", line 420, in main
    platform.main(
  File "/home/xingzhenghao/anaconda3/envs/tap/lib/python3.8/site-packages/jaxline/utils.py", line 484, in inner_wrapper
    return f(*args, **kwargs)
  File "/home/xingzhenghao/anaconda3/envs/tap/lib/python3.8/site-packages/jaxline/platform.py", line 137, in main
    train.evaluate(experiment_class, config, checkpointer, writer,
  File "/home/xingzhenghao/anaconda3/envs/tap/lib/python3.8/site-packages/jaxline/utils.py", line 620, in inner_wrapper
    return fn(*args, **kwargs)
  File "/home/xingzhenghao/anaconda3/envs/tap/lib/python3.8/site-packages/jaxline/train.py", line 225, in evaluate
    scalar_values = utils.evaluate_should_return_dict(experiment.evaluate)(
  File "/home/xingzhenghao/anaconda3/envs/tap/lib/python3.8/site-packages/jaxline/utils.py", line 521, in evaluate_with_warning
    evaluate_out = f(*args, **kwargs)
  File "./tapnet/experiment.py", line 401, in evaluate
    eval_scalars = point_prediction_task.evaluate(
  File "/home/xingzhenghao/PycharmProjects/tapnet/supervised_point_prediction.py", line 495, in evaluate
    self._eval_epoch(
  File "/home/xingzhenghao/PycharmProjects/tapnet/supervised_point_prediction.py", line 968, in _eval_epoch
    for inputs in self._build_eval_input(mode):
  File "/home/xingzhenghao/PycharmProjects/tapnet/supervised_point_prediction.py", line 805, in _build_eval_input
    yield from evaluation_datasets.create_kubric_eval_dataset(mode)
  File "/home/xingzhenghao/PycharmProjects/tapnet/evaluation_datasets.py", line 463, in create_kubric_eval_dataset
    for data in np_ds:
  File "/home/xingzhenghao/anaconda3/envs/tap/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_utils.py", line 65, in _eager_dataset_iterator
    for elem in ds:
  File "/home/xingzhenghao/anaconda3/envs/tap/lib/python3.8/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 836, in __next__
    return self._next_internal()
  File "/home/xingzhenghao/anaconda3/envs/tap/lib/python3.8/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 819, in _next_internal
    ret = gen_dataset_ops.iterator_get_next(
  File "/home/xingzhenghao/anaconda3/envs/tap/lib/python3.8/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2923, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/home/xingzhenghao/anaconda3/envs/tap/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 7186, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes at component 0: expected [?,256,24] but got [1,39,24]. [Op:IteratorGetNext]
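
For context, the `InvalidArgumentError` above is the usual symptom of batching variable-shaped elements with `tf.data`: plain `batch()` requires every element of a component to have the same shape, so a video that yields only 39 tracks cannot be stacked where 256 are expected. Below is a minimal, self-contained sketch (dummy data only, not TAP-Net or Kubric code, assuming TensorFlow 2.x) that reproduces the same class of failure and shows how `padded_batch` avoids it:

```python
import tensorflow as tf

# Two "videos" with a different number of point tracks (256 vs. 39),
# mimicking the [?,256,24] vs. [1,39,24] mismatch in the log above.
def gen():
    yield tf.zeros([256, 24])
    yield tf.zeros([39, 24])

ds = tf.data.Dataset.from_generator(
    gen, output_signature=tf.TensorSpec(shape=[None, 24], dtype=tf.float32))

try:
    next(iter(ds.batch(2)))  # fails: cannot batch tensors with different shapes
except tf.errors.InvalidArgumentError as e:
    print(e.message)

# padded_batch pads the ragged track dimension instead of failing outright.
padded = next(iter(ds.padded_batch(2, padded_shapes=[256, 24])))
print(padded.shape)  # (2, 256, 24)
```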

Thanks a lot in advance for your support!

cdoersch commented 1 year ago

Thanks for flagging this. I have a PR open for Kubric which should fix this bug: https://github.com/google-research/kubric/pull/266

If you could, please comment there so Klaus knows to apply it. I'll bug him about it again this week. In the meantime, you should be able to just patch it directly.
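
For anyone hitting this before the upstream change is merged, one rough local workaround is to force every example to a fixed number of tracks before batching, subsampling when there are too many and resampling when there are too few. The sketch below is illustrative only: it is not the content of google-research/kubric#266, and the field names and shapes (`points`, `occluded`, `NUM_TRACKS`) are assumptions, not the real TAP-Vid/Kubric schema.

```python
import numpy as np

NUM_TRACKS = 256  # fixed track count the evaluation batch expects (assumption)

def fix_track_count(example, num_tracks=NUM_TRACKS, seed=0):
    """Subsample or pad (by resampling) point tracks to a fixed count.

    `example` is assumed to be a dict of NumPy arrays whose leading axis is
    the track dimension, e.g. {'points': [N, T, 2], 'occluded': [N, T]};
    the real field names may differ.
    """
    rng = np.random.default_rng(seed)
    n = next(iter(example.values())).shape[0]
    if n >= num_tracks:
        idx = rng.choice(n, size=num_tracks, replace=False)
    else:
        # Too few tracks (e.g. 39 instead of 256): repeat some at random.
        extra = rng.choice(n, size=num_tracks - n, replace=True)
        idx = np.concatenate([np.arange(n), extra])
    return {k: v[idx] for k, v in example.items()}
```

Applying something like this per example, before the examples are batched, keeps the batch shapes consistent; resampling existing tracks (rather than zero-padding) avoids introducing dummy tracks that would skew the metrics.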

HarryHsing commented 1 year ago

> Thanks for flagging this. I have a PR open for Kubric which should fix this bug: google-research/kubric#266
>
> If you could, please comment there so Klaus knows to apply it. I'll bug him about it again this week. In the meantime, you should be able to just patch it directly.

Thanks very much for your quick response! It looks like that could fix the bug; I will comment on the PR.

I also wonder why each evaluated test video has four parts with different tracking points. What are the differences between them?

Thanks!