google / samurai

SAMURAI: Shape And Material from Unconstrained Real-world Arbitrary Image collections - NeurIPS2022
Apache License 2.0

Recurring error when running train_samurai.py | Batch size must be greater than zero. [Op:BatchDatasetV2] #4

Closed SherifGabr closed 1 year ago

SherifGabr commented 1 year ago
Rendering last datapoint
Traceback (most recent call last):
  File "/scratch/project_2007011/samurai/train_samurai.py", line 1442, in <module>
    main(args)
  File "/scratch/project_2007011/samurai/train_samurai.py", line 536, in main
    render_full_datapoint(
  File "/scratch/project_2007011/samurai/train_samurai.py", line 739, in render_full_datapoint
    fine_result = samurai.distributed_call(
  File "/scratch/project_2007011/samurai/models/samurai/samurai_model.py", line 361, in distributed_call
    tf.data.Dataset.from_tensor_slices(
  File "/projappl/project_2007011/miniconda3/envs/samurai/lib/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1714, in batch
    return BatchDataset(self, batch_size, drop_remainder, name=name)
  File "/projappl/project_2007011/miniconda3/envs/samurai/lib/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4897, in __init__
    variant_tensor = gen_dataset_ops.batch_dataset_v2(
  File "/projappl/project_2007011/miniconda3/envs/samurai/lib/python3.9/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 685, in batch_dataset_v2
    _ops.raise_from_not_ok_status(e, name)
  File "/projappl/project_2007011/miniconda3/envs/samurai/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 7186, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: Batch size must be greater than zero. [Op:BatchDatasetV2]

Encountered this error after running the following command:

python train_samurai.py --config configs/samurai/samurai.txt --datadir fire_engine/ --basedir ../fire_engine_train/ --expname exp1 --gpu=0

I tried other scenes (duck), but the same error persists.

SherifGabr commented 1 year ago

The issue is resolved now. It appears to be caused by line 372 in samurai_model.py:

.batch(chunk_size * get_num_gpus())

If the number of GPUs is 0, the batch size is 0, which produces the error. It turned out that $LD_LIBRARY_PATH was set incorrectly, so TensorFlow could not see any GPUs and the model did not use any for training.
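The failure mode above can be sketched with a small guard. This is a hypothetical helper, not code from the repo: `safe_batch_size` stands in for the expression `chunk_size * get_num_gpus()` on line 372, and clamps the GPU count so the batch size can never hit zero:

```python
def safe_batch_size(chunk_size: int, num_gpus: int) -> int:
    """Return a batch size that is never zero.

    When no GPU is visible (e.g. because LD_LIBRARY_PATH is broken),
    num_gpus is 0 and chunk_size * num_gpus would be 0, which
    BatchDatasetV2 rejects with "Batch size must be greater than zero."
    Falling back to a single replica keeps the pipeline valid.
    """
    return chunk_size * max(1, num_gpus)


# With no visible GPUs, the unguarded product would be 0;
# the guard falls back to one replica's worth of rays.
print(safe_batch_size(4096, 0))  # 4096, not 0
print(safe_batch_size(4096, 2))  # 8192
```

Of course, silently falling back to CPU may hide the real misconfiguration, so failing loudly when `num_gpus == 0` would be an equally reasonable design.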

monajalal commented 1 year ago

@SherifGabr The current version of the code still doesn't seem to be correct.

https://github.com/google/samurai/issues/9

SherifGabr commented 1 year ago

@monajalal Did you set LD_LIBRARY_PATH correctly? IIRC, after setting it, training worked with no issues. If that doesn't help, I would check the CUDA drivers. Also, I trained on a single GPU (an NVIDIA V100); could it be that you are training on multiple GPUs and it somehow fails there?
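A quick way to check whether this is the same GPU-visibility problem: ask TensorFlow what devices it can see. If the list below is empty, the CUDA libraries are not being found (often an LD_LIBRARY_PATH issue) and `get_num_gpus()` will return 0, reproducing the zero-batch-size error. The `try/except` is only so the snippet degrades gracefully where TensorFlow is not installed:

```python
try:
    import tensorflow as tf
    # An empty list here means TF cannot load the CUDA runtime,
    # even if nvidia-smi shows the GPUs.
    gpus = tf.config.list_physical_devices("GPU")
    print("Visible GPUs:", gpus)
except ImportError:
    gpus = []
    print("TensorFlow is not installed in this environment.")
```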