Shape pre-trained stage error

tsuiiiii commented 10 months ago

I'm trying to run the code on a DTU scan without the MVS shape. I encountered a shape mismatch when enumerating the dataset. I followed the instructions for training vanilla NeRF and computing geometry buffers for the DTU scan.

```
2023-10-28 23:21:29.816339: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2023-10-28 23:21:29.863132: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:85:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2023-10-28 23:21:29.874349: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2023-10-28 23:21:30.084292: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2023-10-28 23:21:30.248526: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2023-10-28 23:21:30.293372: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2023-10-28 23:21:30.489140: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2023-10-28 23:21:30.528962: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2023-10-28 23:21:30.827244: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-10-28 23:21:30.828401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2023-10-28 23:21:30.829422: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2023-10-28 23:21:30.857546: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2400105000 Hz
2023-10-28 23:21:30.859352: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c210dd2680 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-10-28 23:21:30.859387: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2023-10-28 23:21:30.961714: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c210dd5100 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-10-28 23:21:30.961809: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2023-10-28 23:21:30.963299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:85:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2023-10-28 23:21:30.963357: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2023-10-28 23:21:30.963389: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2023-10-28 23:21:30.963417: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2023-10-28 23:21:30.963444: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2023-10-28 23:21:30.963470: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2023-10-28 23:21:30.963504: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2023-10-28 23:21:30.963531: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-10-28 23:21:30.964088: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2023-10-28 23:21:30.964137: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2023-10-28 23:21:30.967326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-10-28 23:21:30.967377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0
2023-10-28 23:21:30.967403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N
2023-10-28 23:21:30.968545: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14902 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 7.0)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1028 23:21:30.974194 47943187769408 mirrored_strategy.py:500] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
[util/io] Output directory already exisits: /project2/tsui/nerfactor/output/train/scan37_shape/lr1e-2
[util/io] Output directory wiped: /project2/tsui/nerfactor/output/train/scan37_shape/lr1e-2
[trainvali] For results, see: /project2/tsui/nerfactor/output/train/scan37_shape/lr1e-2
[datasets/nerf_shape] Number of 'train' views: 47
[datasets/nerf_shape] Number of 'vali' views: 2
[models/base] Trainable layers registered: ['net_normal_mlp_layer0', 'net_normal_mlp_layer1', 'net_normal_mlp_layer2', 'net_normal_mlp_layer3', 'net_normal_out_layer0', 'net_lvis_mlp_layer0', 'net_lvis_mlp_layer1', 'net_lvis_mlp_layer2', 'net_lvis_mlp_layer3', 'net_lvis_out_layer0']
[trainvali] Started from scratch

Training epochs: 0%| | 0/200 [00:00<?, ?it/s]
2023-10-28 23:21:35.393537: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: indices[1022] = [60, 497] does not index into param shape [256,341,512]
2023-10-28 23:21:35.393675: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: indices[1022] = [60, 497] does not index into param shape [256,341,3]
2023-10-28 23:21:35.393760: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: indices[1022] = [60, 497] does not index into param shape [256,341,3]
2023-10-28 23:21:35.393836: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: indices[1022] = [60, 497] does not index into param shape [256,341]
2023-10-28 23:21:35.393917: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: indices[1022] = [60, 497] does not index into param shape [256,341,3]
2023-10-28 23:21:35.393997: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: indices[1022] = [60, 497] does not index into param shape [256,341,3]

Training epochs: 0%| | 0/200 [00:00<?, ?it/s]
shape: <tensorflow.python.distribute.input_lib.DistributedDataset object at 0x2b9b1fe38518>
Traceback (most recent call last):
  File "/project2/tsui/nerfactor/code/nerfactor/trainvali.py", line 342, in <module>
    app.run(main)
  File "/home/tsui/project/anaconda3/envs/nerfactor/lib/python3.6/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/tsui/project/anaconda3/envs/nerfactor/lib/python3.6/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/project2/tsui/nerfactor/code/nerfactor/trainvali.py", line 179, in main
    for batch_i, batch in enumerate(datapipe_train):
  File "/home/tsui/project/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/distribute/input_lib.py", line 296, in next
    return self.get_next()
  File "/home/tsui/project/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/distribute/input_lib.py", line 328, in get_next
    global_has_value, replicas = _get_next_as_optional(self, self._strategy)
  File "/home/tsui/project/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/distribute/input_lib.py", line 192, in _get_next_as_optional
    iterator._iterators[i].get_next_as_list(new_name))  # pylint: disable=protected-access
  File "/home/tsui/project/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/distribute/input_lib.py", line 1132, in get_next_as_list
    data_list = self._iterator.get_next_as_optional()
  File "/home/tsui/project/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py", line 601, in get_next_as_optional
    iterator_ops.get_next_as_optional(self._device_iterators[i]))
  File "/home/tsui/project/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 833, in get_next_as_optional
    iterator.element_spec)), iterator.element_spec)
  File "/home/tsui/project/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2444, in iterator_get_next_as_optional
    _ops.raise_from_not_ok_status(e, name)
  File "/home/tsui/project/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[1022] = [60, 497] does not index into param shape [256,341,512]
  [[{{node GatherNd_7}}]]
  [[MultiDeviceIteratorGetNextFromShard]]
  [[RemoteCall]] [Op:IteratorGetNextAsOptional]
```

What could be the problem? Thanks.

xiumingzhang commented 10 months ago

[60, 497] does not index into param shape [256,341,3]

Looks like a shape mismatch to me. I recommend debugging with the non-distributed version so that you can insert a breakpoint there and inspect the tensor shape. Feel free to reopen this if the issue persists.
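
For anyone hitting the same error, here is a rough sketch of the kind of bounds check one might run while inspecting the tensors. It uses plain NumPy rather than TensorFlow; `out_of_range_rows` is a hypothetical helper, not part of nerfactor, and the `[10, 20]` row is made up for illustration. Only `[60, 497]` and the shape `(256, 341, 3)` come from the error message.

```python
import numpy as np


def out_of_range_rows(indices, param_shape):
    """Return positions of index rows that fall outside param_shape.

    Hypothetical helper: mimics the bounds check that gather_nd performs,
    where each index row addresses the leading dimensions of the params.
    """
    indices = np.asarray(indices)
    # Only the first indices.shape[-1] dimensions are addressed by each row
    bounds = np.asarray(param_shape[:indices.shape[-1]])
    bad = np.any((indices < 0) | (indices >= bounds), axis=-1)
    return np.flatnonzero(bad)


# [60, 497] and (256, 341, 3) are from the error; [10, 20] is illustrative
indices = np.array([[10, 20], [60, 497]])
bad = out_of_range_rows(indices, (256, 341, 3))
print(bad, indices[bad])
```

Here the column index 497 exceeds the buffer width of 341, which suggests the pixel coordinates were sampled from an image larger than the buffers being indexed.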

tsuiiiii commented 10 months ago

> [60, 497] does not index into param shape [256,341,3]
>
> Looks like a shape mismatch to me. I recommend debugging with the non-distributed version so that you can insert a breakpoint there and inspect the tensor shape. Feel free to reopen this if the issue persists.

It turned out to be my mistake. I think I wrongly set `imh` to 512 when generating the dataset.

        # Resize
        if imh != xyz.shape[0]:
            xyz = xm.img.resize(xyz, new_h=imh)
            normal = xm.img.resize(normal, new_h=imh)
            lvis = xm.img.resize(lvis, new_h=imh)
            alpha = xm.img.resize(alpha, new_h=imh)
            rgb = xm.img.resize(rgb, new_h=imh)
        # Added: always resize rgb, even when xyz already matches imh
        rgb = xm.img.resize(rgb, new_h=imh)

After I forced `rgb` to be resized, everything went okay. Thanks anyway!
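
A consistency check along these lines might catch this kind of mismatch before training starts. This is only a sketch with NumPy placeholders: `check_same_hw` is a hypothetical helper, not part of nerfactor, and the 512×682 `rgb` size is an assumed value for illustration (only the 256×341 buffer size appears in the log).

```python
import numpy as np


def check_same_hw(buffers):
    """Raise ValueError unless all named arrays share one (height, width).

    Hypothetical helper, not part of nerfactor.
    """
    sizes = {name: arr.shape[:2] for name, arr in buffers.items()}
    if len(set(sizes.values())) > 1:
        raise ValueError(f"Inconsistent buffer sizes: {sizes}")
    return next(iter(sizes.values()))


buffers = {
    "xyz": np.zeros((256, 341, 3)),
    "normal": np.zeros((256, 341, 3)),
    "alpha": np.zeros((256, 341)),
    "rgb": np.zeros((512, 682, 3)),  # assumed mismatched size, for illustration
}

try:
    check_same_hw(buffers)
except ValueError as err:
    print(err)
```

Checking every buffer against the others, rather than comparing `imh` to `xyz.shape[0]` alone, would flag the case where only `rgb` has a different size.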

google / nerfactor

Shape pre-trained stage error #41