google / nerfactor

Neural Factorization of Shape and Reflectance Under an Unknown Illumination
https://xiuming.info/projects/nerfactor/
Apache License 2.0

tensorflow.python.framework.errors_impl.InvalidArgumentError error in synthesis step #11

Closed: cjw531 closed this issue 2 years ago

cjw531 commented 3 years ago

Hi, with the debugging help you provided in other issues, I was able to get to the last step.

I have 3 questions about this last step:

  1. If I use a single 2080 Ti here (as you set gpus='0'), I get an OOM allocation error, so I assigned three 2080 Tis instead. Is this an acceptable approach? I ask because you did not seem to allow multi-GPU allocation when computing the geometry buffers. Also, should I consider using imh=256 instead of 512 to reduce memory usage? The error message is as follows:

    tensorflow.python.framework.errors_impl.ResourceExhaustedError:
    OOM when allocating tensor with shape[68361728,3] and type float on
    /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Mul]
  2. What I initially did was copy the whole script (steps 1, 2, and 3) of the final step and run it with $ bash ./script.sh. However, this caused an error saying that the ckpt-2 and ckpt-10 files, which should already exist by then, could not be found. So I split it into three separate scripts and was able to get through the shape pre-training and joint optimization steps. I hope my way of running things did not cause the TensorFlow warning below (see the pipeline sketch after this list for the pattern it refers to):

    The calling iterator did not fully read the dataset being cached. 
    In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. 
    This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. 
    You should use `dataset.take(k).cache().repeat()` instead.
  3. I am getting the following error in the very last step and cannot complete your hotdog example ("Simultaneous Relighting and View Synthesis (testing)"):

    [test] Restoring trained model
    [models/base] Trainable layers registered:
        ['net_normal_mlp_layer0', 'net_normal_mlp_layer1', 'net_normal_mlp_layer2', 'net_normal_mlp_layer3', 'net_normal_out_layer0', 'net_lvis_mlp_layer0', 'net_lvis_mlp_layer1', 'net_lvis_mlp_layer2', 'net_lvis_mlp_layer3', 'net_lvis_out_layer0']
    [models/base] Trainable layers registered:
        ['net_brdf_mlp_layer0', 'net_brdf_mlp_layer1', 'net_brdf_mlp_layer2', 'net_brdf_mlp_layer3', 'net_brdf_out_layer0']
    [models/base] Trainable layers registered:
        ['net_albedo_mlp_layer0', 'net_albedo_mlp_layer1', 'net_albedo_mlp_layer2', 'net_albedo_mlp_layer3', 'net_albedo_out_layer0', 'net_brdf_z_mlp_layer0', 'net_brdf_z_mlp_layer1', 'net_brdf_z_mlp_layer2', 'net_brdf_z_mlp_layer3', 'net_brdf_z_out_layer0', 'net_normal_mlp_layer0', 'net_normal_mlp_layer1', 'net_normal_mlp_layer2', 'net_normal_mlp_layer3', 'net_normal_out_layer0', 'net_lvis_mlp_layer0', 'net_lvis_mlp_layer1', 'net_lvis_mlp_layer2', 'net_lvis_mlp_layer3', 'net_lvis_out_layer0']
    [test] Running inference
    Inferring Views:   0%|                                                     | 0/200 [00:00<?, ?it/s]
    2021-09-14 01:46:33.905210: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
    2021-09-14 01:47:05.401366: W tensorflow/core/kernels/data/cache_dataset_ops.cc:794] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
    Inferring Views:   0%|                                                     | 0/200 [02:22<?, ?it/s]
    Traceback (most recent call last):
      File "/home/jiwonchoi/code/nerfactor/nerfactor/test.py", line 209, in <module>
        app.run(main)
      File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/absl/app.py", line 312, in run
        _run_main(main, args)
      File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
        sys.exit(main(argv))
      File "/home/jiwonchoi/code/nerfactor/nerfactor/test.py", line 192, in main
        brdf_z_override=brdf_z_override)
      File "/home/jiwonchoi/code/nerfactor/nerfactor/models/nerfactor.py", line 266, in call
        relight_probes=relight_probes)
      File "/home/jiwonchoi/code/nerfactor/nerfactor/models/nerfactor.py", line 362, in _render
        rgb_probes = tf.concat([x[:, None, :] for x in rgb_probes], axis=1)
      File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
        return target(*args, **kwargs)
      File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1606, in concat
        return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
      File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1181, in concat_v2
        _ops.raise_from_not_ok_status(e, name)
      File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
        six.raise_from(core._status_to_exception(e.code, message), None)
      File "<string>", line 3, in raise_from
    tensorflow.python.framework.errors_impl.InvalidArgumentError: OpKernel 'ConcatV2' has constraint on attr 'T' not in NodeDef '[N=0, Tidx=DT_INT32]', KernelDef: 'op: "ConcatV2" device_type: "GPU" constraint { name: "T" allowed_values { list { type: DT_INT32 } } } host_memory_arg: "values" host_memory_arg: "axis" host_memory_arg: "output"' [Op:ConcatV2] name: concat

    It seems that the line rgb_probes = tf.concat([x[:, None, :] for x in rgb_probes], axis=1) is what causes the issue, but I am not sure how to debug it (see the repro sketch after this list).
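To make the caching warning quoted in question 2 concrete, here is a minimal, self-contained sketch of the two pipeline orderings the warning contrasts. This is illustrative only and not nerfactor's actual input pipeline:

    import tensorflow as tf

    ds = tf.data.Dataset.range(10)

    # Pattern the warning flags: cache() wraps all 10 elements, but take(5)
    # stops reading early, so the partially filled cache is discarded (and
    # rebuilt) on every repetition.
    flagged = ds.cache().take(5).repeat()

    # Ordering the warning recommends: cache only the 5 elements that are
    # actually consumed, so the cache always ends up fully populated.
    recommended = ds.take(5).cache().repeat()

    for x in recommended.take(5):
        print(int(x))  # 0 1 2 3 4

When the partial read is intentional, this warning usually costs only recomputation rather than correctness, so it may be unrelated to the error in question 3.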
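As for the InvalidArgumentError in question 3: the "[N=0, ...]" in the message indicates that ConcatV2 was built with zero inputs, i.e. the rgb_probes list was empty, which prevents TensorFlow from inferring the dtype attr T. A minimal repro sketch, assuming the same TF 2.x eager behavior as in the traceback (the empty rgb_probes here is a hypothesis, not a confirmed state):

    import tensorflow as tf

    rgb_probes = []  # hypothesis: no light probes were ever loaded

    try:
        # With zero inputs, attr T cannot be inferred, the node is built
        # with N=0, and GPU kernel selection fails, matching the
        # "[N=0, Tidx=DT_INT32]" in the traceback above.
        tf.concat([x[:, None, :] for x in rgb_probes], axis=1)
    except tf.errors.InvalidArgumentError as err:
        print(err.message)

If this matches, the question becomes why the probe list is empty in the first place.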

Thank you in advance.

XiaoKangW commented 3 years ago

@cjw531 I also have the same problems as you, and I had to adjust the batch size, but I still run into memory problems.

Jiangyu1181 commented 2 years ago

@cjw531 I also have the same problems as you. But no_batch=True is set, so I can't change the batch size.

hdupuyang commented 10 months ago

It seems that you haven't downloaded the light probes. You can download them from the author's project page, and then set the 'test_envmap_dir' term in lr5e-3.ini.
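A hedged sketch of what that change might look like (the section header and the path are placeholders; check the actual layout of your lr5e-3.ini):

    [DEFAULT]
    # Placeholder path; point this at the light probes downloaded from
    # the project page (https://xiuming.info/projects/nerfactor/).
    test_envmap_dir = /path/to/light-probes/test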