nv-tlabs / XCube

[CVPR 2024 Highlight] XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies
https://research.nvidia.com/labs/toronto-ai/xcube/

Error when running python inference/sample_shapenet.py ... #7

Closed tanghaotommy closed 1 month ago

tanghaotommy commented 2 months ago

Dear authors,

Thanks for open-sourcing this great work!

I was able to install the dependencies, in particular fvdb, following the discussion in issue #2.

However, when I tried to run inference to generate a chair with

python inference/sample_shapenet.py none --category chair --total_len 20 --batch_len 4 --ema --use_ddim --ddim_step 100 --extract_mesh

I had the following error:

2024-07-02 22:04:35.808 | INFO     | __main__:<module>:96 - Sampling from XCube on chair ...
2024-07-02 22:04:35.809 | INFO     | __main__:<module>:98 - Saving results to ./results/chair_2024-07-02_22-04-35
2024-07-02 22:04:35.813 | INFO     | __main__:<module>:110 - Sampling 0 / 20
2024-07-02 22:04:35.888 | INFO     | xcube.models.diffusion:ema_scope:291 - Evaluation API: Switched to EMA weights
100it [00:23,  4.18it/s]
2024-07-02 22:04:59.857 | INFO     | xcube.models.diffusion:ema_scope:298 - Evaluation API: Restored training weights
Backend TkAgg is interactive backend. Turning interactive mode on.
Traceback (most recent call last):
  File "/private/home/haotang/anaconda3/envs/xcube/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/private/home/haotang/anaconda3/envs/xcube/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/private/home/haotang/.vscode-server-insiders/extensions/ms-python.python-2024.5.11021008/python_files/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/private/home/haotang/.vscode-server-insiders/extensions/ms-python.python-2024.5.11021008/python_files/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/private/home/haotang/.vscode-server-insiders/extensions/ms-python.python-2024.5.11021008/python_files/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/private/home/haotang/.vscode-server-insiders/extensions/ms-python.python-2024.5.11021008/python_files/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/private/home/haotang/.vscode-server-insiders/extensions/ms-python.python-2024.5.11021008/python_files/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/private/home/haotang/.vscode-server-insiders/extensions/ms-python.python-2024.5.11021008/python_files/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/checkpoint/haotang/dev/XCube/inference/sample_shapenet.py", line 120, in <module>
    res, output_x = net_model_c.evaluation_api(grids=output_x_coarse.grid, 
  File "/checkpoint/haotang/dev/XCube/inference/../xcube/models/diffusion.py", line 723, in evaluation_api
    concat_normal = res_coarse.normal_features[-1].feature # N, 3
KeyError: -1

I looked a bit further and found that res_coarse has no voxels inside, so res_coarse.normal_features is empty. This appears to be because the VAE decoder's result is empty (output_x.jdata is tensor([], device='cuda:0', size=(0, 64))) as returned from https://github.com/nv-tlabs/XCube/blob/main/xcube/models/diffusion.py#L740. Debugging further, it turns out that the 3D sparse convolution module from the fvdb pull request returns all zeros for out_feature (out_feature.jdata) at Line 313 in fvdb/nn/module.py:

            if not self.transposed:
                out_feature = kmap.sparse_conv_3d(in_feature, self.weight, backend)
            else:
                out_feature = kmap.sparse_transpose_conv_3d(in_feature, self.weight, backend)

I don't know how to debug further, as this call goes into the C++ module. I would really appreciate any feedback or insights on this issue, or anything I missed. Thank you very much!
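For completeness, here is a minimal sketch of the emptiness checks described above (it only assumes the .jdata attribute of fvdb's JaggedTensor, which already appears in the log; the helper name is mine):

    import torch

    def report_jagged(name, jt):
        # jt.jdata is the flat per-voxel feature tensor of an fvdb JaggedTensor
        if jt.jdata.numel() == 0:
            print(f"{name} is empty: feature shape {tuple(jt.jdata.shape)}")
        elif torch.count_nonzero(jt.jdata) == 0:
            print(f"{name} has voxels, but all features are zero")
        else:
            print(f"{name} looks fine: feature shape {tuple(jt.jdata.shape)}")

    # With the run above, output_x.jdata has shape (0, 64), so
    # report_jagged("output_x", output_x) reports it as empty.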

xrenaa commented 1 month ago

Hi, can you test the fvdb installation by running python setup.py test?

bluestyle97 commented 1 month ago

Same issue here.

tanghaotommy commented 1 month ago

I compiled and ran the code on an A100 GPU, and it works fine. The previous error occurred on a V100 GPU.

I cannot reproduce the same error: when I ran python setup.py test, it now throws a segmentation fault. I will report back once I can get this test to run. I vaguely remember that the last time I ran the test on the V100 GPU, it gave an error along the lines of "requires Ampere GPU". But the authors said in the paper that they trained the model on V100s, so I'm not sure.
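For reference, a quick way to confirm which architecture a machine has, using plain PyTorch (V100 is Volta, compute capability 7.0; A100 is Ampere, 8.0):

    import torch

    # Compute capability: V100 (Volta) reports (7, 0); A100 (Ampere) reports (8, 0).
    major, minor = torch.cuda.get_device_capability()
    print(f"{torch.cuda.get_device_name()}: sm_{major}{minor}")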

xrenaa commented 1 month ago

@tanghaotommy Thanks for your update!

For the paper model, we trained part of it on V100 GPUs. However, given the continuous development of the sparse 3D deep learning library, we decided to require Ampere-or-later GPUs for the public release, which is easier to maintain.

For V100 GPUs, you may still be able to use the library, but some lines of code might need to be modified. I will update the README with this information.
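In the meantime, a minimal sketch of a fail-fast guard (hypothetical, plain PyTorch, not part of the XCube codebase) that one could add before sampling so that pre-Ampere GPUs raise a clear error instead of silently producing empty grids:

    import torch

    # Hypothetical guard, not in the XCube codebase: the public release
    # targets Ampere (sm_80) or newer, so fail fast on older GPUs such
    # as V100 (sm_70) instead of silently producing empty outputs.
    if torch.cuda.get_device_capability() < (8, 0):
        raise RuntimeError(
            "XCube's released fvdb build targets Ampere (sm_80) or newer GPUs; "
            "older architectures such as V100 (sm_70) may need code modifications."
        )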