nerfstudio-project / nerfacc

A General NeRF Acceleration Toolbox in PyTorch.
https://www.nerfacc.com/

Can't use nerfacc in different virtual environments #164

Closed bolopenguin closed 1 year ago

bolopenguin commented 1 year ago

Hello!

I have been trying to use the same nerfacc and CUDA versions in two different virtual environments. As a result, the same code works only in the environment that triggered the nerfacc compilation, while the other one gets stuck at the first nerfacc call.

How to reproduce the bug:

import torch
import nerfacc

aabb = torch.tensor([0.0, 0.0, 0.0, 1.0, 1.0, 1.0], device="cuda:0")
rays_o = torch.rand((128, 3), device="cuda:0")
rays_d = torch.randn((128, 3), device="cuda:0")
rays_d = rays_d / rays_d.norm(dim=-1, keepdim=True)

# where it gets stuck
t_min, t_max = nerfacc.ray_aabb_intersect(rays_o, rays_d, aabb)

# Unreachable code with the second environment
print("Done")



Is there a solution to this problem that does not involve removing the compilation folder and recompiling nerfacc?

liruilong940607 commented 1 year ago

I have multiple local conda envs that have nerfacc installed, so I don't think this is related to using multiple envs. If the JIT compile hangs forever, the root cause is a .lock file generated by another (or previous) JIT call that has been left in the directory ~/.cache/torch_extensions/pyXX_cuXXX/nerfacc_cuda/. Removing this file, or the entire folder, should unblock the JIT compilation.
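As a rough illustration of the clean-up described above, the following Python sketch lists lock files under the extension cache that have not been modified for a while. The helper names `find_stale_locks` and `default_cache_root`, the 10-minute threshold, and the assumption that the lock is a regular file named `lock` or ending in `.lock` are illustrative only and are not part of nerfacc. `TORCH_EXTENSIONS_DIR` is the PyTorch environment variable that overrides the default cache location.

```python
import os
import time
from pathlib import Path


def find_stale_locks(cache_root, max_age_seconds=600.0):
    """Return lock files under cache_root not modified for max_age_seconds."""
    root = Path(cache_root).expanduser()
    stale = []
    if not root.is_dir():
        return stale
    now = time.time()
    for path in root.rglob("*"):
        # Assumption: the lock is a regular file named "lock" or "*.lock".
        is_lock = path.name == "lock" or path.name.endswith(".lock")
        if path.is_file() and is_lock:
            if now - path.stat().st_mtime >= max_age_seconds:
                stale.append(path)
    return sorted(stale)


def default_cache_root():
    # Use TORCH_EXTENSIONS_DIR if set, otherwise the path quoted in this thread.
    return os.environ.get("TORCH_EXTENSIONS_DIR", "~/.cache/torch_extensions")


if __name__ == "__main__":
    for lock in find_stale_locks(default_cache_root()):
        print("stale lock:", lock)
        # lock.unlink()  # uncomment only when no other build is running
```

The deletion is left commented out on purpose: a lock that looks old may still belong to a build running in another process, so it is safer to inspect the list first.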

I have been adding some logic to the code to try to avoid this lock issue. I would love to know the exact sequence of operations that causes it. It should not appear in nerfacc 0.3.3, but it might be something I didn't foresee.

In addition, you can install the pre-built wheels for nerfacc==0.3.4. They are identical to nerfacc==0.3.3, except that you no longer need to build them locally.

bolopenguin commented 1 year ago

Sure! Here is what I do (I don't use conda):

python3 -m venv test
source test/bin/activate
pip install -U pip setuptools
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu117
pip install nerfacc==0.3.3 
pip install packaging
python script.py # launch the script I wrote above, nerfacc compiles

# Create the second env
python3 -m venv test2
source test2/bin/activate
pip install -U pip setuptools
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu117
pip install nerfacc==0.3.3 
pip install packaging
python script.py # got stuck

Here is what is inside the cache folder after I launch the script the first time with the first env:

When launching the script the second time from the other env, the folder contains the same files plus a new lock file, as you said. Then, if I KeyboardInterrupt the script, the folder contains the following files:

And if I launch the script again, it works after a long delay, no matter which environment I use.

Anyway, I ran some more tests and noticed that the script actually works without removing the cache folder; there is just a very long delay (3-5 minutes) when I switch environments, so I assumed it was stuck.
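One way to tell a slow first call from a real hang is to print a periodic heartbeat while the call runs. The sketch below is a generic helper; `heartbeat` is a hypothetical name, not something nerfacc provides, and it assumes the blocking call releases the GIL often enough for the background thread to print.

```python
import threading
import time
from contextlib import contextmanager


@contextmanager
def heartbeat(label, interval=10.0):
    """Print the elapsed time every `interval` seconds while the block runs."""
    start = time.monotonic()
    done = threading.Event()

    def report():
        # Event.wait returns False on timeout and True once `done` is set.
        while not done.wait(interval):
            elapsed = time.monotonic() - start
            print(f"[{label}] still running after {elapsed:.0f}s")

    worker = threading.Thread(target=report, daemon=True)
    worker.start()
    try:
        yield
    finally:
        done.set()
        worker.join()
        print(f"[{label}] finished in {time.monotonic() - start:.1f}s")


# Usage with the reproduction script from this thread:
# with heartbeat("first nerfacc call"):
#     t_min, t_max = nerfacc.ray_aabb_intersect(rays_o, rays_d, aabb)
```

If the heartbeat keeps printing and the call eventually finishes, the delay is a rebuild rather than a deadlock.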

Also, I can't test my code with nerfacc 0.3.4 because I run into the problem described in #156.

In conclusion, I was wrong: it works correctly with 0.3.3, there is just a long delay when switching environments. You can close the issue if this is the expected behaviour. Thanks a lot.
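A possible way to avoid the delay when switching, not confirmed in this thread, is to give each environment its own extension cache so the two builds do not share a folder. The sketch below assumes that nerfacc builds through PyTorch's JIT extension loader, which honours the `TORCH_EXTENSIONS_DIR` environment variable. The helper names and the `torch_extensions` subdirectory inside the venv are hypothetical choices.

```python
import os
import sys
from pathlib import Path


def per_env_extensions_dir(prefix=None):
    """Return a cache path located inside the active environment."""
    # sys.prefix points at the active venv (or the base interpreter otherwise).
    base = Path(prefix if prefix is not None else sys.prefix)
    return base / "torch_extensions"  # hypothetical directory name


def use_per_env_cache(prefix=None):
    """Set TORCH_EXTENSIONS_DIR; call this before the first import of nerfacc."""
    target = per_env_extensions_dir(prefix)
    target.mkdir(parents=True, exist_ok=True)
    os.environ["TORCH_EXTENSIONS_DIR"] = str(target)
    return target


# Usage at the top of script.py:
# use_per_env_cache()
# import nerfacc
```

The trade-off is that each environment pays the full compilation cost once and stores its own copy of the build.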

liruilong940607 commented 1 year ago

Thanks for the info. Yeah, it is as expected.