The program stopped at this line: // PlyLoader.cpp void PlyLoader::processBlock(uint32_t *data, int x, int y, int z, int w, int h, int d) { ... memset(_counts, 0, elemCount*sizeof(uint16_t)); // stopped here ... }
and popped up this message: Unhandled exception at 0x000007FEE20BC9E7 (msvcr120.dll) in VoxelBuilder.exe: 0xC0000005: Access violation writing location 0x0000000055E11000.
Okay, I just pushed a fix. Can you confirm that it works?
Just FYI, I also pushed a range of fixes for compilation errors and warnings under MSVC, that might cause a merge conflict on your local copy.
It works well with a lutMemory and dataMemory size of (1024 * 1024 * 1024) after I updated to the latest code. But it won't work with a larger lutMemory and dataMemory size (2048 * 1024 * 1024); the program seems to enter an infinite loop.
This works fine on my machine.
Note that the expression 2048*1024*1024 causes an integer overflow, since integer literals in C/C++ are usually 32 bit (I believe MSVC also gives a warning). You need to make sure the expression does not overflow before it is stored in a size_t.
Something like size_t(2048)*1024*1024 works for me.
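For illustration, a minimal snippet showing the difference (the constant names here are made up):

```cpp
#include <cstddef>

// All three literals are 32-bit ints, so the multiplication happens in int
// and overflows (2^31 does not fit in a signed 32-bit int) before the result
// is widened to size_t. MSVC reports an integral constant overflow here.
static const size_t tooBig  = 2048 * 1024 * 1024;

// Casting the first operand makes the whole product 64-bit arithmetic.
static const size_t correct = size_t(2048) * 1024 * 1024;
```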
I will try again later and report the results. Thank you so much!
Hi, Benedikt,
You are right. There is indeed an integral constant overflow caused by the expression 2048*1024*1024. Sorry that I did not notice this. It works fine on my machine now.
I also noticed that it takes longer to generate voxels at the same resolution when using a larger memory size. Take the xyzrgb_dragon model as an example: to generate a 1024-resolution voxel volume, it takes 45s with a 2048 * 1024 * 1024 memory size, while it takes 28s with 1024 * 1024 * 1024. It seems that for low or medium resolutions, a smaller memory size is better. Is that true?
Good point. I think this is mainly because it memsets some redundant things. I just pushed a fix, hopefully that should speed things up.
Also, just to note: the data memory is used to cache cubical voxel blocks with power-of-two side lengths. A side effect of that is that more data memory only improves performance at every 8-fold increase. So, for example, to cache a block of size 256^3, the program uses 288MB of data memory. To cache the next largest block of 512^3, it needs 8 times that, i.e. 2304MB. If you supply 2048MB, this is a little bit short, and the program falls back to 256^3.
With the newest patch, the program should hopefully not be slower if you supply 2048MB than if you supply 288MB, but to make it faster you need to bump the 2048 to 2304MB, so it can fit a larger block in memory. For the next speed increase it would need more than 18GB, which is probably outside of what any reasonable machine has :) In other words, increasing the data memory beyond 2304MB is not going to have any effect.
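To make the 8-fold jumps concrete, here is a small illustrative sketch (not the project's actual code; the helper name and the ~18 bytes per cached voxel are assumptions derived from the 288MB-for-256^3 figure above):

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: largest power-of-two block side whose cache fits in
// dataMemory, at an assumed cost of ~18 bytes per voxel (288MB for 256^3).
int largestCachedBlockSide(size_t dataMemory, size_t bytesPerVoxel = 18) {
    size_t side = 1;
    // Doubling the side length multiplies the memory cost by 8.
    while ((side*2)*(side*2)*(side*2)*bytesPerVoxel <= dataMemory)
        side *= 2;
    return int(side);
}

int main() {
    printf("%d\n", largestCachedBlockSide(size_t(2048)*1024*1024)); // 256: 2048MB is just short of a 512^3 block
    printf("%d\n", largestCachedBlockSide(size_t(2304)*1024*1024)); // 512: 2304MB fits exactly
}
```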
Hi, Benedikt,
Actually, we have an HP Z820 workstation with 128GB of system memory. We want to visualize an entire airplane model by ray casting a super-high-resolution (maybe 32768^3 or 65536^3) SVO. Currently I am using your project to handle this, but I am not sure whether it is capable of it. I guess there will be two issues: 1) building speed, which may take a couple of days; 2) the storage size may be very large, which may require more compact compression schemes. Currently I am testing SVO generation at 32768^3 resolution. Do you have any suggestions about this, or ideas to improve the project toward higher resolutions and performance?
Interesting!
At these large resolutions, I think there are two issues: 1) The triangle-to-voxel conversion code suffers from performance degradation when used at very high resolutions. This can be fixed by using a smarter acceleration structure. 2) Although the voxel conversion code can deal with 64-bit addresses, the final SVO itself can only address 4GB before it gives up. This is because child pointers are stored with relative addressing at most 15 bits wide, and addresses larger than that overflow to a global "far pointer" table with 32-bit precision. If an address exceeds 32 bits, the SVO cannot address it. I will have to run it on a few very large models and see what the octree size looks like. I might have to rewrite parts of it to combine relative and far pointers to extend the pointer size to 47 bits.
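For reference, here is a rough sketch of how such a relative/far-pointer scheme can be laid out; the exact bit layout and names in the project may differ.

```cpp
#include <cstdint>
#include <vector>

// A 32-bit descriptor reserving 15 bits for a relative child offset plus a
// "far" flag; offsets that don't fit in 15 bits spill into a global table of
// 32-bit far pointers.
struct Descriptor {
    uint32_t bits; // [ childPtr:15 | far:1 | validMask:8 | leafMask:8 ]

    uint32_t childPtr() const { return bits >> 17; }
    bool     isFar()    const { return (bits >> 16) & 1; }
};

// Index of the first child of the descriptor stored at `selfIndex`.
uint64_t firstChildIndex(Descriptor desc, uint64_t selfIndex,
                         const std::vector<uint32_t> &farPointers) {
    if (!desc.isFar())
        return selfIndex + desc.childPtr();           // relative offset, at most 15 bits
    return selfIndex + farPointers[desc.childPtr()];  // 32-bit offset from the far-pointer table
}
```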
I will have to think about this, but I'll probably rewrite the conversion code to resolve 1). 2) requires testing, but I have some high resolution models I can run it on. I'm pretty busy with other projects, but hopefully I have some time over the weekend to look at this.
This code base is pretty old now, but while I'm at it I could also convert it to a CMake project and incorporate C++11 features. Do you have a C++11-capable compiler for your project (MSVC 2013 works), and are you familiar with CMake?
I really appreciate that you may have time to try to solve the problems you raised.
Yes, I am using MSVC 2013 now and I am quite familiar with CMake. I also have some large models, e.g. the UNC PowerPlant model and the Lucy model, which are publicly available for download. I think I can participate in testing the new code.
Just to give a quick update, I've converted the project to C++11 and CMake and started doing benchmarking and performance optimization. The triangle to octree conversion is now more than 8x faster on my machine, and I might be able to improve it further.
I will do some more performance improvements and testing and hopefully push the code tomorrow. It's a larger rewrite and the new version might be less stable initially, so some testing is required.
Wow, well done! I will test the new code as soon as it is pushed.
BTW, will large resolutions (like 32768^3 or 65536^3) be supported in this update?
I managed to do more performance improvements today, and I think it will be ready to push tomorrow (still need to do some cleanup). The speedup is quite sizable, although I will do proper benchmarking first.
I was hoping to resolve the issue with octree sizes as well, but it's more complicated than I thought. Unfortunately this is a problem with top-down construction - the octree does not know that it needs to use larger pointers until it's too late to embed them into the data structure. One solution is to always extend the descriptors to 64 bits, although this would double the octree size and may affect raymarching performance. I have an idea using chunked memory allocators that don't build the contiguous memory block until the end, allowing data to be injected after construction has finished. It will take some more time to code, though. I'll keep you updated.
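A loose sketch of what such a chunked buffer could look like (class name, chunk size, and interface are all made up here; it only illustrates that entries stay addressable for late pointer fixups until the flat block is assembled at the end):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

class ChunkedBuffer {
    static const size_t ChunkSize = 1 << 20; // entries per chunk (arbitrary)
    std::vector<std::unique_ptr<uint32_t[]>> _chunks;
    size_t _size = 0;

public:
    void push(uint32_t value) {
        if (_size % ChunkSize == 0)
            _chunks.emplace_back(new uint32_t[ChunkSize]);
        _chunks.back()[_size % ChunkSize] = value;
        _size++;
    }
    // Entries remain writable after construction, which is what makes
    // late pointer fixups possible.
    uint32_t &operator[](size_t i) { return _chunks[i / ChunkSize][i % ChunkSize]; }

    // Only at the very end is everything copied into one contiguous block.
    std::vector<uint32_t> flatten() const {
        std::vector<uint32_t> result(_size);
        for (size_t i = 0; i < _size; ++i)
            result[i] = _chunks[i / ChunkSize][i % ChunkSize];
        return result;
    }
    size_t size() const { return _size; }
};
```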
Ok, it's pushed!
Notable performance changes:
For octree size:
For usability:
I also replaced the Makefile with a CMake project and converted the code to use C++11 features. I updated the Readme with build instructions.
Before, creating an 8192^3 voxel volume from 500k triangles took more than 40 minutes on my machine. Now it only takes 105 seconds (more than a 20x speedup). Also, creating a voxel volume with resolution 32768^3 from a mesh with 7 million triangles now only takes 25 minutes on my machine (the resulting octree is 8GB in size).
The nice thing about the new code is that the runtime only depends on the number of non-zero voxels, not the total number of voxels. In other words, the runtime increases quadratically rather than cubically, i.e. doubling the resolution only increases the conversion time by 4x. The octree size also scales quadratically, so on my machine a 64k voxel volume would take about 2 hours to produce and would take up 32GB of space (I don't have that much memory, so I can't test it).
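As a back-of-the-envelope check of that scaling, extrapolating the measured 32768^3 numbers above (25 minutes, 8GB) to 65536^3:

```cpp
#include <cstdio>

int main() {
    double baseMinutes = 25.0, baseGiB = 8.0;             // measured at 32768^3
    double factor = (65536.0/32768.0)*(65536.0/32768.0);  // doubling the resolution -> 4x
    printf("~%.0f minutes, ~%.0f GiB\n", baseMinutes*factor, baseGiB*factor); // ~100 minutes, ~32 GiB
}
```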
I've tested the new code with two models.
1) A mesh with 5 million triangles at resolution 32768^3: the resulting octree is 24GB (about 8 hours).
2) A mesh with 20 million triangles at resolution 65536^3: the resulting octree is 58GB (about 24 hours).
I lost the log file, so the build times above are approximate.
Next, I will test a model with 350 million triangles at resolution 65536^3 (it may take several days to build...). Also, the model consists of thousands of .obj files, so first I need to add some code to load these .obj files. I will report the results if I manage to do this.
Thanks for the awesome work!
I'm glad to hear that it works!
I just pushed another update that enables multi-threaded triangle conversion. For large triangle meshes (tested with 7 million tris) this made conversion ~50% faster.
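As a generic illustration of the idea (not the project's actual code): each worker handles an interleaved subset of the triangles. `voxelizeTriangle` is a hypothetical stand-in for the per-triangle conversion work, and real code would also need to synchronize writes to the shared voxel data.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

void convertTriangles(size_t triangleCount,
                      const std::function<void(size_t)> &voxelizeTriangle) {
    unsigned threadCount = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t)
        workers.emplace_back([=]() {
            // Interleaved indices give a rough load balance across workers.
            for (size_t i = t; i < triangleCount; i += threadCount)
                voxelizeTriangle(i);
        });
    for (std::thread &w : workers)
        w.join();
}
```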
I also implemented octree compression when saving/loading. It's using the LZ4 compression library, which can achieve extremely fast decoding/encoding rates (~2GB/s when decompressing) while yielding reasonable compression ratios. For a 32k octree, this reduced the file size from 8.1 GB to 2.7 GB for me, roughly ~3x reduction (this heavily depends on the model though). The in-memory size is unchanged, but the size on disk will be reduced. Notably, this actually reduced loading times for me, since the decompression takes less time than loading the uncompressed file to memory, at least for large octrees.
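For reference, a rough sketch of per-buffer LZ4 usage along these lines (the project may well use LZ4's streaming API instead, since the simple API below works on buffers of at most ~2GB, so a multi-gigabyte octree would be split into chunks):

```cpp
#include <lz4.h>
#include <stdexcept>
#include <vector>

std::vector<char> compressBlock(const char *src, int srcSize) {
    std::vector<char> dst(LZ4_compressBound(srcSize)); // worst-case output size
    int written = LZ4_compress_default(src, dst.data(), srcSize, int(dst.size()));
    if (written <= 0)
        throw std::runtime_error("LZ4 compression failed");
    dst.resize(written);
    return dst;
}

std::vector<char> decompressBlock(const char *src, int srcSize, int decompressedSize) {
    std::vector<char> dst(decompressedSize); // caller must know the original size
    if (LZ4_decompress_safe(src, dst.data(), srcSize, decompressedSize) < 0)
        throw std::runtime_error("LZ4 decompression failed");
    return dst;
}
```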
Is it possible to employ the block-based compression scheme as described in ESVO? It seems more compact for in-memory size.
I think it's possible, but there are two problems:
Currently I also don't have spare cycles for this project, so I'm going to put this on the backlog.
Unrelated to your question: this thread has drifted a bit from the original issue, which was resolved a while ago. I'm going to close this issue, but feel free to open new issues for any questions/problems you have.
Hi, Benedikt,
Just to report a strange problem I encountered when I used a larger lutMemory and dataMemory to generate voxels at different resolutions.
My computer configuration: 10GB memory, Win7 64-bit; the program is compiled with VS2013 in a 64-bit release configuration (with the latest code). Tested model: xyzrgb_dragon
I used a lutMemory and dataMemory size of double the default value: static const size_t lutMemory = 2 * 512 * 1024 * 1024; static const size_t dataMemory = 2 * 512 * 1024 * 1024;
The test results are as below:
resolution | result
256 | ok
512 | ok
1024 | crashed
2048 | crashed
The problem can be easily reproduced. Can you help us figure out what is going wrong? Thanks in advance!
Best, Junjie