madmann91 / bvh

A modern C++ BVH construction and traversal library
MIT License
948 stars 91 forks source link

ReinsertionOptimizer sometimes gets stuck in an infinite loop #84

Closed colincornaby closed 6 months ago

colincornaby commented 6 months ago

I've found some sets of triangles that cause bvh to enter an infinite loop when the builder quality is high. It seems that during the optimization stage a reinsertion is queued with the same from and to node. When reinsert_node is called with this reinsertion - a node becomes recorded as it's own parent, which causes ReinsertionOptimizer::refit_from to become stuck in a loop as it will become stuck on the node that references itself as it's own parent.

I've attached a memory dump of the tris that produce this issue (the .bin itself is the dump.) I've replicated this issue on several platforms including macOS. The triangles were supplied by an external framework. In this sample data - a node with a ID of 2207 will eventually become it's own parent causing the deadlock. InfiniteLoopMesh.bin.zip

madmann91 commented 6 months ago

Thank you for this detailed bug report. I will have a look at it as soon as I can.

madmann91 commented 6 months ago

I cannot reproduce your issue with the following code:

#include <bvh/v2/bvh.h>
#include <bvh/v2/vec.h>
#include <bvh/v2/ray.h>
#include <bvh/v2/tri.h>
#include <bvh/v2/node.h>
#include <bvh/v2/stream.h>
#include <bvh/v2/default_builder.h>

#include <fstream>
#include <iostream>
#include <vector>

using Scalar  = float;
using Vec3    = bvh::v2::Vec<Scalar, 3>;
using BBox    = bvh::v2::BBox<Scalar, 3>;
using Tri     = bvh::v2::Tri<Scalar, 3>;
using Node    = bvh::v2::Node<Scalar, 3>;
using Bvh     = bvh::v2::Bvh<Node>;
using Ray     = bvh::v2::Ray<Scalar, 3>;

using PrecomputedTri = bvh::v2::PrecomputedTri<Scalar>;
using StdOutputStream = bvh::v2::StdOutputStream;
using StdInputStream = bvh::v2::StdInputStream;

inline Tri read_tri(std::istream& is) {
    Tri tri;
    is.read(reinterpret_cast<char*>(&tri.p0), sizeof(Vec3));
    is.read(reinterpret_cast<char*>(&tri.p1), sizeof(Vec3));
    is.read(reinterpret_cast<char*>(&tri.p2), sizeof(Vec3));
    return tri;
}

int main() {
    std::ifstream file("dump.bin", std::ifstream::binary);
    if (!file) {
        std::cerr << "Cannot open raw triangle data" << std::endl;
        return 1;
    }

    file.seekg(0, std::ifstream::end);
    const size_t tri_count = static_cast<size_t>(file.tellg()) / sizeof(Tri);
    file.seekg(0);
    std::cout << "Found " << tri_count << " triangles" << std::endl;

    std::vector<Tri> tris;
    for (size_t i = 0; i < tri_count; ++i)
        tris.push_back(read_tri(file));

    std::vector<BBox> bboxes(tris.size());
    std::vector<Vec3> centers(tris.size());
    for (size_t i = 0; i < tris.size(); ++i) {
        bboxes[i]  = tris[i].get_bbox();
        centers[i] = tris[i].get_center();
    }

    bvh::v2::ThreadPool thread_pool;

    typename bvh::v2::DefaultBuilder<Node>::Config config;
    config.quality = bvh::v2::DefaultBuilder<Node>::Quality::High;

    auto bvh = bvh::v2::DefaultBuilder<Node>::build(thread_pool, bboxes, centers, config);
    std::cout << "Built BVH with " << bvh.nodes.size() << " nodes" << std::endl;
    return 0;
}

Do you have some C++ code I could use to try and reproduce this issue on my machine?

colincornaby commented 6 months ago

Hi @madmann91! Thank you for replying.

Your sample code does get stuck on my machine. For output I get:

colincornaby@Colins-MBP desktop % ./a.out Found 3693 triangles

And then nothing - and the program never exits.

Unfortunately - this means it could be an environment or compiler issues. For reference, my environment is:

I have several other environments available to me - including Windows/x86_64. I will begin comparing the same code on different environments to see if I can determine what the issue is.

Which environment did you see successful execution of that code on?

madmann91 commented 6 months ago

I am running Fedora 38 on a Ryzen 2700U laptop. Does the same code work if you omit the thread pool?

colincornaby commented 6 months ago

Turning the number of threads down to 1 does not seem to help. Also - the issue is reliably reproducible in the exact same spot on a number of Macs. That might imply something other than a threading issue or race condition.

I did some more working looking at different configurations in addition to macOS/ARM64:

So far it seems like macOS is the issue - but I'm not clear why. It is helpful that my Windows box does not show the same issue. I might be able to compare the macOS and Windows executions to figure out what the difference is.

madmann91 commented 6 months ago

Turning the number of threads down to 1 does not seem to help.

Could you try using the version of the function that does not take a thread pool instead? Just remove the thread pool from the argument list, it should just work.

colincornaby commented 6 months ago

Turning the number of threads down to 1 does not seem to help.

Could you try using the version of the function that does not take a thread pool instead? Just remove the thread pool from the argument list, it should just work.

Thanks for clarifying! Sorry I missed that version of the function.

Yes - when I remove the thread pool the code does finish.

tksuoran commented 6 months ago

I run the test on Apple M2 Max, AppleClang 15.0.0.15000309. I was not able to reproduce the issue using Xcode cmake generator. I tested Debug and Release configurations, and I tested with thread sanitizer and UB sanitizer. Colin, could you run with thread sanitizer?

colincornaby commented 6 months ago

@tksuoran - Interesting result.

When I turn on thread sanitizer, it succeeds. With thread sanitizer off, it hangs. Have you tried with thread sanitizer off?

madmann91 commented 6 months ago

Perhaps this isn't linked to threading at all, and is just a consequence of the fact that the order in which the reinsertion candidates are selected is different. Could you perhaps serialize the BVH to a file using the serialization functions, and write out the sequence of reinsertions performed on the tree before the hang? That way, I should be able to reproduce the problem on my machine.

To do that, you can add the following code to the test:

std::ofstream bvh_file("bvh.bin", std::ofstream::binary);
StdOutputStream stream(bvh_file);
bvh.serialize(stream);

You can print the sequence of reinsertions in reinsertion_optimizer.h, line 257:

std::cout << "**** iteration: " << iter << std::endl;
for (auto& reinsertion : reinsertions)
    std::cout << reinsertion.from << " " << reinsertion.to << std::endl;

Then, if you re-run the test like this:

./test > reinsertions.txt

You should get a file bvh.bin and the sequence of reinsertions should be in reinsertions.txt.

colincornaby commented 6 months ago

I think with that change the serialization would never occur because the freeze happens on this line: auto bvh = bvh::v2::DefaultBuilder<Node>::build(thread_pool, bboxes, centers, config);

Would gathering bvh.bin on medium quality to avoid reinsertion be workable?

madmann91 commented 6 months ago

Ah, right, sorry. It's still possible to add it to the default builder itself, in default_builder.h, line 43, right before the call to the reinsertion optimizer.

colincornaby commented 6 months ago

Here's a dump of the BVH. The console output of the optimizer was:

colincornaby@Colins-MBP BVH-Issue % ./a.out                           
Found 3693 triangles
**** iteration: 0
2 3
22 23
326 350
182 316
324 326
331 326
216 209
19 16
146 2206
148 2205
2205 148
132 135
**** iteration: 1
2264 2278
2280 2278
2232 2278
2230 2278
2209 2229
2212 2229
2220 2229
2207 2207

bvh.bin.zip

madmann91 commented 6 months ago

It appears your BVH has an incorrect structure. I'm not sure why, but node 2207 should be a left child (it has an odd index), but it appears it is linked by its parent as a right child. Have you made any change to the BVH or is it straight from the builder?

colincornaby commented 6 months ago

The BVH is straight from the builder. However -

While replying to you I did a diff to make sure I had not inadvertently introduced any changes to the BVH construction. I had not - however I had not yet pulled in the commit for #81. Once I pulled in the latest master branch - including that commit, the issue stopped occurring. This includes both in the sample app you provided and our internal tests for our application.

If thats an adequate resolution for you I can close this bug in since it may be linked to #81.

madmann91 commented 6 months ago

Ah, of course, this would have created invalid BVHs! That does indeed explain why it was seemingly random and not happening on my machine. Nice work figuring this one out. I'm closing this. I really don't think it will happen again, but if for some strange reason it does, don't hesitate to re-open.