jalberse / RayTracingInOneWeekendInRust

Ray Tracing In One Weekend, in Rust

HRPP #75

Open jalberse opened 1 year ago

jalberse commented 1 year ago

Implementation of hash-based ray path prediction for acceleration structure traversal elision.

A preliminary step may be to change from an f64-based system to an f32-based system. This is more standard and efficient, and the HRPP paper uses f32 for its hash function (though we could develop one for f64). We could potentially make the renderer use either, but that may be overcomplicating things. I used f64 originally because I saw some visual artifacts while using f32 for very large objects. I think we can accept those artifacts, as we realistically render such large objects very rarely, and they could potentially be addressed in other ways.
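Not part of the repo as it stands, but one way to stage an f64 → f32 migration is a crate-wide type alias so the precision can be flipped in one place while porting. The names here (`Float`, `EPSILON`, `Vec3`) are illustrative, not the crate's actual items:

```rust
// Hypothetical staging approach: a single alias controls the working
// precision for the whole crate. Flip it back to f64 to compare renders.
pub type Float = f32;

pub const EPSILON: Float = 1e-4;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Vec3 {
    pub x: Float,
    pub y: Float,
    pub z: Float,
}

impl Vec3 {
    pub fn length_squared(&self) -> Float {
        self.x * self.x + self.y * self.y + self.z * self.z
    }
}
```

With this in place, the port is mostly a matter of replacing `f64` with `Float` and fixing the literals the compiler complains about.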

jalberse commented 1 year ago

https://arxiv.org/abs/1910.01304

jalberse commented 1 year ago

A side note is that the Renderer should really be called an Integrator I think. It's okay though.

jalberse commented 1 year ago

Of course, we could use f64 and just make a hashing function for that, but I want to be consistent with the paper for comparison purposes.

jalberse commented 1 year ago

Note to self: Using this command for some testing:

cargo run --release -- showcase --image-width 800 --aspect-ratio 1.0 1.0 --samples-per-pixel 2000 --cam-look-from 478.0 278.0 -600.0 --cam-look-at 278.0 278.0 0.0 --cam-vertical-fov 40.0 > test.ppm

jalberse commented 1 year ago

While changing from f64 to f32, everything is mostly straightforward.

marble.rs has an issue where the turbulence API uses f64. I've just cast for now, but noting it in case there's some visual issue with it.

jalberse commented 1 year ago

The hashing function is implemented.
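I don't know the exact hash used in the commit, but the general HRPP-style scheme (quantize each f32 component of the ray origin and direction by truncating mantissa bits, then fold the components into one table key) can be sketched like this; the fold constant and function names are my own:

```rust
// Sketch of an HRPP-style ray hash (not necessarily the repo's exact code).
// Keep the sign, exponent, and top `precision_bits` mantissa bits of each
// component, so nearby (co-local) rays collapse onto the same key.
fn quantize(value: f32, precision_bits: u32) -> u32 {
    let bits = value.to_bits();
    // f32 has a 23-bit mantissa; drop the low-order bits we don't keep.
    let drop = 23 - precision_bits;
    bits >> drop
}

fn hash_ray(origin: [f32; 3], direction: [f32; 3], precision_bits: u32) -> u64 {
    let mut h: u64 = 0;
    for &component in origin.iter().chain(direction.iter()) {
        // Simple multiply-xor fold; the paper concatenates bits instead.
        h = h.wrapping_mul(0x100000001b3) ^ u64::from(quantize(component, precision_bits));
    }
    h
}
```

Two rays that differ only below the kept precision hash identically, which is what lets the table predict for co-local rays.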

I think I want an hrpp::Predictor struct that stores the hash table and contains whatever helper functions I need to do stuff.

It can also store statistics we need, and we can write out those statistics at the end of rendering.
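A minimal sketch of that `Predictor` shape, with guessed field names: a hash table from ray hash to predicted node index, plus counters that can be dumped after rendering.

```rust
use std::collections::HashMap;

// Sketch only: the real struct's fields and methods may differ.
#[derive(Default)]
pub struct Predictor {
    // Maps a quantized ray hash to a predicted BVH node index.
    table: HashMap<u64, usize>,
    // Statistics to report at the end of the render.
    pub true_positives: u64,
    pub false_positives: u64,
    pub no_predictions: u64,
}

impl Predictor {
    pub fn predict(&self, ray_hash: u64) -> Option<usize> {
        self.table.get(&ray_hash).copied()
    }

    pub fn record(&mut self, ray_hash: u64, node_index: usize) {
        self.table.insert(ray_hash, node_index);
    }

    pub fn table_len(&self) -> usize {
        self.table.len()
    }
}
```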

jalberse commented 1 year ago

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [

    {
        "type": "lldb",
        "request": "launch",
        "name": "Debug executable 'shimmer'",
        "cargo": {
            "args": [
                "build",
                "--bin=shimmer",
                "--package=shimmer"
            ],
            "filter": {
                "name": "shimmer",
                "kind": "bin"
            }
        },
        "args": ["earth", "-w", "400"],
        "cwd": "${workspaceFolder}"
    },
    {
        "type": "lldb",
        "request": "launch",
        "name": "Debug unit tests in executable 'shimmer'",
        "cargo": {
            "args": [
                "test",
                "--no-run",
                "--bin=shimmer",
                "--package=shimmer"
            ],
            "filter": {
                "name": "shimmer",
                "kind": "bin"
            }
        },
        "args": [],
        "cwd": "${workspaceFolder}"
    }
]

}

jalberse commented 1 year ago

The change to a Vec-backed BVH seems to be working as intended. I can add parent pointers now.

jalberse commented 1 year ago

Showcase with 2000 spp took 1038 seconds with the Vec-backed BVH (on commit 20e96bf1). Another run took 974 seconds, another 984.

jalberse commented 1 year ago

Showcase with 2000 spp took 825 seconds (commit 71095ed) - i.e. before the change to the Vec-backed BVH.

I don't love that slowdown. Might need more trials. Not sure where the slowdown would come from. Possibly some change in how efficiently memory is being accessed. Not sure it's worth the time chasing down on this project but I might run some more trials to see performance spread...

jalberse commented 1 year ago

A test before the f32 change took 720 seconds (commit e9089795)

Something I've noticed is that, after the commit changing from f64 to f32, many of the white spheres in the showcase render appear darker - like more of a grey. I suspect that change has a bug somewhere that I did not recognize.

jalberse commented 1 year ago

For the darker spheres, I'm suspicious that the overlapping spheres might have some odd choices in terms of which sphere gets "hit" as we traverse down the BVH, leading to the darker spheres. This would go away with many many many samples, so try that with our most recent change (10k spp) to see if that fixes the "issue".

The first step is probably carefully reviewing the change to f32, though.

jalberse commented 1 year ago

It's just occurred to me the additional runtime cost is probably from memory allocation of the nodes Vec as we push onto it. I should reserve it. We should know roughly the required size based on the quantity of objects in the scene.

jalberse commented 1 year ago

With one object per leaf node, a BVH over n objects has 2n - 1 nodes, so reserving 2n + 1 slightly overshoots the capacity. I'm okay with that, I think; I would rather take a bit more memory than we need.
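A sketch of that reservation, assuming a constructor that knows the object count up front (names hypothetical):

```rust
// Hypothetical constructor shape: reserve the node Vec up front so pushes
// during BVH construction never reallocate. A binary tree over n leaves has
// 2n - 1 nodes, so 2n + 1 overshoots by a couple of entries, which is fine.
fn reserved_node_capacity(object_count: usize) -> usize {
    2 * object_count + 1
}

struct Bvh<T> {
    nodes: Vec<T>,
}

impl<T> Bvh<T> {
    fn with_object_count(object_count: usize) -> Self {
        Bvh {
            nodes: Vec::with_capacity(reserved_node_capacity(object_count)),
        }
    }
}
```

`Vec::with_capacity` guarantees at least the requested capacity, so construction pushes stay allocation-free as long as the estimate holds.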

jalberse commented 1 year ago

Actually, that's probably not the issue, though it should be done. The time is being spent in the actual rendering loop; it builds the BVH quite quickly, and that's the only time we're allocating for the nodes array. So the performance issue is coming from somewhere else.

Still not sure how much I should care about this apparent performance dip. Maybe get more data with the not-Vec-backed BVH tree. Is that distribution of runtimes comparable, and I just got bad samples?

jalberse commented 1 year ago

Changing to with_capacity() still resulted in a 1030-second render for the Vec-backed BVH. It seems to average around 1000 seconds.

I think the next step is getting more samples for the non-Vec-backed version. I'm also lost as to why changing from f64 to f32 apparently slowed down the render as well. I might just leave all this performance stuff alone if I can't find a good answer, though.

jalberse commented 1 year ago

Non-vec-backed after a few trials hangs out around 790 seconds. Faster than vec-backed.

Profiling reveals the vec-backed BVH is spending relatively more time in BvhNode::hit() (i.e. doing traversals). Maybe branching on the new match statements? I wouldn't think that's more expensive than dynamic dispatch from dyn Hittable. Maybe the fact we're passing around the reference to the nodes vec? Maybe change that to a slice (that's cleaner anyways), but I wouldn't expect it to be expensive...

jalberse commented 1 year ago

It could be an issue with data locality I suppose. Children nodes are not nearby their parent nodes in the vector (the left child of the root node is in the middle of the list, while the parent node is at the end of the list). So if we're jumping around the Vec a lot, we could be losing cache efficiency that the pointer-based approach was able to get. If I were to dive into this further, I think I would investigate that.

But for now, I might just need to accept the apparent regression. This isn't a high-performance renderer, just a toy, and there's no obvious solution jumping out at me. I just want to wrap up the feature development; the shimmer project can worry about higher-performance acceleration structures.

jalberse commented 1 year ago

Maybe the issue with darker spheres after f32 is due to insufficient precision in cross products? I know pbrt uses double (v3) or error-reducing techniques (v4) for cross(). Could improve that. It's something I only see in those spheres; the rest of the scene looks correct, I think. Maybe since they're overlapping there's something going on? I'm trying with 10k samples with f32. I wonder if increasing the sample count will make the issue go away.

jalberse commented 1 year ago

The issue with dark spheres persists with 10k samples per pixel. Another solution is required.

jalberse commented 1 year ago

There's also some strange striping on some of the spheres. It's odd that only they are affected, and not other objects in the scene...

jalberse commented 1 year ago

A test render with many fewer non-overlapping spheres exhibited the same issue. So, I think it's not an issue with them being in the BVH or overlapping. So, I think there's an issue with the loss of precision elsewhere.

jalberse commented 1 year ago

Testing with a metal material on the spheres, I don't see any issue. So maybe we can dig into lambertian handling?

jalberse commented 1 year ago

It occurred to me there might be an issue with self-intersection of scattered rays, but adding an EPSILON * 100 offset in the normal direction for the scattered ray origin didn't seem to help, so I don't think it's that.

I'm going to see if it persists with cubes compared to spheres. Is it an issue with precision for spheres? Or does it affect other geometry?

jalberse commented 1 year ago

The materials look correct for the cubes. I suspect there's an issue with the lack of precision for spheres specifically. I suppose I can try to use f64 for some operations there or other techniques for reducing error.

jalberse commented 1 year ago

The UV generation shouldn't matter - no matter the value, we're returning the same value from the SolidColor texture. So the issue is somewhere else.

jalberse commented 1 year ago

Note I'm using this command to test just the cluster:

cargo run --release -- showcase --image-width 800 --aspect-ratio 1.0 1.0 --samples-per-pixel 100 --cam-look-from 100.0 550.0 100.0 --cam-look-at -20.0 350.0 475.0 --cam-vertical-fov 40.0 > test.ppm

jalberse commented 1 year ago

Okay, using f64 for intermediate values in the sphere::hit() fn appears to have solved the issue. I'll do a render with more samples and then confirm that change.
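A sketch of that fix using the textbook quadratic ray-sphere intersection (not necessarily the repo's exact code): promote the intermediate terms to f64 so the discriminant doesn't suffer catastrophic cancellation, then cast the resulting t back down to f32.

```rust
// Sketch: ray-sphere intersection with f64 intermediates over f32 inputs.
// Returns the nearest positive hit distance t, or None on a miss.
fn sphere_hit_t(origin: [f32; 3], dir: [f32; 3], center: [f32; 3], radius: f32) -> Option<f32> {
    // Promote everything to f64 before forming the quadratic coefficients.
    let oc: Vec<f64> = origin.iter().zip(&center).map(|(o, c)| f64::from(*o - *c)).collect();
    let d: Vec<f64> = dir.iter().map(|&x| f64::from(x)).collect();
    let a = d.iter().map(|x| x * x).sum::<f64>();
    let half_b = oc.iter().zip(&d).map(|(o, di)| o * di).sum::<f64>();
    let c = oc.iter().map(|x| x * x).sum::<f64>() - f64::from(radius) * f64::from(radius);
    // The discriminant is where f32 loses precision for large spheres.
    let discriminant = half_b * half_b - a * c;
    if discriminant < 0.0 {
        return None;
    }
    let t = (-half_b - discriminant.sqrt()) / a;
    if t > 0.0 { Some(t as f32) } else { None }
}
```

The scene data stays f32; only the solve runs in double precision, which is cheap relative to traversal.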

jalberse commented 1 year ago

A challenge here is that the predictor table must be updated while rendering, so multiple threads will need mutable access to it - once for each ray.

So I think we will need something like an Arc<Mutex<...>>.

The original HRPP implementation guards predictor access with a Mutex, so I suppose we can go with a similar implementation. The paper didn't mention it, but I wonder what impact this could have as threads wait for access to the predictor on each ray, or if it's not a big issue.

Note they also have the stats for the predictor behind a mutex, which I think I'll need to do as well.
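The sharing shape described above, sketched with a stand-in struct (a bare counter in place of the real table) to show the Arc<Mutex<..>> plumbing across render threads:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Stand-in for the predictor + stats that every thread must mutate.
#[derive(Default)]
struct SharedStats {
    rays: u64,
}

// Each worker takes the lock once per ray; this per-ray locking is the
// contention concern raised above.
fn render_tile(stats: &Arc<Mutex<SharedStats>>, rays_in_tile: u64) {
    for _ in 0..rays_in_tile {
        stats.lock().unwrap().rays += 1;
    }
}
```

Every clone of the `Arc` points at the same `Mutex`, so the counts from all threads land in one place; the open question is what that lock costs per ray.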

jalberse commented 1 year ago

I want to ensure only the predictor is behind the mutex. So it needs to be held separately from the HittableList, so that the HittableList (including the BVH) can be read-only shared via Send/Sync.

I'll need a way to tie the BVH to the predictor, though, since there can theoretically be multiple BVHs and predictions only make sense with one. Since the BVH is part of the HittableList, putting the predictor in the BVH doesn't work, as it means the BVH would also need to be behind the mutex.

I think we need an approach where an optional predictor is passed alongside the HittableList, and passed into hit(). Most implementations will ignore it, but the BVH will use the predictor if present. We can maybe have some ID stored in the BVH and Predictor, and only use the predictor in Bvh::hit() if it matches. Or, I guess hit() would expect a list of predictors, provided with the IDs of any acceleration structures in the scene.

An alternative is to allow only one acceleration structure in our scene, and we expect it to be at the highest level of the scene. They wouldn't implement Hittable, they'd have their own bespoke hit() (or maybe traverse()) function. This would let us not worry about matching up the predictor and acceleration structures. But it also runs against my current architecture, and maybe makes things less flexible - but I don't know why someone would want multiple acceleration structures in one scene.

jalberse commented 1 year ago

If we wanted to limit ourselves to 1 acceleration structure, we would:

After sitting with it for a while, I think this approach is best. It's the simplest, and I think you'll only ever want one acceleration structure. I think I'll move towards developing this and see how it goes.

jalberse commented 1 year ago

I guess that there could be situations where you want nested acceleration structures? Like, triangles within the same mesh would all have the same materials/maps/etc, and maybe it makes sense to have them in the same BVH for some data locality reason...

Having multiple acceleration structures in one scene could also, maybe, be a footgun. If two acceleration structures cover the same domain space, but one holds half the geometry and the other holds the other half, it's likely that any given ray would traverse a full BVH, find no collision, and then go have to traverse the other one. If we enforce one structure, we enforce a neat partition of the domain. Maybe this is better.

jalberse commented 1 year ago

Although, the advantage of a BVH being able to hold another BVH was really that you could apply a transformation to the sub-BVH. You could make a list of triangles into a HittableList, make a BVH, and then rotate it and transform it. Losing that is actually significant, I think - you wouldn't want a HittableList after a Bvh leaf. And you want to be able to apply transforms to lists of geometry, since it makes sense with e.g. tris of a model.

So, I'm now partial to allowing for nested BVHs in a single scene.

Maybe the Predictor needs to serve as the predictor across all BVHs. If the prediction value is an Arc, then I think it doesn't matter which Bvh that BvhNode is a part of (directly, they should all be under a "top level" Bvh). If BVHs overlap, I suppose we could get more false positives. The predictor could also be moved into ray_color() outside the Bvh::hit(), since it's independent of a particular Bvh.

Except, the problem with that is that BvhNodes store node indices, not direct pointers. So, handling of BvhNodes must be done within a Bvh. The indexed approach is necessary due to ownership constraints (requiring parent pointers for go-up-level).

If we want just one predictor, then we need a way to get from the prediction to the Bvh's node list. This runs afoul of a lot of ownership problems I've dealt with prior.

Maybe a 1:1 BVH to Predictor is the best.

In fact, that's what the original paper does. We will do the same. We will have one Predictor per BVH, allowing for nested BVHs, as nested BVHs are very useful with how we represent transformations.

We can't have the BVH contain the predictor, though, which was our original problem. So I will need to have a solution to that. Maybe matching IDs really is OK.

jalberse commented 1 year ago

A (I think) novel extension of HRPP would be to have the prediction be not a single predicted node, but a list of predicted nodes. In the case of a false positive, after traversing the BVH, we add the resulting prediction to the entry in the prediction table (it shouldn't match what was there previously). This may be able to reduce the number of false positives. It would increase the cost of false positives, though, since we would have to check against multiple predictions.
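A sketch of that extension, where each table entry is a list of candidate nodes and a false positive appends the node found by the full traversal (names hypothetical):

```rust
use std::collections::HashMap;

// Sketch of the multi-prediction idea: one ray hash maps to several
// candidate leaf nodes instead of a single prediction.
#[derive(Default)]
struct MultiPredictor {
    table: HashMap<u64, Vec<usize>>,
}

impl MultiPredictor {
    // All candidates to test before falling back to a full traversal.
    fn candidates(&self, ray_hash: u64) -> &[usize] {
        self.table.get(&ray_hash).map(Vec::as_slice).unwrap_or(&[])
    }

    // Called after a full traversal (initial miss or false positive):
    // append the true result if it isn't already a candidate.
    fn record(&mut self, ray_hash: u64, node_index: usize) {
        let entry = self.table.entry(ray_hash).or_default();
        if !entry.contains(&node_index) {
            entry.push(node_index);
        }
    }
}
```

As noted, this trades cheaper false-positive recovery on later rays for a more expensive check per prediction, since every candidate must be tested.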

This could be something we try after getting the base feature implemented.

jalberse commented 1 year ago

Okay, I guess we will pass an Option<Arc<Mutex<HashMap<BvhId, Predictor>>>> to hit() (or something like that).

The BvhId will correspond to an Id given to each Bvh on construction. The Bvh will store its ID. In Bvh::hit(), if we pass any predictors, we will lock the mutex, and see if this Bvh has any Predictors available. If so, we predict. If not, we act as though None were provided.

I suppose the design isn't actually that complicated. Checking to see if this Bvh has a predictor is O(1). The worst part is in constructing the scene, we have to remember to create the Predictor for each Bvh. I suppose the Bvh constructor could take the set of predictors as a parameter, adding itself as another entry with a new predictor. This alleviates some of the burden from the caller.
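A sketch of that constructor-side registration, with BvhId simplified to a plain counter rather than the uuid-style ids that show up in the stats dumps later:

```rust
use std::collections::HashMap;

// Hypothetical shapes; the real Bvh constructor takes geometry and times.
type BvhId = u64;

#[derive(Default)]
struct Predictor;

struct Bvh {
    id: BvhId,
}

impl Bvh {
    // The constructor registers a fresh Predictor under its own id, so the
    // caller can't forget to create one per BVH.
    fn new(next_id: &mut BvhId, predictors: &mut HashMap<BvhId, Predictor>) -> Self {
        let id = *next_id;
        *next_id += 1;
        predictors.insert(id, Predictor::default());
        Bvh { id }
    }
}
```

Looking up whether a given Bvh has a predictor is then an O(1) map probe keyed on `self.id`.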

jalberse commented 1 year ago

I guess if we go down the 1:1 BVH:Predictor path, then you could potentially have multiple skipped traversals. You predict at the top-level BVH, skip down, and potentially hit another BVH. Within that BVH, you make a prediction, skip down (or not), and eventually find geometry. It must be this way - if a prediction index from a nested BVH were applied to the top BVH, you would simply access incorrect nodes.

I think our initial approach of storing the parent BVH node in the HitRecord is incorrect, then. It's unclear - parent in which BVH? The hit record is passed up through successive BVHs.

Instead, I think Child::Hittable should contain the dyn Hittable and the parent BvhNode together. The parent BVH node index will be calculated the same way that it is for the BvhNodes.

We still need the leaf node's index (i.e. the parent of the Child::Hittable we eventually reach, be that geometry or a nested BVH) back at the root in Bvh::hit(). But we can pass that up the BvhNode::hit() calls alongside the HitRecord, rather than as part of the HitRecord. That way, it remains clear that the HitRecord contains information for rendering the final hit geometry, and the parent index is just kept within the current Bvh/BvhNodes so that we can use it for go-up-level.

I think with this, and with the previous comment, there's a clear path forward.
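The Child shape described above might look like this (trait and variant names assumed, and the trait body elided):

```rust
use std::sync::Arc;

// Stand-in for the renderer's real Hittable trait.
trait Hittable: Send + Sync {}

// Each child of a BvhNode is either a pair of interior-node indices into the
// Bvh's node Vec, or a leaf hittable carrying its parent BvhNode's index so
// go-up-level can start from the leaf.
enum Child {
    Interior { left: usize, right: usize },
    Hittable { object: Arc<dyn Hittable>, parent: usize },
}

fn parent_of(child: &Child) -> Option<usize> {
    match child {
        Child::Hittable { parent, .. } => Some(*parent),
        Child::Interior { .. } => None,
    }
}
```

Because the parent is an index, not a reference, the enum stays within the Vec-backed ownership model the BVH already uses.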

jalberse commented 1 year ago

Alright, implementing the previous two comments resulted in:

(1) a slower render than without predictors (40 seconds -> 144 seconds) and (2) significant visual artifacts

I believe the visual artifacts are likely due to what I'm calling the "missing occlusion" problem that the paper mentions. My guess is that rendering these large implicitly defined spheres is a worst-case scenario for that bug. It may not be apparent for small triangles, which are what the original paper (and most ray tracers) use. Hopefully I can implement some triangle-based collision and basic OBJ loading, and try to render e.g. a Stanford bunny to see if the artifacts persist.

The slowdown could be for any number of reasons:

I'll want to profile to see what the slow-down is, I suppose. We could try pre-allocating space in the predictor table, maybe.

For now I will try rendering with a larger scene and just check out the results.

test_predictions

jalberse commented 1 year ago

I'll also note on the performance - we could see a case where, early in the render, HRPP is less performant because it needs to build the table, but more performant later in the render, where the table is already built and it can make predictions. Just a possibility.

jalberse commented 1 year ago

test_predictions_large

Here's a render with the full scene using predictions for the BVHs. There's a BVH for the cubes and one for the spheres, held in a HittableList with the light and the other spheres. There are no nested BVHs.

I think it would be worth trying to get in basic Obj/Tri support. No UV mapping or anything - just render Tris with some SolidColor material. I'd like to see what the artifacting looks like with planar, small geometry. I reckon the algorithm just doesn't play well with relatively large implicit surfaces like we have here.

jalberse commented 1 year ago

Also realized I was testing without a top level BVH for the scene, just for the spheres and cubes. Adding a top-level BVH to the scene...

let world_bvh = Bvh::new(world, 0.0, 1.0);
let mut world = HittableList::new();
world.add(Arc::new(world_bvh));

It led to an interesting result

few_objects_nested_bvh

When I have just these 4 objects (the light, orange sphere, cubes BVH, and spheres BVH), the scene renders as expected with a top-level BVH, just as it would without HRPP (this scene doesn't include the other objects, i.e. the earth and metal sphere).

perlin_blocking_spheres_nested_BVH

Adding the white sphere with perlin noise, we fail to render the cluster of spheres, which is still in the scene.

Rendering with just the perlin sphere, sphere BVH, and the light (all in a top-level BVH), we do see both again:

perlin_plus_sphere_cluster

Maybe this is a bug with predictions? Maybe it's a bug with nested BVHs rather than HRPP?

jalberse commented 1 year ago

Rendering on master with a nested BVH performs fine; we don't lose any geometry. So I'm confident there is an issue with HRPP causing the missing spheres.

Maybe this is an issue with particularly large Hittables (like the nested BVH) or a shallow BVH for the master scene.

Also, while my scenes use spheres, which are a bit unusual, I think being able to render large implicitly defined surfaces is important for an algorithm. They're useful for e.g. rectangular lights, which you don't want to triangulate even if the rest of your scene is triangulated. So if HRPP can't handle these large surfaces well, it could be an issue. Of course, for shadow rays that's not an issue - but my ray tracer doesn't use shadow rays, so maybe it's not a good fit here.

jalberse commented 1 year ago

Note to self:

cargo run -- triangles --image-width 500 --aspect-ratio 1.0 1.0 --samples-per-pixel 100 --cam-look-from 278.0 278.0 -800.0 --cam-look-at 278.0 278.0 0.0 --cam-vertical-fov 40.0 > test.ppm

jalberse commented 1 year ago

I have triangular meshes rendering now. I don't have UVs properly set up, but we don't need that for this feature.

Run with BVH, without predictions:

cargo run --release -- bunny --image-width 800 --aspect-ratio 1.0 1.0 --samples-per-pixel 500 --cam-look-from 278.0 278.0 -800.0 --cam-look-at 278.0 278.0 0.0 --cam-vertical-fov 40.0 > test.ppm

159.85 seconds

bunny_no_predictor

jalberse commented 1 year ago

While the version with predictions renders, I do wonder if it would be beneficial to simply have multiple prediction tables, one for each thread, to avoid locks and the cost incurred by thread synchronization.

Since I apply a tile-based parallelization technique, co-local rays are likely to be handled within the same thread. So, the cost of coordinating threads likely outweighs any benefits you get from predictions made on another thread.

I think trying to have a prediction table for each thread could be worthwhile.

First, let's look into correctness. Then, try pre-allocating space for the prediction table. Then, try measuring performance and collecting statistics. Then, we can look into this and other performance improvements.
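A sketch of the per-thread idea: each worker owns its table outright (no Mutex), and per-thread tables and statistics are merged after the join. The hash here is a stand-in, not the real ray hash:

```rust
use std::collections::HashMap;
use std::thread;

// Each thread builds its own table; no locking, at the cost of each thread
// missing predictions warmed up by the others. With tile-based parallelism,
// co-local rays mostly land on the same thread anyway.
fn render_tile_with_local_table(seed: u64, rays: u64) -> (HashMap<u64, usize>, u64) {
    let mut table = HashMap::new();
    let mut table_hits = 0;
    for i in 0..rays {
        let hash = (seed * 1000 + i) % 16; // stand-in for the ray hash
        if table.contains_key(&hash) {
            table_hits += 1; // a prediction was available
        } else {
            table.insert(hash, i as usize);
        }
    }
    (table, table_hits)
}
```

The join point is where the per-thread statistics would be summed into the final report.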

jalberse commented 1 year ago

Double-checking the original paper, it doesn't mention the time cost of thread synchronization for this method. They focus on the number of computations skipped. I believe their intention was a limit study examining the number of computations that can be skipped, with the expectation that specialized hardware would improve runtime.

So, I think experimenting with independent predictor tables for each thread, and examining if we can get a similar percent of computations skipped without parallelization costs, is a good idea.

jalberse commented 1 year ago

And, a render with a predictor. Good news: The artifacting is not present! I was correct that those large implicit surfaces posed a challenge for the algorithm. That's an interesting find, but we can move forward on performance comparisons etc with triangular meshes, and recognize that as a self-contained talking point.

There is one artifact on the left ear, I think. Maybe this could be improved with more samples.

Statistics for BVH/Predictor BvhId(770a7355-a2c1-4b2d-a18f-427f89fe1f8b):
True positive predictions: 5,421,285
False positive predictions: 1,722,049,355
No predictions: 245,564,428
Table size (number of entries): 3,459,136

(I should add these to the output.)
Total rays: 1,973,035,068
True positive %: 0.2%

Render time: 1082.7439513s

bunny_predictor

jalberse commented 1 year ago

The true positive % is very low, at < 1% skipped. The paper boasts more, 30-60% skipped. They do mention such low rates for hit-any in complex lighting scenarios, though.

I am using a go-up-level of 1 and a hash precision of 5 bits. The table they reference uses a go-up-level of 0 and a hash precision of 6. So, maybe that's better? Although the individual sections claim a go-up-level of 1 and a hash precision of 5 bits is the best balance. I will try 0 and 6 and see what the results are; it should also let us compare to their table more directly.

It's worth noting that my ray tracer does NOT bias towards light sources and downweight (as Shirley's third book outlines). I maintain a truly random distribution of secondary rays. This reduces secondary ray locality, so I may see less skips.

This test is also with the bunny in its own BVH; the scene itself is not in a BVH. Maybe that affects things. Could try putting the whole scene into a BVH.

We can also double-check against the original code to see how they count certain stats - are we counting the same thing?

jalberse commented 1 year ago

Okay, another render with 5 bits of precision, 1 go-up-level:

Statistics for BVH/Predictor BvhId(9eac59cd-1c77-4bb4-8502-434627a8f812):
Total rays into BVH::hit(): 1,972,519,861
True positive predictions: 3,993,027 (ratio 0.0020243279)
False positive predictions: 1,714,591,962 (ratio 0.86923945)
No predictions: 253,934,872 (ratio 0.12873629)
Table size (number of entries): 3,458,451

Render time: 1033.4911907s

That's a LOT of false positives. The table in the paper used 6 bits of precision; I think we should try that, as it should be stricter about collisions and result in more true positives and fewer false positives.

jalberse commented 1 year ago

The same render, with 6 bits of precision. We see the true positive ratio double, though it's still < 1%. The false positive ratio reduces significantly, moving to no-predictions. The number of false predictions still seems too high to me, though. I would think false positives would be relatively rare, as long as only co-local rays are getting collisions and non-co-local rays are not. The trend of these changes makes sense for increasing hash bit precision, but the magnitudes are still off.

Statistics for BVH/Predictor BvhId(e6c6fcda-2527-4290-8846-a84f1e1e056d):
Total rays into BVH::hit(): 1,971,963,887
True positive predictions: 8,451,503 (ratio 0.0042858305)
False positive predictions: 1,358,293,610 (ratio 0.6888025)
No predictions: 605,218,774 (ratio 0.30691168)
Table size (number of entries): 36,449,007

Render time: 1152.9752075s

jalberse commented 1 year ago

And, the same render with go-up-level 0 and 6 bits of precision. It's about what I expect - a slightly lower true positive prediction ratio (as we have one fewer leaf node we could try to hit), and a slightly smaller table size (actually, I don't quite understand why this results in a smaller table size, but that's what the paper shows as well).

Statistics for BVH/Predictor BvhId(07f1bd99-6216-4dc6-983e-2699d75e36ec):
Total rays into BVH::hit(): 1,968,623,368
True positive predictions: 7,456,560 (ratio 0.0037877026)
False positive predictions: 1,365,786,373 (ratio 0.6937774)
No predictions: 595,380,435 (ratio 0.3024349)
Table size (number of entries): 36,193,347

Render time: 1112.3398296s