szellmann / visionaray

A C++-based, cross platform ray tracing library
https://vis.uni-koeln.de/visionaray.html
MIT License

Visionaray: Is it a renderer or a ray tracing library? #10

Closed stephane-lb closed 8 years ago

stephane-lb commented 8 years ago

Hi,

First of all, thanks for your amazing work. I am sad to have missed you at LGM in London; I did not know about this meeting! I am really interested in your library for an experiment where I just want to launch a bunch of rays at an object and then know when they exit its bounding box.

Can Visionaray do this?

Thanks, Stephane-lb

szellmann commented 8 years ago

Hi Stephane, thanks, and yes, Visionaray is a library, with its primary (but not sole) focus being on image generation.

If you provide me with some more details I could probably point you in the right direction to achieve your goal. What type of result do you want to generate (an image, an array of booleans; is it a 2D array)? Where do the rays originate from? Do you want to do this on a GPU?

stephane-lb commented 8 years ago

Hi Stefan,

Thanks for your answer.

My goal is to use a randomly generated height field, launch a lot of rays from a given direction at this patch, and then store, for each ray sent, the output direction and the number of bounces on the surface before it "goes out".

So to answer your questions:

*What type of result do you want to generate (an image, an array of booleans; is it a 2D array)?* At first, it will be a std::array<std::pair<uint, Vec2>>.

*Where do the rays originate from?* I want to generate them based on the assumption that a ray is defined by an origin and a direction. So I want to be able to write my own way of generating them; I want to test different sampling techniques.

*Do you want to do this on a GPU?* I really need to use the GPU because I plan to launch a lot of rays.

Thanks, Stephane-lb

szellmann commented 8 years ago

Visionaray currently has the limitation that it comes with vendor lock-in: the only GPU vendor supported for now is Nvidia. This is going to change, but unfortunately not with the coming version 0.1.

If you can make do with that, I would recommend you just write a simple CUDA kernel that generates the rays, something along the lines of this (untested, probably needs some tweaking before it works!):

#include <thrust/device_vector.h>

#include <visionaray/math/math.h>

using namespace visionaray;

struct result_t
{
    int num_bounces = 0;
    // more members..
};

__global__ void trace_rays(result_t* results)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    basic_ray<float> r;
    // generate rays based on index (?)
    if (index == 666) {
        r.ori = vec3(6, 6, 6);
        r.dir = normalize(vec3(-6, -6, -6));
    }
    else
    {
        // ...
    }

    int num_bounces = 0;

    // Traverse ray here, e.g. count bounces

    // How much you can benefit from Visionaray
    // depends on what you do with the rays here

    // Write the result to GPU global memory
    result_t res;
    res.num_bounces = num_bounces;
    results[index] = res;
}

int main()
{
    int const NumRays = 1024; // for example

    // Result array on the device
    thrust::device_vector<result_t> d_result;
    d_result.resize(NumRays);
    auto d_result_ptr = thrust::raw_pointer_cast(d_result.data());

    // Trace rays
    trace_rays<<<NumRays, 1>>>(d_result_ptr);

    // Copy result to CPU
    thrust::host_vector<result_t> h_result(d_result);
}

Inside the CUDA kernel you would implement the traversal code. Visionaray has lots of functions that you can benefit from and wouldn't have to write anew, but which functions you actually need to implement the traversal depends on your use case.

Just let me know if you need any help with this. If you think that certain functionality might be useful to add, or that the API should be changed somehow, please tell me. Visionaray is a relatively new library; the API has matured over the last two years, but the dominant use case is image generation.

stephane-lb commented 8 years ago

Hi Stefan,

Thanks for your help, this looks cool! I have a lot of questions; I am not sure if they come from my misunderstanding of the Visionaray API or of CUDA. I hope you can help me with them.

By looking at your "multi_hit" example, I understood that Visionaray provides a way to say: if CUDA is available do this, otherwise do that. I think I am going to do something like this. Can I generate my rays on the CPU and give them to CUDA? And could you confirm that inheriting from viewer_type is not needed?

Thanks for your help,

Stephane

szellmann commented 8 years ago

Hi Stephane,

Can I generate my rays on CPU and give them to CUDA ?

Absolutely. Extending the example from above, it would look something like this:

typedef basic_ray<float> R;

std::vector<R> h_rays;
// generate rays here

// copy to GPU
thrust::device_vector<R> d_rays(h_rays);

// now pass the rays to the kernel
trace_rays<<<...>>>(thrust::raw_pointer_cast(d_rays.data()));

I wasn't sure if you preferred ray generation on the CPU or the device, so I just showed the latter.

In the example you gave to me, I am not understanding how you send the mesh/BVH to the graphic card ?

I wasn't sure how you stored the height field (triangles, texture with z-values, ...), so I made no assumptions about how your traversal might look. Since you mention BVHs, you probably have triangles? In that case, you basically build up a list of triangles, e.g. like so:

typedef basic_triangle<3, float> Tri;
std::vector<Tri> triangles(NumTriangles);

for (size_t i = 0; i < NumTriangles; ++i)
{
    // Triangles are stored in the format v1, e1, e2
    // not sure, this may change in the future, it's a bit odd..
    triangles[i].v1 = vec3(/*vertex one*/);
    triangles[i].e1 = vec3(/*vertex two*/) - triangles[i].v1;
    triangles[i].e2 = vec3(/*vertex three*/) - triangles[i].v1;
    // unique primitive id (mandatory for BVHs)
    triangles[i].prim_id = (int)i;
    // geometry id (optional, use e.g. as index into material lib)
    triangles[i].geom_id = <some int>;
}

EDIT: Ouch, forgot to mention that each triangle needs a unique primitive id!

Then build the BVH on the host:

auto h_bvh = build<index_bvh<Tri>>(triangles.data(), triangles.size());

// now copy to the GPU.
// this has similar semantics to the thrust vectors
// copy to device by constructing a device BVH
// from a host BVH
cuda_index_bvh<Tri> d_bvh(h_bvh);

index_bvh is a host type, cuda_index_bvh is a CUDA device type. There's some introductory text that I wrote in the accelerators section of the wiki.

Now the BVH is built and you have a copy on the GPU. Then you must pass pointers to the BVH to your CUDA program / rendering kernel (you wouldn't pass the whole BVH to the kernel each time, because this would always require a costly PCIe copy).

In Visionaray, and in Visionaray's default kernels, BVHs are treated as if they were primitives. You typically have a list of primitives, and functions like e.g. closest_hit or any_hit take a ray as their first parameter and iterators (or pointers) to a range of primitives (i.e. the right-open interval [primitive_first..primitive_last)). It then doesn't matter if the primitive is a triangle or a BVH containing lots of triangles. That way, you can treat several meshes as independent primitives, which is sometimes desirable.

You can pass a pointer to a single BVH to the Visionaray program, but I wouldn't advise it. It's simpler to make a list containing one BVH, like so:

// The BVH "pointer" type is a "BVH ref" (a POD type that stores pointers):

typedef cuda_index_bvh<Tri>::ref_type Ref;

// construct the BVH refs (to device memory) on the GPU...
thrust::device_vector<Ref> bvh_refs;
bvh_refs.push_back(d_bvh.ref());

// and then pass a pointer to the list to the CUDA kernel
trace_rays<<<...>>>(
    rays /* see above */,
    thrust::raw_pointer_cast(bvh_refs.data()), // first
    thrust::raw_pointer_cast(bvh_refs.data()) + bvh_refs.size() // last
    );

// (the kernel needs to take the Ref parameters of course)
// __global__ void trace_rays(..., Ref* begin, Ref* last);

EDIT: I corrected the example a bit, the bvh ref list must be on the device, so thrust::device_vector instead of std::vector

I haven't tested the code, but you can find this in viewer.cpp or in the multi-hit example.
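
To tie the pieces together: inside such a kernel, counting bounces could boil down to repeated closest_hit calls over the BVH ref range, as described above. Take this as a sketch only; the hit record member name (hit) is an assumption about the API, and MaxBounces is a made-up cap:

__global__ void trace_rays(R* rays, Ref* begin, Ref* last, result_t* results)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    R r = rays[index];
    int num_bounces = 0;
    int const MaxBounces = 16; // made-up cap so the loop terminates

    auto hit_rec = closest_hit(r, begin, last);
    while (hit_rec.hit && num_bounces < MaxBounces)
    {
        ++num_bounces;
        // construct the next (bounced) ray from the hit record here,
        // then trace again
        hit_rec = closest_hit(r, begin, last);
    }

    result_t res;
    res.num_bounces = num_bounces;
    results[index] = res;
}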

The decision "use CUDA vs. don't use CUDA" is made by either compiling with the CUDA compiler (nvcc) or by compiling with a compiler that only emits x86 code. nvcc defines the macros __CUDACC__ (code is being compiled with nvcc) and __CUDA_ARCH__ (code is being compiled with nvcc and PTX device code shall be emitted). With these macros you can identify whether you compile with the CUDA compiler at all, and whether you compile code that is meant for the host or for the device. They are not Visionaray specific. You should use them to write platform-independent code, e.g. as sketched below.
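
For illustration, a common generic pattern built on these macros (plain CUDA, not Visionaray API; saturate01 is a made-up helper):

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

HOST_DEVICE inline float saturate01(float x)
{
#ifdef __CUDA_ARCH__
    // device code path: use the CUDA intrinsic
    return __saturatef(x);
#else
    // host code path: plain C++
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
#endif
}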

viewer_type (i.e. viewer_glut) is a class that comes from the visionaray_common library. visionaray_common is not part of the API; it implements common functionality that the examples, the viewer, etc. share, and this code may change frequently. If I compile binaries to give to someone else, I don't ship visionaray_common. You are not required to use a viewer. In a real-world application, the developer probably has his/her own program with a GUI (or no GUI) and will integrate the Visionaray code with that.

Long answer, hope I formulated it so that it's clear.

Cheers, Stefan

stephane-lb commented 8 years ago

Hi Stefan,

Thank you very much for your answer, it is really helpful! I have 3 new questions, if you are still OK to help. So if I understood well, this is how I should call trace_rays?

    // and then pass a pointer to the list to the CUDA kernel
    trace_rays<<<...>>>(
        thrust::raw_pointer_cast(d_rays.data()), // rays
        // BVH 
        thrust::raw_pointer_cast(bvh_refs.data()), // first
        thrust::raw_pointer_cast(bvh_refs.data()) + bvh_refs.size(), // last
        // Result
        thrust::raw_pointer_cast(d_result.data()) 
    );

and that should map to a CUDA kernel like this one?

__global__ void trace_rays(R* ray, Ref* begin, Ref* last, result_t* results)

I am confused, I do not understand how to access my rays in the kernel then... Should it be something like this?

__global__ void trace_rays(R* rays, Ref* begin, Ref* last, result_t* results)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    basic_ray<float> r = rays[index]; 

    ...
}

And to finish, what should I put in place of the ...? I do not understand these 2 parameters:

     trace_rays<<<...>>> // in your first example you said trace_rays<<<NumRays, 1>>>

Thanks for your help and the time you are spending with me,

Stephane-lb

szellmann commented 8 years ago

Hi Stephane,

So if I understood well, this is how I should call trace_rays ?

Yes, the signature of the CUDA kernel and the way you suggest calling it look right. CUDA kernels are similar to C functions; they can have an arbitrary (I think, I never pushed this to the limit :-) ) number of parameters (too many may result in excessive shared memory usage though). Since CPU and GPU don't share address spaces, you cannot pass references but have to stick with pass-by-value or passing device pointers.
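
For illustration (generic CUDA, not Visionaray-specific; params_t is a made-up example type): small POD arguments are copied by value at launch, while bulk data goes through device pointers:

struct params_t
{
    int   max_bounces;
    float epsilon;
};

// params is copied by value to the device at kernel launch;
// rays and results must point to memory allocated on the device
__global__ void trace_rays(params_t params, R* rays, result_t* results);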

I am confused, I do not understand how to access to my ray in the kernel then ... Should it be something like this ?

__global__ void trace_rays(R* rays, Ref* begin, Ref* last, result_t* results)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    basic_ray<float> r = rays[index]; 

    ...
}

Yes, that would be an option (probably the most obvious one). Then you parallelize over the array of rays. The statement

blockIdx.x * blockDim.x + threadIdx.x;

basically means: here's this kernel; please execute it in parallel on gridDim.x blocks (gridDim is specified in the <<<...>>>, see below). You can globally identify each individual thread by combining its block ID (blockIdx.x) with its position inside the block: blockDim.x is the number of threads per block, and threadIdx.x is the local thread index within that block. So to find out which thread is currently active, you do the above calculation. Note that in reality it is a bit more intricate: you can e.g. have 2D scheduling layouts (threadIdx.x and threadIdx.y), and the GPU's multiprocessors can run several blocks in parallel. Also, you typically have to implement some range checking (if (index < NumRays) process(rays[index]);), because you usually launch a number of threads that is a multiple of the block size, which may not match the length of your array containing the rays (the surplus threads then idle because they fail the range check).
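
Spelled out as a sketch (num_rays passed as an extra kernel parameter; generic CUDA, not code from the library):

__global__ void trace_rays(R* rays, int num_rays, result_t* results)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    if (index >= num_rays)
        return; // surplus thread in the last block, nothing to do

    R r = rays[index];
    // traverse r and write results[index] here
}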

And to finish, what should I put to replace the ..., I do not understand these 2 parameters trace_rays<<<...>>> // in your first example you said trace_rays<<<NumRays, 1>>>

Ah, forget about what I wrote first. Here you have to tell the driver how many threads to schedule on how many cores. Basically (<<<Param1, Param2>>>), Param1 is the number of blocks that your "grid" consists of, and Param2 is the number of threads per block. The CUDA scheduler then launches threads to process the grid. With your use case a 1D grid is probably best (that's why I wrote NumRays, 1). You have to play around with those parameters to find the most efficient configuration for your hardware. Check out this stackoverflow question, it has a brief but thorough enough description of the concept.

Thanks for your help and the time you are spending with me,

You're welcome, just keep asking :-)

Cheers, Stefan

szellmann commented 8 years ago

And to finish, what should I put to replace the ..., I do not understand these 2 parameters :

... for clarification: the parameters are dim3's (a plain integer converts implicitly, so <<<NumRays, 1>>> does compile, but it means NumRays blocks of one thread each). Spelled out, you'd rather have:

dim3 grid_size(NumBlocks, 1, 1);           // number of blocks; it's 1D, so y = z = 1
dim3 block_size(NumThreadsPerBlock, 1, 1); // threads per block
trace_rays<<<grid_size, block_size>>>(/* ... */);

NumThreadsPerBlock == 1, as I suggested in my first answer, would obviously be a suboptimal configuration. Depending on your hardware, NumThreadsPerBlock == 8 or 16 (EDIT: or 64, 128 or 256, I'm too used to the 2D use case..) may be a good idea, but you have to try (with the help of the occupancy calculator spreadsheet).

NumBlocks then logically depends on NumThreadsPerBlock: up to a few idle threads, NumRays = NumBlocks * NumThreadsPerBlock. As you can see, to keep the device occupied it is best to process many rays :-)
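
The usual way to compute NumBlocks for a given block size is integer ceiling division, so that the grid covers all rays with at most one partially idle block:

int const NumThreadsPerBlock = 256; // example value, tune for your hardware
int const NumBlocks = (NumRays + NumThreadsPerBlock - 1) / NumThreadsPerBlock;
trace_rays<<<NumBlocks, NumThreadsPerBlock>>>(/* ... */);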

szellmann commented 8 years ago

Closing this. Feel free to reopen if there are further questions, or get in touch via gitter.