owl-project / owl

http://owl-project.github.io

Device-allocated buffers are not accessible on the host after raytracing finishes #151

Closed: srogatch closed this issue 2 years ago

srogatch commented 2 years ago

I'm not sure if this is at all possible in OptiX, but if it is, exposing it in OWL would be nice. My raygen program allocates memory on the device with malloc(). I would like to return these allocations to the host through an array of pointers, one pointer per ray. I created an OWL buffer on the host and passed it to the raygen program, which assigns the pointers into the array items. However, after raygen finishes, if I try to cudaMemcpy() from a pointer taken out of the array, I get error 1: invalid argument. Is it because memory allocated on the device in raygen is somehow disposed of when raygen finishes? Is there any workaround? Pre-allocating a definitely-sufficient amount of memory up front looks like overkill (it would amount to allocating as much as the GPU's total memory).
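For illustration, a minimal sketch of the device side of what I'm doing (RayGenData, fbSize, and rayResults are hypothetical names standing in for my real code):

```cuda
// raygen sketch: allocate per-ray memory with device-side malloc()
// and publish the pointer through an OWL buffer of void*
OPTIX_RAYGEN_PROGRAM(simpleRayGen)()
{
  const RayGenData &self = owl::getProgramData<RayGenData>();
  const owl::vec2i pixelID = owl::getLaunchIndex();
  const int rayIdx = pixelID.x + self.fbSize.x * pixelID.y;

  // hypothetical per-ray allocation; the size is only known at trace time
  float *hits = (float *)malloc(16 * sizeof(float));
  // ... trace rays, write results into 'hits' ...
  self.rayResults[rayIdx] = hits;  // host reads this pointer after launch
}
```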

ingowald commented 2 years ago

Huh; interesting question - I didn't even know you could do device-side malloc in a raygen in the first place. Reading up on device-side malloc, it does say that memory allocated this way lives for the lifetime of the context, not the launch, so if it worked at all I'd expect the memory to stay around. Now the question, though: how did you read those device-allocated pointers back to the host? Which buffer type (managed vs. device vs. pinned? and which element type?), and did you do an owlBufferDownload()? There are two or three different ways you could theoretically do that, and a good chance that some combination of inputs is just not handled right now. Easy to fix, probably, just have to know what to look out for :-)

srogatch commented 2 years ago

I think device-side malloc() leaves us with the only option for its allocated bytes: a device buffer. But a pointer to that buffer can be stored in any of the OWL buffer types: device, pinned, or managed (I'd prefer pinned for now, to save device memory). The element type is void*, but byte-level operations would suffice too. I tried both OWL_RAW_POINTER and OWL_BUFFER_POINTER, but cudaMemcpy() fails when I take the pointer from such a buffer. No, I didn't do owlBufferDownload(), because I used a pinned-memory buffer, so I assumed it contained valid pointers on the host, as the device writes host memory directly. The pointers looked valid in cuda-gdb, but something happened to the data they point to.
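Roughly, my host-side setup is the following (a sketch; owlHostPinnedBufferCreate and owlRayGenSetBuffer are the actual OWL calls, while rayResults and rayGen are placeholder names):

```cuda
// host sketch: one pointer slot per ray, in host-pinned memory, so the
// device can write it and the host can read it without an explicit download
OWLBuffer rayResults = owlHostPinnedBufferCreate(
    context, OWL_RAW_POINTER, fbSize.x * (size_t)fbSize.y);
owlRayGenSetBuffer(rayGen, "rayResults", rayResults);
```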

ingowald commented 2 years ago

> I think device-side malloc() leaves us with the only option for its allocated bytes: device buffer.

Not really - the data you're allocating will be on the device, but the buffer only stores the pointers to that memory, so the buffer itself can be managed or host-pinned; what your pointers point to will be on the device, for sure, but the pointers will not point into that buffer. That said, you yourself say you used a pinned-mem buffer, not a device buffer!?

OWL_BUFFER_POINTER cannot work - that data type is explicitly for something that is an OWLBuffer on the host and becomes a device pointer on the device - so it only makes sense when uploading stuff; downloading doesn't make sense, because the device can't create OWLBuffers. So probably what happened when you tried this type is that it just silently dropped it, because it decided that this doesn't make sense (and yes, whether that should be an explicit error is an interesting question, but let's table that).

OWL_RAW_POINTER should work, though; that's nothing but a 64-bit int, so it should work just fine. If the buffer you're writing these pointers into is either managed or pinned, then you also wouldn't need to do an owlBufferDownload(), because those two buffer types are "shared"; if you use a device buffer you'd need to do a download.
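In code, the read-back I'd expect to work with a pinned buffer of OWL_RAW_POINTER looks roughly like this (a sketch; names follow the example above):

```cuda
// after the launch, the pinned buffer is directly readable on the host
owlRayGenLaunch2D(rayGen, fbSize.x, fbSize.y);
cudaDeviceSynchronize();  // make sure the raygen has finished writing

void **rayResultsHost = (void **)owlBufferGetPointer(rayResults, 0);
// each entry is a device pointer produced by the raygen's malloc();
// its contents come down with a plain device-to-host copy
float firstRay[16];
cudaMemcpy(firstRay, rayResultsHost[0], sizeof(firstRay),
           cudaMemcpyDeviceToHost);
```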

Hm. The first thing I'd try is to check whether the pointers actually make it back down to the host in a valid way: in your raygen, just printf() the first N such pointers, then on the host printf() the same as well, and check that they are the same and look "sensible".
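I.e., something like this (sketch):

```cuda
// device side, inside the raygen, right after storing the pointer:
if (rayIdx < 4) printf("dev  %d %p\n", rayIdx, self.rayResults[rayIdx]);

// host side, right after the launch completes:
for (int i = 0; i < 4; i++)
  printf("host %d %p\n", i, rayResultsHost[i]);
```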

If that fails, any chance you could share that code with me (via email if you don't want it to be publicly seen)?

srogatch commented 2 years ago

Yes, I allocated a pinned buffer on the host, and the device was writing pointers into that buffer. Initially I tried OWL_RAW_POINTER, and only then resorted to OWL_BUFFER_POINTER just in case it worked. I'll try to modify a sample to reproduce this.

srogatch commented 2 years ago

Please find a reproducer at https://github.com/srogatch/owl/tree/repro/device-malloc (sample 06 is modified). I also noticed that the last y row is never handled in a launch, so I had to change the following

```cuda
// const int pixelIdx = pixelID.x+self.fbSize.x*(self.fbSize.y-1-pixelID.y);
const int pixelIdx = pixelID.x+self.fbSize.x*pixelID.y;
```

for the first 4 pointers to get printed. That's something for you to investigate: pixelIdx < 4 is never true in the raygen program without the above change. The pointers printed on host and device are identical, but cudaMemcpy fails:

```
./sample06-rtow-mixedGeometries 
#owl.sample(main): owl example './sample06-rtow-mixedGeometries' starting up
#owl.sample(main): creating the scene ...
#owl.sample(main): created scene:
#owl.sample(main):  num lambertian spheres: 204
#owl.sample(main):  num dielectric spheres: 14
#owl.sample(main):  num metal spheres     : 36
#owl.sample(main): building geometries ...
#owl.sample(main): launching ...
 0 0x7fd3909acd50  1 0x7fd3909ace40  2 0x7fd39095f1b0  3 0x7fd39050fa40 #owl.sample(main): done with launch, writing picture ...
#owl.sample(main): written rendered frame buffer to file s06-rtow-mixedGeometries.png
GPU error 700: an illegal memory access was encountered . /home/serge/work/Upwork/raytracing/src/voxels_holes/external/owl/samples/cmdline/s06-rtow-mixedGeometries/hostCode.cpp:593
 0 0x7fd3909acd50  1 0x7fd3909ace40  2 0x7fd39095f1b0  3 0x7fd39050fa40
```
ingowald commented 2 years ago

K; just started looking at it. Note that the fact that not all pixelIDs get called may look surprising at first, but isn't: due to the warp nature of CUDA, the launch sizes have certain constraints (i.e., you can't do a 3x5 launch, only, say, an 8x8 or 32x32 one), so the launch is typically a bit bigger than the actual number of pixels on screen. Then, depending on how you enumerate your pixels, either the first few or the last few rows of pixels will never get called on... so that part is OK, but I'll look into the rest right now.
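The usual guard for that is to return early from the raygen when the launch index falls outside the image (a sketch):

```cuda
const owl::vec2i pixelID = owl::getLaunchIndex();
// the launch may be rounded up; skip indices outside the actual frame buffer
if (pixelID.x >= self.fbSize.x || pixelID.y >= self.fbSize.y)
  return;
```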

ingowald commented 2 years ago

K; just had a look. The code actually looks totally OK to me, and it seems to be doing most of it right; however, one thing I found is that, at least on my GPU, CUDA doesn't let you allocate enough memory for every pixel's malloc, so at some point the device-side malloc() call returns a nullptr, and writing through that pointer then makes the raygen program crash.
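I.e., the raygen should handle a failed allocation along these lines (sketch):

```cuda
float *hits = (float *)malloc(16 * sizeof(float));
if (hits == nullptr) {
  // device malloc heap exhausted: record the failure instead of
  // writing through a null pointer (which crashes the whole launch)
  self.rayResults[rayIdx] = nullptr;
  return;
}
```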

For those writes before that crash, the pointers actually do look right on both the host and device side, and the values that a cudaMemcpy reads from them are right, too... so at least as far as I can see, I'd currently blame it all on the malloc()s returning null for some pixels - if the raygen crashes before the respective pixels' values are written, then your host side might not see the right entries in the buffer, but on my side they do look correct, in particular if I catch and handle the nullptr from malloc. (BTW: IIRC there's some CUDA call to set the size of the device-side malloc heap, but it's been a while since I used that.)

Let me know if that fixes it....

srogatch commented 2 years ago

Thanks, @ingowald. I've now switched to another project for a while; when I get back to this one (and that depends on the customer), I'll continue investigating. BTW, what is your video card? Mine is an RTX 2080 8GB, and the frame buffer is only about a million pixels, so malloc shouldn't fail.

ingowald commented 2 years ago

I think you may be confusing total GPU memory with the "allocatable heap" - device-side malloc operates on a dedicated subset of GPU memory called the heap; you can only malloc within this heap, and if you exceed it you get a null pointer, even if there are dozens of gigabytes of "other" GPU memory still available (also see here: https://stackoverflow.com/questions/34794481/cuda-stack-and-heap/34795830).

You can set the size of that heap from the host (forgot the name of the call), but either way, the default value is only 8MB or so, so at 4 bytes per alloc that'd be at most 2M allocs you could do, best case; in practice probably way less, because it'll have to store the size of each alloced block, and maybe even do 16-byte alignment, so you'll easily hit that threshold in a single frame if you do it per pixel. (And yes, I did see null returns from malloc on a pretty beefy card - certainly not out of "regular" GPU memory :-) )
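(For reference, the call in question should be cudaDeviceSetLimit with cudaLimitMallocHeapSize; a minimal sketch, with an arbitrary example size:)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  // grow the device-side malloc heap (the default is only ~8MB); this has
  // to happen before the first kernel launch that uses device-side malloc()
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256ull << 20);  // 256 MB

  size_t actual = 0;
  cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize);
  printf("device malloc heap: %zu bytes\n", actual);
  return 0;
}
```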

ingowald commented 2 years ago

Either way, as far as I can tell this isn't an OWL issue - if I catch the failed mallocs on the device, the pointers do return to the host just fine, and are valid and accessible, too. Closing this; feel free to reopen if I overlooked something.