Sketch of copying a va_list delimited buffer for gpu libc

JonChesterfield commented 2 months ago

GPU libc would like to do things like copy the variadic arguments to printf across to the host and deal with them there. This is really easy if the extent of the structure denoted by a va_list is known and otherwise very difficult. This will work especially well on nvptx where it is flat, and may be an argument to keep it flat on amdgpu (as opposed to changing some of the structure to pointers).

This is roughly a design note intended for @jhuber6 as writing here is more permanent than in slack.

Choices:

va_list as ptr + size in a struct
va_list as ptr + ptr in a struct
write the size in the underlying buffer
write the pointer in the underlying buffer
other?

The two-things-in-a-struct approach is simple and straightforward. va_arg moves the first pointer and leaves the other field alone. va_copy copies both. It's inconsistent with nvcc's use of a single pointer and I have some doubts about burning a second vgpr for it on amdgpu.
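A minimal sketch of the two-field layout, in the ptr + ptr flavour, may make the trade-off concrete. The names fat_va_list, fat_va_arg, fat_va_copy and fat_va_remaining are hypothetical, and the buffer is assumed to be packed with no alignment padding, which a real ABI would not guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical two-field va_list: a cursor plus a fixed end pointer. */
typedef struct {
  const unsigned char *next; /* next argument for va_arg to read */
  const unsigned char *end;  /* one past the last argument byte */
} fat_va_list;

/* va_arg analogue: copy `size` bytes out and advance. Only `next` moves. */
static void fat_va_arg(fat_va_list *ap, void *out, size_t size) {
  assert((size_t)(ap->end - ap->next) >= size);
  memcpy(out, ap->next, size);
  ap->next += size;
}

/* va_copy analogue: a plain struct copy, since nothing is mutated in place. */
static void fat_va_copy(fat_va_list *dst, const fat_va_list *src) {
  *dst = *src;
}

/* Unread bytes: the extent needed to ship the arguments elsewhere. */
static size_t fat_va_remaining(const fat_va_list *ap) {
  return (size_t)(ap->end - ap->next);
}
```

The cost is visible in the struct itself: every va_list carries two pointers, which is the second register mentioned above.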

I think we should go with a single pointer, aimed at the next value for va_arg to dereference, and also store a pointer or size in the buffer it aims at, like:

struct
{
  int earlier;
  size_t len; /* or: void *end; whichever we choose to store */
  int x;      /* <- va_list currently points here */
  double y;
};

va_arg would dereference x and calculate &y as it currently does, but would also read the previous field, update it, and write it back to where x used to be. That is, the size/length field is initially just before the nominal start of the struct, and while iterating through it, va_arg stores updated values into it.

That gives the length of the underlying buffer in a fashion which still works after you've lost track of where the start of the buffer was (a va_list passed to some function) while maintaining consistency with the ptx calling convention of a single void*. I'm expecting store/load forwarding to erase all the stack traffic in the common case. Passing a va_list to another function is free, same as it is today.
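The single-pointer scheme could look roughly like the sketch below. It assumes the stored field is a size_t count of unread bytes (so it shrinks as arguments are consumed) and that arguments are packed without padding. thin_va_list, thin_va_start, thin_va_arg and thin_va_remaining are hypothetical names, not anything clang emits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical single-pointer va_list: it points at the next argument, and a
   size_t holding the number of unread bytes sits immediately before it. */
typedef unsigned char *thin_va_list;

/* Lay out storage as [size_t len][args...] and return the cursor. */
static thin_va_list thin_va_start(unsigned char *storage, size_t arg_bytes) {
  memcpy(storage, &arg_bytes, sizeof arg_bytes);
  return storage + sizeof arg_bytes;
}

/* Read the length field stored just behind the cursor. */
static size_t thin_va_remaining(thin_va_list ap) {
  size_t len;
  memcpy(&len, ap - sizeof len, sizeof len);
  return len;
}

/* va_arg analogue: read the value, advance, then write the updated length
   just behind the new cursor, over bytes that have already been consumed. */
static void thin_va_arg(thin_va_list *ap, void *out, size_t size) {
  size_t len = thin_va_remaining(*ap);
  assert(len >= size);
  memcpy(out, *ap, size); /* must happen before the header overwrites it */
  *ap += size;
  len -= size;
  memcpy(*ap - sizeof len, &len, sizeof len);
}
```

Passing a thin_va_list to another function is still a single pointer copy, and the callee can recover the extent with thin_va_remaining without knowing where the buffer began.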

It has the drawback that va_copy has to work harder than it currently does, since the forward iterator now mutates behind it as it traverses, so va_copy has to memcpy the underlying buffer. It does however know exactly how big the remaining buffer is since that's the information we stored in it. That makes this a good idea if va_copy is rare and dubious if it is common.
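For illustration, va_copy under this scheme might be sketched as follows. The caller-provided scratch array stands in for the stack allocation a real lowering would emit, the stored field is again assumed to be a size_t count of unread bytes, and thin_va_copy is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical single-pointer va_list with a size_t length stored behind it. */
typedef unsigned char *thin_va_list;

/* Read the length field stored just behind the cursor. */
static size_t thin_va_remaining(thin_va_list ap) {
  size_t len;
  memcpy(&len, ap - sizeof len, sizeof len);
  return len;
}

/* va_copy analogue: duplicate the length header plus every unread byte into
   `scratch`, so later va_arg calls on `src` cannot disturb the copy. */
static thin_va_list thin_va_copy(unsigned char *scratch, size_t scratch_size,
                                 thin_va_list src) {
  size_t total = sizeof(size_t) + thin_va_remaining(src);
  assert(scratch_size >= total);
  memcpy(scratch, src - sizeof(size_t), total);
  return scratch + sizeof(size_t);
}
```

The copy is linear in the number of unread bytes, which is why this only looks attractive if va_copy is rare.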

I suspect there is a way to use a single void* without mutating the buffer behind it. The complexity moving into va_copy is a bit of a shame; I will update this if a better scheme occurs to me.

michaelrj-google commented 2 months ago

Be careful when directly copying va_args from one system to another. Printf especially takes pointer arguments (e.g. %s) which may have issues if directly copied.

jhuber6 commented 2 months ago

> Be careful when directly copying va_args from one system to another. Printf especially takes pointer arguments (e.g. %s) which may have issues if directly copied.

So, the current rpc_fprintf thing, which I wrote as a hack because this support didn't exist, has the host hand the pointer back and ask the GPU to copy the string it points to across as well.

The difficult part is definitely va_copy unless we want to modify the abi_tag used by clang.

JonChesterfield commented 2 months ago

Yep, %s is a problem. I believe the rough game plan there is to copy everything to the host, parse it, then if some of it turns out to be %s, do some going back and forth.
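As a rough illustration of the host-side parse step, the sketch below scans a format string and records which argument positions are consumed by %s, so that only those need a second round trip. find_string_args is a hypothetical helper, not part of any libc, and it is deliberately simplified: it understands flags, width, precision, `*` and length modifiers, but not positional arguments such as %1$s.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes the 0-based argument indices consumed by %s
   into `out` (at most `max` of them) and returns how many were found. */
static size_t find_string_args(const char *fmt, size_t *out, size_t max) {
  assert(fmt != NULL);
  size_t arg = 0, found = 0;
  for (const char *p = fmt; *p; ++p) {
    if (*p != '%')
      continue;
    ++p;
    if (*p == '%')
      continue;                                   /* literal percent sign */
    while (*p && strchr("-+ #0", *p)) ++p;        /* flags */
    if (*p == '*') { ++arg; ++p; }                /* width taken from an int */
    else while (*p >= '0' && *p <= '9') ++p;      /* literal width */
    if (*p == '.') {
      ++p;
      if (*p == '*') { ++arg; ++p; }              /* precision from an int */
      else while (*p >= '0' && *p <= '9') ++p;    /* literal precision */
    }
    while (*p && strchr("hljztL", *p)) ++p;       /* length modifiers */
    if (!*p)
      break;                                      /* truncated conversion */
    if (*p == 's') {
      if (found < max)
        out[found] = arg;
      ++found;
    }
    ++arg;                                        /* conversion uses one arg */
  }
  return found;
}
```

For example, with the format "%d %s 100%% %-10.*s %f" the helper reports argument indices 1 and 3, because the `*` precision consumes index 2.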

Implementing va_copy in the above is trivial - you splat an alloca to the stack and emit a memcpy using pointers extracted from the current va_list - it's just ugly, and slower than the ~ free va_copy is under other schemes.

(there's some prior art in this area under gcc, but the variadic extensions there have fragile properties like disabling inlining which are unappealing)

llvm / llvm-project

Sketch of copying a va_list delimited buffer for gpu libc #96212