Fast code depends on PTX rcp.approx.ftz.f32 to be consistent across all architectures

weberke commented 6 years ago

The fast quasiQuoRem(float xf, float yf) in the GPU kernel (see this commit of master) requires that the PTX instruction rcp.approx.ftz.f32 works the same across all CUDA architectures.

What we need is that the rcp instruction never overestimates the value of the reciprocal; i.e., that it is either correct or a slight underestimate. On the GeForce 1080 and 980 Ti this is true.
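To make the "never overestimates" property concrete, here is a minimal host-side Python sketch. It is not the kernel code and it does not model the real hardware instruction: `rcp_round_down` and `rcp_round_nearest` are hypothetical stand-ins for a binary32 reciprocal, and `quotient_estimate` is only an assumption about why an underestimate is the safe direction (an estimated quotient then never exceeds the true one).

```python
import struct
from fractions import Fraction


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def prev_f32(x: float) -> float:
    """Return the next binary32 value below a positive, normal binary32 x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits - 1))[0]


def rcp_round_nearest(y: float) -> float:
    """Hypothetical model: reciprocal rounded to the nearest binary32."""
    return to_f32(1.0 / y)


def rcp_round_down(y: float) -> float:
    """Hypothetical model: binary32 reciprocal that never exceeds 1/y."""
    r = to_f32(1.0 / y)
    if Fraction(r) * Fraction(y) > 1:  # exact rational comparison
        r = prev_f32(r)
    return r


def never_overestimates(rcp, ys) -> bool:
    """True if rcp(y) <= 1/y exactly for every integer y in ys."""
    return all(Fraction(rcp(float(y))) * y <= 1 for y in ys)


def quotient_estimate(x: int, y: int, rcp) -> int:
    """Floor of x * rcp(y); for small x the binary64 product is exact.

    This models exact arithmetic, not the GPU's float32 multiply.
    """
    return int(x * rcp(float(y)))
```

With these models, `never_overestimates(rcp_round_down, range(1, 5000))` holds, while the round-to-nearest model already fails at `y = 3`, because the binary32 value nearest to 1/3 lies slightly above 1/3.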

Based solely on the underlying philosophy behind PTX, I would think that the instruction should work the same across all architectures, but I am too jaded to put full confidence in that.

I've created a branch that compiles two different versions of the gcd kernel: one for devices where the rcp instruction is known to work the way we need it to in order to make the code fast, and the other for devices where it is not known whether it works as we hope, or where it doesn't work as needed (I haven't found any of the latter yet). An array of device names lists the devices on which we have found that the rcp instruction works the way we need; it is queried at runtime to determine which kernel to launch.
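The runtime lookup described here can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in the branch: the device names in `CERTIFIED_DEVICES` are placeholders, `select_kernel` and its return strings are hypothetical, and the use of binary search is an assumption.

```python
from bisect import bisect_left

# Placeholder list; the real names live in the repository's generated header.
# Kept sorted so that membership can be tested with a binary search.
CERTIFIED_DEVICES = sorted(["GeForce GTX 1080", "GeForce GTX 980 Ti"])


def is_certified(name: str, certified=CERTIFIED_DEVICES) -> bool:
    """Binary-search the sorted list of certified device names."""
    i = bisect_left(certified, name)
    return i < len(certified) and certified[i] == name


def select_kernel(name: str) -> str:
    """Pick the fast kernel only for devices known to be safe."""
    return "fast" if is_certified(name) else "conservative"
```

Any device that is not on the list falls through to the conservative kernel, so an unknown architecture costs speed but not correctness.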

The new code is pretty ugly, so it would be really nice if rcp would work the same everywhere.

The big question is: is the conservative version used in the branch the way we should go, or can we rely on PTX to be that consistent? I've searched the NVIDIA documentation for something we can latch on to as a "promise" but haven't found anything yet.

quasney commented 6 years ago

I don't think we can make those assumptions. If we want to control how it rounds, we can use rcp.rm.ftz.f32, but I assume that has lower throughput than rcp.approx.ftz.f32.

Based on "Reciprocal with IEEE 754 compliant rounding":

Rounding modifiers (no default):
.rn mantissa LSB rounds to nearest even
.rz mantissa LSB rounds towards zero
.rm mantissa LSB rounds towards negative infinity
.rp mantissa LSB rounds towards positive infinity

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html

weberke commented 6 years ago

That is what I’m thinking, too. You are right about rcp.rm.ftz.f32 having lower throughput than rcp.approx.ftz.f32; the difference is significant. So I think the software build ought to include a configuration step that looks for attached devices and tests each one to see whether rcp.approx.ftz.f32 actually performs correctly for all inputs in the range 1 <= x, y < 2^22. If the device does perform correctly, it will be added to a list of devices that can support the faster kernel; if a device is not on the list, it will have to use the slower kernel.

I've got working code that can check all cases in under a minute on the 1080--we could incorporate that into the configuration.

weberke commented 6 years ago

After cleaning up my certify code a bit, I went ahead and merged into master the branch that has two versions of the gcd kernel.

We still need a configuration process, and the certifyQuasiQuoRem code needs to be modified to check all devices attached. Maybe @quasney can do that?

The configuration could be either a bash or Python script; its main purpose would be to automatically generate GmpCudaDevice-gcdDevicesRcpNoCheck.h. It would:

read a configuration file with known certified devices
use the certifyQuasiQuoRem executable to certify any attached CUDA devices and add them to the list
save the old GmpCudaDevice-gcdDevicesRcpNoCheck.h
generate a new GmpCudaDevice-gcdDevicesRcpNoCheck.h from the list, making sure the device names have been properly sorted.
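The four steps above could be sketched in Python roughly as follows. Several details are assumptions, not facts about the repository: the configuration file format (one device name per line, `#` for comments), the header layout produced by `render_header`, the `.bak` backup suffix, and the `certify` callable, which stands in for however the certifyQuasiQuoRem executable is actually invoked and reports its result.

```python
import shutil
from pathlib import Path


def read_known_devices(config_path) -> set:
    """Read device names, one per line (assumed format); '#' starts a comment."""
    path = Path(config_path)
    if not path.exists():
        return set()
    names = set()
    for line in path.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.add(line)
    return names


def render_header(names) -> str:
    """Hypothetical header layout: one quoted, comma-terminated name per line."""
    return "".join(f'"{name}",\n' for name in sorted(names))


def regenerate_header(config_path, header_path, attached, certify) -> list:
    """Certify attached devices, back up the old header, write a new sorted one.

    `attached` is an iterable of device names; `certify(name)` returns True if
    the device passed certification (a stand-in for running certifyQuasiQuoRem).
    """
    names = read_known_devices(config_path)
    # Only devices that are not already known need to be certified.
    names |= {n for n in attached if n not in names and certify(n)}

    header = Path(header_path)
    if header.exists():
        # Save the old header next to the new one before overwriting it.
        shutil.copy2(header, header.with_name(header.name + ".bak"))
    header.write_text(render_header(names))
    return sorted(names)
```

Passing `certify` in as a callable keeps the sketch independent of how the executable is run, and it lets the file handling be exercised on a machine with no GPU attached.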

The script should run independently of the make process, so that we can run it on Owens or a p3.xlarge EC2 instance without spending extra time compiling any code.

weberke commented 6 years ago

I realized there was a way to use quasiQuoRem<false>(float&, float) to make the transitional quasiQuoRem (now called quoRem with a bool template parameter that says whether it should be quasi or not). This entails a more extensive certification, which now takes about 3 minutes on the GeForce GTX 1080.

The configuration process will then take a fairly long time on most devices, but I believe it's worth the time to certify that this works.

See this fairly stable commit to see what I've done. I don't think I'll be making any significant changes in the next couple of weeks.

mountunion / ModGCD-OneGPU
