jsoftware / jsource

J engine source mirror
Other
657 stars 90 forks source link

i./I. special code for when x fits in one or a few vectors #144

Open moon-chilled opened 2 years ago

moon-chilled commented 2 years ago

I. from marshall here

moon-chilled commented 1 year ago

I suspect marshall's method for I. is not good for i., as it performs extraneous compute ops, and i. is compute-bound (except when the table blows out of cache). But here is a really cute idea. When x is sufficiently small (and y sufficiently large), do perfect hashing, and perform the hash lookups using only in-register permutes!

moon-chilled commented 1 year ago

avx512, for x w/52 significant bits:

load (1c @ p3) multiply (1c @ p0) permute x2 (2c @ p5) compare (1c @ p5) blend (1c @ p0) store (1c @ p4)

3 cycles per 8 elements, for x up to 16 words (can handle larger x if you build the muxer by hand; quickly explodes, but there's a threshold where it's worth it). Not bad!

(blend can go to p0 or p5--damn well better not go to p5)

moon-chilled commented 1 year ago

avx2 (needs x w/32 significant bits):

load multiply (1c @ p01) permute x2 (2c @ p5) compare (1c @ p01) blend (2c @ p01) store

2 cycles per 4 elements--actually not too shabby (though the range limitation rankles, and getting the hashing right is harder, maybe not always possible, due to the 32-bit permute, and x can only be 4 words). Scheduler can again screw us over.

moon-chilled commented 1 year ago

uica says the scheduler will be nice for avx512. Even accounting for the move-elim erratum, which I forgot about; makes no difference, because p5 is the bottleneck.