Efficient way to initialize variables to pass to generated c code

chart21 commented 2 years ago

Hi again, I am looking for an efficient way to initialize the special variables that the C function generated by Usuba expects. Let's say I want to use AVX (DATATYPE = __m256i) and perform function f with 256 independent 8-bit inputs v0...255 and w0...,255. My function input could look like this:

void f__ (/*inputs*/ DATATYPE x__[8],DATATYPE y__[8], /*outputs*/ DATATYPE z__[8])

If I understand the concept correctly, my variables should hold the following values:

__m256i x[0] = v0[0],v1[0],v2[0], ..., v255[0]  // least significant bit of every independent value 
__m256i x[1] = v0[1],v1[1],v2[1],...,v255[1]   // 2nd least significant bit of every independent value
...

So basically x[0] should hold all least significant bits from all variables vi. On the Intel Intrinsics website I find the load functions _mm256_lddqu_si256 which reads 256bits from a memory address. So I could construct a long long variable tmp and use repeated bitshifting to get the least significant bit of every input value to the right position in tmp. Then I pass the long long variable to the load function. In total I then need to construct 16 long long variables for all x and y inputs this way.

Is this the most efficient way to initialize the variables or do you use a different approach? I see that you have a PACK function in the architecture-specific header files that is not implemented yet.

DadaIsCrazy commented 2 years ago

Hi Christopher,

Transposing a square matrix can be done efficiently (ie, better than moving bits one by one), cf https://usubalang.github.io/usuba/2020/01/14/bitslicing.html#transposition. This is the algorithm you can find in Usuba's C headers, eg https://github.com/usubalang/usuba/blob/main/arch/STD.h#L121-L134.

However, this is not so easy in your case, because your matrix is not square (8x256). Still, I have written a similar piece of code for the Rectangle cipher (64 bits block size): https://github.com/usubalang/benchmarks/blob/main/bench/rectangle/bitslice/stream.c#L167-L213. You should be able to replace for (int i = 0; i < 6; i ++) by for (int i = 0; i < 3; i ++) and for (int j = 0; j < 64; j += (2 * n)) by for (int j = 0; j < 8; j += (2 * n)), and that should work for you (i should go from 0 to log2(input_size) = log2(8) = 3 for you, and j should go from 0 to input_size = 8 for you).

Funny thing with that algorithm is that your n-th input won't end up in the bits of the n-th rank, but rather in the bits of the (n/8+n%8)th rank. This is not an issue though, since you use the same algorithm at the end for recovering the original format.

You can benchmark this with your current algorithm (if you've implemented it yet), and you should see that it's (much) faster.

Let me know if it doesn't work for some reasons ;-)

DadaIsCrazy commented 2 years ago

Whups, I just realized that my (n/8+n%8) formula is completely wrong. The proper formula I think is that the n-th input ends up in the bits of the (n%32)*8+n/32-th rank of each of the 8 registers. I've drawn a small illustration: https://docs.google.com/drawings/d/13BYfhBXAoqvmuHiLODQ4Sc1mG9t5vRyX-ZrPDXaHEbg/edit?usp=sharing (for 64-bit registers rather than 256-bit AVX, so replace 32 by 8 in the formula above). (but, once again, you shouldn't really care about that, since, anyways, bitsliced code doesn't care of where inside each register your data are, as long as each bit of your 8-bit input is in a different register)

chart21 commented 2 years ago

Hi Darius, thanks for the explanation. It worked great.

I made the following adjustements:


#define PARALLEL_FACTOR 256

...
 for (int i = 0; i < 3; i ++)
    for (int j = 0; j < 8; j += (2 * n))
...

#define simple_xor_with_ortho(a,b,c) {                                     \
    real_ortho_256x8((DATATYPE*) a);                             \
    real_ortho_256x8((DATATYPE*) b);                              \
    simpleXor__((DATATYPE*)a,(DATATYPE*)b,(DATATYPE*)c);       \
    real_ortho_256x8((DATATYPE*) c);                             \

As i had only 8 bit integers I did not need bswap.

Do you also have a function that works with flexible matrix sizes?

For instance 256x5, 512x12, ...

I want to utilize Bitslicing for general purpose Secure Multiparty Computation with boolean gates. Hence, I don't know in advance which kind of circuit the parties want to compute. Also I do not know the input sizes they require, so handcrafting these assignments for each circuit is not really an option. 256x5 might be interesting for cases where we know that an input variable has at most a value of 32 and we do not want to "waste" computation (and in case of SMC also additional network communicaiton between peers) by creating a 256x8 matrix.

So ideally I would need a function that takes as inputs an arbitrary BITLENGTH and REGISTERSIZE to fill a set of REGISTERSIZExBITLENGTH variables.

Setting j= BITLENGTH and PARALLELFACOTR = REGISTERSIZE would be the easy step.

Let's say BITLENGTH=12. I guess in this case I would still need to create a unit16 variable and use __builtin_bswap16 before calling real_ortho().

How would I need to set i though if log(BITLENGTH) is not int and I want to use 12 instead of 16 variables at the end?

Edit: I just tested it with a BITLENGTH of 5 using uint_8 and I had to modify the Usuba generated code by offsetting the indices. E.g.

c[0] = a[0] + b[0] -> c[3] = a[3] + b[3]

While this works, there might be a better way by not creating the 3 skipped variables in the first place. Maybe a breaking condition I could insert in the real_ortho() code?

chart21 commented 2 years ago

Unfortunately, I noticed that the same executable fails in some cases with a segmentation fault and executes in others. I think it is a memory problem.

When debugging it fails in the following line:

__m256i u = _mm256_and_si256(data[j + k], mask_l[i]);

With the error message: EXC_BAD_ACCESS (code=EXC_I386_GPFLT) This also happens with the original 64-bit input conversion written for rectangle. I also get a warning, e.g.:

incompatible pointer types passing 'uint64_t [64]' to parameter of type '__m256i *' [-Wincompatible-pointer-types]gcc

So I think casting the uint pointer to DATATYPE* fails sometimes. Do you have any idea how to solve this?

DadaIsCrazy commented 2 years ago

Regarding your last message: my guess is that your input buffer is not aligned on 32 bytes, which is not quite supported by AVX. To fix it, allocate your input buffer with either aligned_alloc instead of malloc, or use __attribute__((aligned(32)) if your buffer is statically allocated. If you can't change the way the buffer is allocated, then you'll have no choice but to copy your data into a 32-byte-aligned buffer.

I don't have much time right now to reply to your previous message, but, in short, I've looked into generalizing the transposition before, but didn't get very far; it's not trivial. If your input size is a power of 2, then things work nicely. Otherwise, it's much harder... Unfortunately, I don't have the time to look into it now, so I'm afraid that you're on your own... Good luck! :)

chart21 commented 2 years ago

Thanks again for your help. The aligned_alloc indeed solved the issue. I will look into the generalization problem myself then. Maybe you could answer also some high-level questions?

Does it make sense to evaluate the compiled code with multiple Threads/cores to evaluate the circuit REGISTERSIZE*NUM_CORES times in parallel? Does Multithreading or Multiprocessing make more sense in this case? Something else I should pay attention to, like how many registers my CPU has to avoid spilling?
How expressive is the Usuba language? I noticed I can type a*b that gets translated into MUL. But MUL is not defined in the headers. Is there an overview what functions I can use? For instance some existing compilers that compile high-level source code into boolean circuits also allow for expressions like: c = if_else(a > b, a, c) which then gets translated into the boolean gates that calculate this expression. I did not find any hint that this is possible with .ua code.
Why does ADD work? When you type a + b in Usuba it gets compiled into ADD(a[0],b[0],1) and so on. STD.h simply ignores the 1 and performs a[0] + b[0], a[1] + b[1], ... But if the variables are bitsliced I don't understand why this would result into a valid reconstruction of a+b after unorthogonalizing.

For instance let's assume that I want to calculate a : u3 + b : u3. One input could be 001 and the other input 011. So c = a+b should be 100. For simplicity I assume that all other 255*2 input variables only contain 0. If the input bits of a,b,c end up at the most significant entry of the bitsliced variables, I get:

a[0]=0000... b[0]=0000... c[0]= 0000... a[1]=0000... b[1]=1000... c[1]=1000... a[2]=1000... b[2]=1000... c[2]= 0000...

I basically lost the carries in the calculation and received an incorrect result. In case my other variables are not all 0s they could also mess up the calculation further with their carries. I thought that In general addition with bitsliced code requires constructing an adder with XOR and AND gates. Or is there a clever trick here that I am missing?

usubalang / usuba

Efficient way to initialize variables to pass to generated c code #66