Open ZenithalHourlyRate opened 4 days ago
How concerning! I don't immediately see why the rotate-and-reduce pass isn't properly handling the mulf/addf ops. Maybe you could run with --mlir-print-ir-after-all and --mlir-print-ir-tree-dir to print out the IR after each pass in the pipeline, and then compare the bgv/ckks versions to find the first pass at which they meaningfully differ.
I suspect the use of addf/mulf is triggering some incorrect match that causes some pattern or pass to not be applied. However, that shouldn't (absent bugs) cause the output to produce an incorrect result, it should just produce a less efficient program.
So looking for other reasons the output might be incorrect: I see this line, after rotating by 6:
result decrypted: (9, 16, -2.35721e-15, 1.85306e-15, 3.1336e-15, 5.86141e-16, -2.08615e-15, 2.21339e-15, ... )
I believe those zero values should be nonzero. In particular, I recall that the encoding used by the simd-vectorizer passes expects 1D tensors to be repeated to fill up the available ciphertext space, since the rotations analyzed are cyclic mod 8 (in your example, because it's a tensor<8xf16>), but the openfhe backend uses larger ciphertext sizes. Cf. https://github.com/google/heir/blob/51ebc8dd6eb2d8c431ee20d25a0fb5ec016aa550/tests/Examples/openfhe/simple_sum_test.cpp#L34-L42 and https://github.com/google/heir/issues/645
Could that be causing the incorrectness?
I recall the encoding used by the simd-vectorizer passes expects 1D tensors to be repeated to fill up the available ciphertext space
Oh, it is exactly this reason: after copying the cyclic filling code from test/Examples, the result is correct.
std::vector<double> x1 = {1, 2, 3, 4, 1, 2, 3, 4};
int32_t n = cryptoContext->GetCryptoParameters()->GetElementParams()->GetRingDimension() / 2;
std::vector<double> outputs;
outputs.reserve(n);
for (int i = 0; i < n; ++i) {
  outputs.push_back(x1[i % 8]);
}
const auto& ptxt1 = cryptoContext->MakeCKKSPackedPlaintext(outputs);
const auto& c1 = cryptoContext->Encrypt(keyPair.publicKey, ptxt1);
run with --mlir-print-ir-after-all and --mlir-print-ir-tree-dir
The difference is that the first apply-folders after full-loop-unroll does eliminate the constant 0 (i.e. %c0_si16 = arith.constant 0 : i16) for BGV, since adding integer zero is an equivalence-preserving fold, but it does not eliminate the floating-point constant 0.0; rotate-and-reduce then cannot recognize the pattern (I thought I had supported this pattern in past PRs?). If I manually delete the constant after full-loop-unroll and rerun the passes, a correct rotate-and-reduce version comes out.
I thought I have supported this pattern in past PRs?
The current pattern is
%cst = arith.constant dense<0.000000e+00> : tensor<8xf16>
%1 = arith.mulf %arg2, %arg3 : tensor<8xf16>
%2 = arith.addf %1, %cst : tensor<8xf16>
%3 = tensor_ext.rotate %2, %c6 : tensor<8xf16>, index
%4 = tensor_ext.rotate %1, %c7 : tensor<8xf16>, index
// mixed use of %1 and %2 afterwards.
This means %3 and %4 do not have the same root in rotation-analysis.
The constant tensor could be saved so that %3/%4 can share a root, as mentioned in #522, but that is hard to handle because the constant tensor gets rotated later in the chain. I avoided handling saved tensors in earlier PRs. And for this specific case, I think apply-folders should eliminate this constant tensor.
Full IR below
module {
  func.func @dot_product(%arg0: !secret.secret<tensor<8xf16>>, %arg1: !secret.secret<tensor<8xf16>>) -> !secret.secret<f16> {
    %c7 = arith.constant 7 : index
    %cst = arith.constant dense<0.000000e+00> : tensor<8xf16>
    %c6 = arith.constant 6 : index
    %0 = secret.generic ins(%arg0, %arg1 : !secret.secret<tensor<8xf16>>, !secret.secret<tensor<8xf16>>) {
    ^bb0(%arg2: tensor<8xf16>, %arg3: tensor<8xf16>):
      %1 = arith.mulf %arg2, %arg3 : tensor<8xf16>
      %2 = arith.addf %1, %cst : tensor<8xf16>
      %3 = tensor_ext.rotate %2, %c6 : tensor<8xf16>, index
      %4 = tensor_ext.rotate %1, %c7 : tensor<8xf16>, index
      %5 = arith.addf %3, %4 : tensor<8xf16>
      %6 = arith.addf %5, %1 : tensor<8xf16>
      %7 = tensor_ext.rotate %6, %c6 : tensor<8xf16>, index
      %8 = arith.addf %7, %4 : tensor<8xf16>
      %9 = arith.addf %8, %1 : tensor<8xf16>
      %10 = tensor_ext.rotate %9, %c6 : tensor<8xf16>, index
      %11 = arith.addf %10, %4 : tensor<8xf16>
      %12 = arith.addf %11, %1 : tensor<8xf16>
      %13 = tensor_ext.rotate %12, %c7 : tensor<8xf16>, index
      %14 = arith.addf %13, %1 : tensor<8xf16>
      %extracted = tensor.extract %14[%c7] : tensor<8xf16>
      secret.yield %extracted : f16
    } -> !secret.secret<f16>
    return %0 : !secret.secret<f16>
  }
}
Or consider this case migrated to BGV: a dot product with a non-zero initial sum, which rotate-and-reduce also cannot handle correctly. Should we handle this in insert-rotate so that the inserted rotations can share a root? Right now the mixed use of %1 and %2 is not friendly to the analysis.
func.func @dot_product(%arg0: tensor<8xi16>, %arg1: tensor<8xi16>) -> i16 {
  %c0 = arith.constant 0 : index
  %c0_si16 = arith.constant 10 : i16
  %0 = affine.for %arg2 = 0 to 8 iter_args(%iter = %c0_si16) -> (i16) {
    %1 = tensor.extract %arg0[%arg2] : tensor<8xi16>
    %2 = tensor.extract %arg1[%arg2] : tensor<8xi16>
    %3 = arith.muli %1, %2 : i16
    %4 = arith.addi %iter, %3 : i16
    affine.yield %4 : i16
  }
  return %0 : i16
}
We get:
module {
  func.func @dot_product(%arg0: !secret.secret<tensor<8xi16>>, %arg1: !secret.secret<tensor<8xi16>>) -> !secret.secret<i16> {
    %c6 = arith.constant 6 : index
    %cst = arith.constant dense<10> : tensor<8xi16>
    %c7 = arith.constant 7 : index
    %0 = secret.generic ins(%arg0, %arg1 : !secret.secret<tensor<8xi16>>, !secret.secret<tensor<8xi16>>) {
    ^bb0(%arg2: tensor<8xi16>, %arg3: tensor<8xi16>):
      %1 = arith.muli %arg2, %arg3 : tensor<8xi16>
      %2 = arith.addi %1, %cst : tensor<8xi16>
      %3 = tensor_ext.rotate %2, %c6 : tensor<8xi16>, index
      %4 = tensor_ext.rotate %1, %c7 : tensor<8xi16>, index
      %5 = arith.addi %3, %4 : tensor<8xi16>
      %6 = arith.addi %5, %1 : tensor<8xi16>
      %7 = tensor_ext.rotate %6, %c6 : tensor<8xi16>, index
      %8 = arith.addi %7, %4 : tensor<8xi16>
      %9 = arith.addi %8, %1 : tensor<8xi16>
      %10 = tensor_ext.rotate %9, %c6 : tensor<8xi16>, index
      %11 = arith.addi %10, %4 : tensor<8xi16>
      %12 = arith.addi %11, %1 : tensor<8xi16>
      %13 = tensor_ext.rotate %12, %c7 : tensor<8xi16>, index
      %14 = arith.addi %13, %1 : tensor<8xi16>
      %extracted = tensor.extract %14[%c7] : tensor<8xi16>
      secret.yield %extracted : i16
    } -> !secret.secret<i16>
    return %0 : !secret.secret<i16>
  }
}
I'm not sure I fully understand what you're suggesting, but let me try to restate it:
does not eliminate the constant 0.0
If we could figure out why the floating-point constant is not folded away (likely because x + 0.0 is not an identity under IEEE-754 semantics — it maps -0.0 to 0.0 — so the folder can only apply it under fast-math-style assumptions), we could solve the immediate problem. But I think in your example with a non-zero initial value for the dot product, the problem would persist in another form. It could be fixed by changing insert-rotate so that it aligns things properly, or rotate-and-reduce could be changed to recognize a reduction in a smarter way than looking at a single linear chain (#522).
I support either of those improvements. I think https://github.com/google/heir/issues/521 might also allow a workaround wherein rotate-and-reduce works for the entire vector except the first element in the chain, which would be nearly optimal and give other side benefits to IRs that don't do complete reductions.
I tried to migrate the bgv dot product example to CKKS, and the result is incorrect. After inspection, it seems that the transformation done by heir-simd-vectorizer is incorrect.
The input for the pipeline is below; it is just the bgv example with all i16 substituted with f16.
After running --mlir-to-secret-arithmetic="entry-function=dot_product", we get an IR that is apparently different from the result for the bgv dot product, where a rotate-and-reduce pattern is working.
If we execute the emitted ckks code with input (1, 2, 3, 4, 1, 2, 3, 4), we get an incorrect result, with traces like this: