saturated 64-bit curve25519 tracker issue

andres-erbsen commented 4 years ago

The current fastest known (non-vectorized) implementations of curve25519 use 4 limbs, and IIRC do arithmetic modulo 2^256-38. Could we generate C code that does that? When compiled with a conventional C compiler, this code would likely be slower than our current unsaturated output, but it would be uniquely suitable for a case study of low-level compilation performance.

jadephilipoom commented 4 years ago

~From the CLI it fails check_args because the tight-bounds computation comes out higher than the max integer value. So this might require some specialized choices of bounds and carry chain.~

That was the result for unsaturated, I didn't read. Saturated solinas (you did mean solinas, right?) hangs and then gets killed. What's the state of saturated solinas binaries in general, anyway?

JasonGross commented 4 years ago

Maybe you passed the wrong arguments, and had it computing 2^255-19 in unary? ./src/ExtractionOCaml/saturated_solinas curve25519 '2^255-19' 64 finished pretty much instantly and says

check_args
/* Autogenerated: ./src/ExtractionOCaml/saturated_solinas curve25519 2^255-19 64 */
/* curve description: curve25519 */
/* requested operations: (all) */
/* s-c = 2^255 - [(1, 19)] (from "2^255-19") */
/* machine_wordsize = 64 (from "64") */

#include <stdint.h>

#if (-1 & 3) != 3
#error "This code only works on a two's complement system"
#endif

In fiat_curve25519_mul:
Stringification failed on the syntax tree:
(λ x1 x2,
  expr_let x3 := Z.mul_split(2^64, (uint64_t)x1[3], (uint64_t)x2[3]) (* : uint64_t, uint64_t *) in
  expr_let x4 := Z.mul_split(2^64, (uint64_t)x1[3], (uint64_t)x2[2]) (* : uint64_t, uint64_t *) in
  expr_let x5 := Z.mul_split(2^64, (uint64_t)x1[3], (uint64_t)x2[1]) (* : uint64_t, uint64_t *) in
  expr_let x6 := Z.mul_split(2^64, (uint64_t)x1[3], (uint64_t)x2[0]) (* : uint64_t, uint64_t *) in
  expr_let x7 := Z.mul_split(2^64, (uint64_t)x1[2], (uint64_t)x2[3]) (* : uint64_t, uint64_t *) in
  expr_let x8 := Z.mul_split(2^64, (uint64_t)x1[2], (uint64_t)x2[2]) (* : uint64_t, uint64_t *) in
  expr_let x9 := Z.mul_split(2^64, (uint64_t)x1[2], (uint64_t)x2[1]) (* : uint64_t, uint64_t *) in
  expr_let x10 := Z.mul_split(2^64, (uint64_t)x1[2], (uint64_t)x2[0]) (* : uint64_t, uint64_t *) in
  expr_let x11 := Z.mul_split(2^64, (uint64_t)x1[1], (uint64_t)x2[3]) (* : uint64_t, uint64_t *) in
  expr_let x12 := Z.mul_split(2^64, (uint64_t)x1[1], (uint64_t)x2[2]) (* : uint64_t, uint64_t *) in
  expr_let x13 := Z.mul_split(2^64, (uint64_t)x1[1], (uint64_t)x2[1]) (* : uint64_t, uint64_t *) in
  expr_let x14 := Z.mul_split(2^64, (uint64_t)x1[1], (uint64_t)x2[0]) (* : uint64_t, uint64_t *) in
  expr_let x15 := Z.mul_split(2^64, (uint64_t)x1[0], (uint64_t)x2[3]) (* : uint64_t, uint64_t *) in
  expr_let x16 := Z.mul_split(2^64, (uint64_t)x1[0], (uint64_t)x2[2]) (* : uint64_t, uint64_t *) in
  expr_let x17 := Z.mul_split(2^64, (uint64_t)x1[0], (uint64_t)x2[1]) (* : uint64_t, uint64_t *) in
  expr_let x18 := Z.mul_split(2^64, (uint64_t)x1[0], (uint64_t)x2[0]) (* : uint64_t, uint64_t *) in
  expr_let x19 := Z.mul_split(2^64, 19, (uint64_t)x3₂) (* : uint64_t, uint64_t *) in
  expr_let x20 := Z.mul_split(2^64, 19, (uint64_t)x3₁) (* : uint64_t, uint64_t *) in
  expr_let x21 := Z.mul_split(2^64, 19, (uint64_t)x4₂) (* : uint64_t, uint64_t *) in
  expr_let x22 := Z.mul_split(2^64, 19, (uint64_t)x4₁) (* : uint64_t, uint64_t *) in
  expr_let x23 := Z.mul_split(2^64, 19, (uint64_t)x5₂) (* : uint64_t, uint64_t *) in
  expr_let x24 := Z.mul_split(2^64, 19, (uint64_t)x5₁) (* : uint64_t, uint64_t *) in
  expr_let x25 := Z.mul_split(2^64, 19, (uint64_t)x6₂) (* : uint64_t, uint64_t *) in
  expr_let x26 := Z.mul_split(2^64, 19, (uint64_t)x7₂) (* : uint64_t, uint64_t *) in
  expr_let x27 := Z.mul_split(2^64, 19, (uint64_t)x7₁) (* : uint64_t, uint64_t *) in
  expr_let x28 := Z.mul_split(2^64, 19, (uint64_t)x8₂) (* : uint64_t, uint64_t *) in
  expr_let x29 := Z.mul_split(2^64, 19, (uint64_t)x8₁) (* : uint64_t, uint64_t *) in
  expr_let x30 := Z.mul_split(2^64, 19, (uint64_t)x9₂) (* : uint64_t, uint64_t *) in
  expr_let x31 := Z.mul_split(2^64, 19, (uint64_t)x11₂) (* : uint64_t, uint64_t *) in
  expr_let x32 := Z.mul_split(2^64, 19, (uint64_t)x11₁) (* : uint64_t, uint64_t *) in
  expr_let x33 := Z.mul_split(2^64, 19, (uint64_t)x12₂) (* : uint64_t, uint64_t *) in
  expr_let x34 := Z.mul_split(2^64, 19, (uint64_t)x15₂) (* : uint64_t, uint64_t *) in
  expr_let x35 := x19₂ << 65 (* : [0x0 ~> 0x240000000000000000] *) in
  expr_let x36 := x19₁ * 2 (* : [0x0 ~> 0x1fffffffffffffffe] *) in
  expr_let x37 := x20₂ * 2 (* : uint64_t *) in
  expr_let x38 := x20₁ * 2 (* : [0x0 ~> 0x1fffffffffffffffe] *) in
  expr_let x39 := x21₂ * 2 (* : uint64_t *) in
  expr_let x40 := x21₁ * 2 (* : [0x0 ~> 0x1fffffffffffffffe] *) in
  expr_let x41 := x22₂ * 2 (* : uint64_t *) in
  expr_let x42 := x22₁ * 2 (* : [0x0 ~> 0x1fffffffffffffffe] *) in
  expr_let x43 := x23₂ * 2 (* : uint64_t *) in
  expr_let x44 := x23₁ * 2 (* : [0x0 ~> 0x1fffffffffffffffe] *) in
  expr_let x45 := x24₂ * 2 (* : uint64_t *) in
  expr_let x46 := x24₁ * 2 (* : [0x0 ~> 0x1fffffffffffffffe] *) in
  expr_let x47 := x25₂ * 2 (* : uint64_t *) in
  expr_let x48 := x25₁ * 2 (* : [0x0 ~> 0x1fffffffffffffffe] *) in
  expr_let x49 := x26₂ * 2 (* : uint64_t *) in
  expr_let x50 := x26₁ * 2 (* : [0x0 ~> 0x1fffffffffffffffe] *) in
  expr_let x51 := x27₂ * 2 (* : uint64_t *) in
  expr_let x52 := x27₁ * 2 (* : [0x0 ~> 0x1fffffffffffffffe] *) in
  expr_let x53 := x28₂ * 2 (* : uint64_t *) in
  expr_let x54 := x28₁ * 2 (* : [0x0 ~> 0x1fffffffffffffffe] *) in
  expr_let x55 := x29₂ * 2 (* : uint64_t *) in
  expr_let x56 := x29₁ * 2 (* : [0x0 ~> 0x1fffffffffffffffe] *) in
  expr_let x57 := x30₂ * 2 (* : uint64_t *) in
  expr_let x58 := x30₁ * 2 (* : [0x0 ~> 0x1fffffffffffffffe] *) in
  expr_let x59 := x31₂ * 2 (* : uint64_t *) in
  expr_let x60 := x31₁ * 2 (* : [0x0 ~> 0x1fffffffffffffffe] *) in
  expr_let x61 := x32₂ * 2 (* : uint64_t *) in
  expr_let x62 := x32₁ * 2 (* : [0x0 ~> 0x1fffffffffffffffe] *) in
  expr_let x63 := x33₂ * 2 (* : uint64_t *) in
  expr_let x64 := x33₁ * 2 (* : [0x0 ~> 0x1fffffffffffffffe] *) in
  expr_let x65 := x34₂ * 2 (* : uint64_t *) in
  expr_let x66 := x34₁ * 2 (* : [0x0 ~> 0x1fffffffffffffffe] *) in
  expr_let x67 := Z.add_get_carry(2^64, (uint64_t)x18₂, ([0x0 ~> 0x1fffffffffffffffe])x42) (* : uint64_t, uint64_t *) in
  expr_let x68 := Z.add_with_get_carry(2^64, (uint64_t)x67₂, (uint64_t)x17₂, 0) (* : uint64_t, uint1_t *) in
  expr_let x69 := x68₂ + x16₂ (* : uint64_t *) in
  expr_let x70 := Z.add_get_carry(2^64, (uint64_t)x67₁, ([0x0 ~> 0x1fffffffffffffffe])x44) (* : uint64_t, uint64_t *) in
  expr_let x71 := Z.add_with_get_carry(2^64, (uint64_t)x70₂, (uint64_t)x68₁, 0) (* : uint64_t, uint1_t *) in
  expr_let x72 := Z.add_with_get_carry(2^64, (uint1_t)x71₂, (uint64_t)x69, 0) (* : uint64_t, uint1_t *) in
  expr_let x73 := Z.add_get_carry(2^64, (uint64_t)x70₁, (uint64_t)x45) (* : uint64_t, uint1_t *) in
  expr_let x74 := Z.add_with_get_carry(2^64, (uint1_t)x73₂, (uint64_t)x71₁, ([0x0 ~> 0x1fffffffffffffffe])x38) (* : uint64_t, uint64_t *) in
  expr_let x75 := Z.add_with_get_carry(2^64, (uint64_t)x74₂, (uint64_t)x72₁, 0) (* : uint64_t, uint1_t *) in
  expr_let x76 := Z.add_get_carry(2^64, (uint64_t)x73₁, (uint64_t)x47) (* : uint64_t, uint1_t *) in
  expr_let x77 := Z.add_with_get_carry(2^64, (uint1_t)x76₂, (uint64_t)x74₁, ([0x0 ~> 0x1fffffffffffffffe])x40) (* : uint64_t, uint64_t *) in
  expr_let x78 := Z.add_with_get_carry(2^64, (uint64_t)x77₂, (uint64_t)x75₁, ([0x0 ~> 0x240000000000000000])x35) (* : uint64_t, uint64_t *) in
  expr_let x79 := Z.add_get_carry(2^64, (uint64_t)x76₁, ([0x0 ~> 0x1fffffffffffffffe])x52) (* : uint64_t, uint64_t *) in
  expr_let x80 := Z.add_with_get_carry(2^64, (uint64_t)x79₂, (uint64_t)x77₁, (uint64_t)x41) (* : uint64_t, uint1_t *) in
  expr_let x81 := Z.add_with_get_carry(2^64, (uint1_t)x80₂, (uint64_t)x78₁, ([0x0 ~> 0x1fffffffffffffffe])x36) (* : uint64_t, uint64_t *) in
  expr_let x82 := Z.add_get_carry(2^64, (uint64_t)x79₁, ([0x0 ~> 0x1fffffffffffffffe])x54) (* : uint64_t, uint64_t *) in
  expr_let x83 := Z.add_with_get_carry(2^64, (uint64_t)x82₂, (uint64_t)x80₁, (uint64_t)x43) (* : uint64_t, uint1_t *) in
  expr_let x84 := Z.add_with_get_carry(2^64, (uint1_t)x83₂, (uint64_t)x81₁, (uint64_t)x37) (* : uint64_t, uint1_t *) in
  expr_let x85 := Z.add_get_carry(2^64, (uint64_t)x82₁, (uint64_t)x55) (* : uint64_t, uint1_t *) in
  expr_let x86 := Z.add_with_get_carry(2^64, (uint1_t)x85₂, (uint64_t)x83₁, ([0x0 ~> 0x1fffffffffffffffe])x50) (* : uint64_t, uint64_t *) in
  expr_let x87 := Z.add_with_get_carry(2^64, (uint64_t)x86₂, (uint64_t)x84₁, (uint64_t)x39) (* : uint64_t, uint1_t *) in
  expr_let x88 := Z.add_get_carry(2^64, (uint64_t)x18₁, ([0x0 ~> 0x1fffffffffffffffe])x46) (* : uint64_t, uint64_t *) in
  expr_let x89 := Z.add_with_get_carry(2^64, (uint64_t)x88₂, (uint64_t)x85₁, (uint64_t)x57) (* : uint64_t, uint1_t *) in
  expr_let x90 := Z.add_with_get_carry(2^64, (uint1_t)x89₂, (uint64_t)x86₁, (uint64_t)x51) (* : uint64_t, uint1_t *) in
  expr_let x91 := Z.add_with_get_carry(2^64, (uint1_t)x90₂, (uint64_t)x87₁, (uint64_t)x49) (* : uint64_t, uint1_t *) in
  expr_let x92 := Z.add_get_carry(2^64, (uint64_t)x88₁, ([0x0 ~> 0x1fffffffffffffffe])x48) (* : uint64_t, uint64_t *) in
  expr_let x93 := Z.add_with_get_carry(2^64, (uint64_t)x92₂, (uint64_t)x89₁, ([0x0 ~> 0x1fffffffffffffffe])x60) (* : uint64_t, uint64_t *) in
  expr_let x94 := Z.add_with_get_carry(2^64, (uint64_t)x93₂, (uint64_t)x90₁, (uint64_t)x53) (* : uint64_t, uint1_t *) in
  expr_let x95 := Z.add_with_get_carry(2^64, (uint1_t)x94₂, (uint64_t)x91₁, (uint64_t)x6₁) (* : uint64_t, uint1_t *) in
  expr_let x96 := Z.add_get_carry(2^64, (uint64_t)x92₁, ([0x0 ~> 0x1fffffffffffffffe])x56) (* : uint64_t, uint64_t *) in
  expr_let x97 := Z.add_with_get_carry(2^64, (uint64_t)x96₂, (uint64_t)x93₁, (uint64_t)x61) (* : uint64_t, uint1_t *) in
  expr_let x98 := Z.add_with_get_carry(2^64, (uint1_t)x97₂, (uint64_t)x94₁, (uint64_t)x59) (* : uint64_t, uint1_t *) in
  expr_let x99 := Z.add_with_get_carry(2^64, (uint1_t)x98₂, (uint64_t)x95₁, (uint64_t)x9₁) (* : uint64_t, uint1_t *) in
  expr_let x100 := Z.add_get_carry(2^64, (uint64_t)x96₁, ([0x0 ~> 0x1fffffffffffffffe])x58) (* : uint64_t, uint64_t *) in
  expr_let x101 := Z.add_with_get_carry(2^64, (uint64_t)x100₂, (uint64_t)x97₁, (uint64_t)x63) (* : uint64_t, uint1_t *) in
  expr_let x102 := Z.add_with_get_carry(2^64, (uint1_t)x101₂, (uint64_t)x98₁, (uint64_t)x10₁) (* : uint64_t, uint1_t *) in
  expr_let x103 := Z.add_with_get_carry(2^64, (uint1_t)x102₂, (uint64_t)x99₁, (uint64_t)x10₂) (* : uint64_t, uint1_t *) in
  expr_let x104 := Z.add_get_carry(2^64, (uint64_t)x100₁, ([0x0 ~> 0x1fffffffffffffffe])x62) (* : uint64_t, uint64_t *) in
  expr_let x105 := Z.add_with_get_carry(2^64, (uint64_t)x104₂, (uint64_t)x101₁, (uint64_t)x65) (* : uint64_t, uint1_t *) in
  expr_let x106 := Z.add_with_get_carry(2^64, (uint1_t)x105₂, (uint64_t)x102₁, (uint64_t)x13₁) (* : uint64_t, uint1_t *) in
  expr_let x107 := Z.add_with_get_carry(2^64, (uint1_t)x106₂, (uint64_t)x103₁, (uint64_t)x12₁) (* : uint64_t, uint1_t *) in
  expr_let x108 := Z.add_get_carry(2^64, (uint64_t)x104₁, ([0x0 ~> 0x1fffffffffffffffe])x64) (* : uint64_t, uint64_t *) in
  expr_let x109 := Z.add_with_get_carry(2^64, (uint64_t)x108₂, (uint64_t)x105₁, (uint64_t)x14₁) (* : uint64_t, uint64_t *) in
  expr_let x110 := Z.add_with_get_carry(2^64, (uint64_t)x109₂, (uint64_t)x106₁, (uint64_t)x14₂) (* : uint64_t, uint1_t *) in
  expr_let x111 := Z.add_with_get_carry(2^64, (uint1_t)x110₂, (uint64_t)x107₁, (uint64_t)x13₂) (* : uint64_t, uint1_t *) in
  expr_let x112 := Z.add_get_carry(2^64, (uint64_t)x108₁, ([0x0 ~> 0x1fffffffffffffffe])x66) (* : uint64_t, uint64_t *) in
  expr_let x113 := Z.add_with_get_carry(2^64, (uint64_t)x112₂, (uint64_t)x109₁, (uint64_t)x17₁) (* : uint64_t, uint64_t *) in
  expr_let x114 := Z.add_with_get_carry(2^64, (uint64_t)x113₂, (uint64_t)x110₁, (uint64_t)x16₁) (* : uint64_t, uint64_t *) in
  expr_let x115 := Z.add_with_get_carry(2^64, (uint64_t)x114₂, (uint64_t)x111₁, (uint64_t)x15₁) (* : uint64_t, uint64_t *) in
  expr_let x116 := x72₂ + x75₂ + x78₂ + x81₂ + x84₂ + x87₂ + x91₂ + x95₂ + x99₂ + x103₂ + x107₂ + x111₂ + x115₂ (* : uint64_t *) in
  ((uint64_t)x112₁ :: (uint64_t)x113₁ :: (uint64_t)x114₁ :: (uint64_t)x115₁ :: [], (uint64_t)x116)
)
Errors in converting fiat_curve25519_mul to C:
Final bounds check failed on second argument to Z.add_get_carry; expected an unsigned 64-bit number (relaxed to uint64), but found a uint128.
Final bounds check failed on second return value of Z.add_get_carry; expected an unsigned 1-bit number (relaxed to uint1), but found a uint64.

Fatal error: exception Failure("Synthesis failed")

JasonGross commented 4 years ago

I find x19₂ << 65 very suspicious

andres-erbsen commented 4 years ago

yes I mean Solinas. and I would like to repeat the guess that perhaps the modulus would be 2^256-38

JasonGross commented 4 years ago

yes I mean Solinas. and I would like to repeat the guess that perhaps the modulus would be 2^256-38

check_args
/* Autogenerated: ./src/ExtractionOCaml/saturated_solinas curve25519 2^256-38 64 */
/* curve description: curve25519 */
/* requested operations: (all) */
/* s-c = 2^256 - [(1, 38)] (from "2^256-38") */
/* machine_wordsize = 64 (from "64") */

#include <stdint.h>

#if (-1 & 3) != 3
#error "This code only works on a two's complement system"
#endif

In fiat_curve25519_mul:
Stringification failed on the syntax tree:
(λ x1 x2,
  expr_let x3 := Z.mul_split(2^64, (uint64_t)x1[3], (uint64_t)x2[3]) (* : uint64_t, uint64_t *) in
  expr_let x4 := Z.mul_split(2^64, (uint64_t)x1[3], (uint64_t)x2[2]) (* : uint64_t, uint64_t *) in
  expr_let x5 := Z.mul_split(2^64, (uint64_t)x1[3], (uint64_t)x2[1]) (* : uint64_t, uint64_t *) in
  expr_let x6 := Z.mul_split(2^64, (uint64_t)x1[3], (uint64_t)x2[0]) (* : uint64_t, uint64_t *) in
  expr_let x7 := Z.mul_split(2^64, (uint64_t)x1[2], (uint64_t)x2[3]) (* : uint64_t, uint64_t *) in
  expr_let x8 := Z.mul_split(2^64, (uint64_t)x1[2], (uint64_t)x2[2]) (* : uint64_t, uint64_t *) in
  expr_let x9 := Z.mul_split(2^64, (uint64_t)x1[2], (uint64_t)x2[1]) (* : uint64_t, uint64_t *) in
  expr_let x10 := Z.mul_split(2^64, (uint64_t)x1[2], (uint64_t)x2[0]) (* : uint64_t, uint64_t *) in
  expr_let x11 := Z.mul_split(2^64, (uint64_t)x1[1], (uint64_t)x2[3]) (* : uint64_t, uint64_t *) in
  expr_let x12 := Z.mul_split(2^64, (uint64_t)x1[1], (uint64_t)x2[2]) (* : uint64_t, uint64_t *) in
  expr_let x13 := Z.mul_split(2^64, (uint64_t)x1[1], (uint64_t)x2[1]) (* : uint64_t, uint64_t *) in
  expr_let x14 := Z.mul_split(2^64, (uint64_t)x1[1], (uint64_t)x2[0]) (* : uint64_t, uint64_t *) in
  expr_let x15 := Z.mul_split(2^64, (uint64_t)x1[0], (uint64_t)x2[3]) (* : uint64_t, uint64_t *) in
  expr_let x16 := Z.mul_split(2^64, (uint64_t)x1[0], (uint64_t)x2[2]) (* : uint64_t, uint64_t *) in
  expr_let x17 := Z.mul_split(2^64, (uint64_t)x1[0], (uint64_t)x2[1]) (* : uint64_t, uint64_t *) in
  expr_let x18 := Z.mul_split(2^64, (uint64_t)x1[0], (uint64_t)x2[0]) (* : uint64_t, uint64_t *) in
  expr_let x19 := Z.mul_split(2^64, 38, (uint64_t)x3₂) (* : uint64_t, uint64_t *) in
  expr_let x20 := Z.mul_split(2^64, 38, (uint64_t)x3₁) (* : uint64_t, uint64_t *) in
  expr_let x21 := Z.mul_split(2^64, 38, (uint64_t)x4₂) (* : uint64_t, uint64_t *) in
  expr_let x22 := Z.mul_split(2^64, 38, (uint64_t)x4₁) (* : uint64_t, uint64_t *) in
  expr_let x23 := Z.mul_split(2^64, 38, (uint64_t)x5₂) (* : uint64_t, uint64_t *) in
  expr_let x24 := Z.mul_split(2^64, 38, (uint64_t)x5₁) (* : uint64_t, uint64_t *) in
  expr_let x25 := Z.mul_split(2^64, 38, (uint64_t)x6₂) (* : uint64_t, uint64_t *) in
  expr_let x26 := Z.mul_split(2^64, 38, (uint64_t)x7₂) (* : uint64_t, uint64_t *) in
  expr_let x27 := Z.mul_split(2^64, 38, (uint64_t)x7₁) (* : uint64_t, uint64_t *) in
  expr_let x28 := Z.mul_split(2^64, 38, (uint64_t)x8₂) (* : uint64_t, uint64_t *) in
  expr_let x29 := Z.mul_split(2^64, 38, (uint64_t)x8₁) (* : uint64_t, uint64_t *) in
  expr_let x30 := Z.mul_split(2^64, 38, (uint64_t)x9₂) (* : uint64_t, uint64_t *) in
  expr_let x31 := Z.mul_split(2^64, 38, (uint64_t)x11₂) (* : uint64_t, uint64_t *) in
  expr_let x32 := Z.mul_split(2^64, 38, (uint64_t)x11₁) (* : uint64_t, uint64_t *) in
  expr_let x33 := Z.mul_split(2^64, 38, (uint64_t)x12₂) (* : uint64_t, uint64_t *) in
  expr_let x34 := Z.mul_split(2^64, 38, (uint64_t)x15₂) (* : uint64_t, uint64_t *) in
  expr_let x35 := x19₂ << 64 (* : [0x0 ~> 0x250000000000000000] *) in
  expr_let x36 := Z.add_get_carry(2^64, (uint64_t)x18₂, (uint64_t)x22₁) (* : uint64_t, uint1_t *) in
  expr_let x37 := x36₂ + x17₂ (* : uint64_t *) in
  expr_let x38 := Z.add_get_carry(2^64, (uint64_t)x36₁, (uint64_t)x23₁) (* : uint64_t, uint1_t *) in
  expr_let x39 := Z.add_with_get_carry(2^64, (uint1_t)x38₂, (uint64_t)x37, 0) (* : uint64_t, uint1_t *) in
  expr_let x40 := x39₂ + x16₂ (* : uint64_t *) in
  expr_let x41 := Z.add_get_carry(2^64, (uint64_t)x38₁, (uint64_t)x24₂) (* : uint64_t, uint1_t *) in
  expr_let x42 := Z.add_with_get_carry(2^64, (uint1_t)x41₂, (uint64_t)x39₁, (uint64_t)x20₁) (* : uint64_t, uint1_t *) in
  expr_let x43 := Z.add_with_get_carry(2^64, (uint1_t)x42₂, (uint64_t)x40, 0) (* : uint64_t, uint1_t *) in
  expr_let x44 := Z.add_get_carry(2^64, (uint64_t)x41₁, (uint64_t)x25₂) (* : uint64_t, uint1_t *) in
  expr_let x45 := Z.add_with_get_carry(2^64, (uint1_t)x44₂, (uint64_t)x42₁, (uint64_t)x21₁) (* : uint64_t, uint1_t *) in
  expr_let x46 := Z.add_with_get_carry(2^64, (uint1_t)x45₂, (uint64_t)x43₁, ([0x0 ~> 0x250000000000000000])x35) (* : uint64_t, uint64_t *) in
  expr_let x47 := Z.add_get_carry(2^64, (uint64_t)x44₁, (uint64_t)x27₁) (* : uint64_t, uint1_t *) in
  expr_let x48 := Z.add_with_get_carry(2^64, (uint1_t)x47₂, (uint64_t)x45₁, (uint64_t)x22₂) (* : uint64_t, uint1_t *) in
  expr_let x49 := Z.add_with_get_carry(2^64, (uint1_t)x48₂, (uint64_t)x46₁, (uint64_t)x19₁) (* : uint64_t, uint1_t *) in
  expr_let x50 := Z.add_get_carry(2^64, (uint64_t)x47₁, (uint64_t)x28₁) (* : uint64_t, uint1_t *) in
  expr_let x51 := Z.add_with_get_carry(2^64, (uint1_t)x50₂, (uint64_t)x48₁, (uint64_t)x23₂) (* : uint64_t, uint1_t *) in
  expr_let x52 := Z.add_with_get_carry(2^64, (uint1_t)x51₂, (uint64_t)x49₁, (uint64_t)x20₂) (* : uint64_t, uint1_t *) in
  expr_let x53 := Z.add_get_carry(2^64, (uint64_t)x50₁, (uint64_t)x29₂) (* : uint64_t, uint1_t *) in
  expr_let x54 := Z.add_with_get_carry(2^64, (uint1_t)x53₂, (uint64_t)x51₁, (uint64_t)x26₁) (* : uint64_t, uint1_t *) in
  expr_let x55 := Z.add_with_get_carry(2^64, (uint1_t)x54₂, (uint64_t)x52₁, (uint64_t)x21₂) (* : uint64_t, uint1_t *) in
  expr_let x56 := Z.add_get_carry(2^64, (uint64_t)x18₁, (uint64_t)x24₁) (* : uint64_t, uint1_t *) in
  expr_let x57 := Z.add_with_get_carry(2^64, (uint1_t)x56₂, (uint64_t)x53₁, (uint64_t)x30₂) (* : uint64_t, uint1_t *) in
  expr_let x58 := Z.add_with_get_carry(2^64, (uint1_t)x57₂, (uint64_t)x54₁, (uint64_t)x27₂) (* : uint64_t, uint1_t *) in
  expr_let x59 := Z.add_with_get_carry(2^64, (uint1_t)x58₂, (uint64_t)x55₁, (uint64_t)x26₂) (* : uint64_t, uint1_t *) in
  expr_let x60 := Z.add_get_carry(2^64, (uint64_t)x56₁, (uint64_t)x25₁) (* : uint64_t, uint1_t *) in
  expr_let x61 := Z.add_with_get_carry(2^64, (uint1_t)x60₂, (uint64_t)x57₁, (uint64_t)x31₁) (* : uint64_t, uint1_t *) in
  expr_let x62 := Z.add_with_get_carry(2^64, (uint1_t)x61₂, (uint64_t)x58₁, (uint64_t)x28₂) (* : uint64_t, uint1_t *) in
  expr_let x63 := Z.add_with_get_carry(2^64, (uint1_t)x62₂, (uint64_t)x59₁, (uint64_t)x6₁) (* : uint64_t, uint1_t *) in
  expr_let x64 := Z.add_get_carry(2^64, (uint64_t)x60₁, (uint64_t)x29₁) (* : uint64_t, uint1_t *) in
  expr_let x65 := Z.add_with_get_carry(2^64, (uint1_t)x64₂, (uint64_t)x61₁, (uint64_t)x32₂) (* : uint64_t, uint1_t *) in
  expr_let x66 := Z.add_with_get_carry(2^64, (uint1_t)x65₂, (uint64_t)x62₁, (uint64_t)x31₂) (* : uint64_t, uint1_t *) in
  expr_let x67 := Z.add_with_get_carry(2^64, (uint1_t)x66₂, (uint64_t)x63₁, (uint64_t)x9₁) (* : uint64_t, uint1_t *) in
  expr_let x68 := Z.add_get_carry(2^64, (uint64_t)x64₁, (uint64_t)x30₁) (* : uint64_t, uint1_t *) in
  expr_let x69 := Z.add_with_get_carry(2^64, (uint1_t)x68₂, (uint64_t)x65₁, (uint64_t)x33₂) (* : uint64_t, uint1_t *) in
  expr_let x70 := Z.add_with_get_carry(2^64, (uint1_t)x69₂, (uint64_t)x66₁, (uint64_t)x10₁) (* : uint64_t, uint1_t *) in
  expr_let x71 := Z.add_with_get_carry(2^64, (uint1_t)x70₂, (uint64_t)x67₁, (uint64_t)x10₂) (* : uint64_t, uint1_t *) in
  expr_let x72 := Z.add_get_carry(2^64, (uint64_t)x68₁, (uint64_t)x32₁) (* : uint64_t, uint1_t *) in
  expr_let x73 := Z.add_with_get_carry(2^64, (uint1_t)x72₂, (uint64_t)x69₁, (uint64_t)x34₂) (* : uint64_t, uint1_t *) in
  expr_let x74 := Z.add_with_get_carry(2^64, (uint1_t)x73₂, (uint64_t)x70₁, (uint64_t)x13₁) (* : uint64_t, uint1_t *) in
  expr_let x75 := Z.add_with_get_carry(2^64, (uint1_t)x74₂, (uint64_t)x71₁, (uint64_t)x12₁) (* : uint64_t, uint1_t *) in
  expr_let x76 := Z.add_get_carry(2^64, (uint64_t)x72₁, (uint64_t)x33₁) (* : uint64_t, uint1_t *) in
  expr_let x77 := Z.add_with_get_carry(2^64, (uint1_t)x76₂, (uint64_t)x73₁, (uint64_t)x14₁) (* : uint64_t, uint1_t *) in
  expr_let x78 := Z.add_with_get_carry(2^64, (uint1_t)x77₂, (uint64_t)x74₁, (uint64_t)x14₂) (* : uint64_t, uint1_t *) in
  expr_let x79 := Z.add_with_get_carry(2^64, (uint1_t)x78₂, (uint64_t)x75₁, (uint64_t)x13₂) (* : uint64_t, uint1_t *) in
  expr_let x80 := Z.add_get_carry(2^64, (uint64_t)x76₁, (uint64_t)x34₁) (* : uint64_t, uint1_t *) in
  expr_let x81 := Z.add_with_get_carry(2^64, (uint1_t)x80₂, (uint64_t)x77₁, (uint64_t)x17₁) (* : uint64_t, uint1_t *) in
  expr_let x82 := Z.add_with_get_carry(2^64, (uint1_t)x81₂, (uint64_t)x78₁, (uint64_t)x16₁) (* : uint64_t, uint1_t *) in
  expr_let x83 := Z.add_with_get_carry(2^64, (uint1_t)x82₂, (uint64_t)x79₁, (uint64_t)x15₁) (* : uint64_t, uint1_t *) in
  expr_let x84 := x43₂ + x46₂ + x49₂ + x52₂ + x55₂ + x59₂ + x63₂ + x67₂ + x71₂ + x75₂ + x79₂ + x83₂ (* : uint64_t *) in
  ((uint64_t)x80₁ :: (uint64_t)x81₁ :: (uint64_t)x82₁ :: (uint64_t)x83₁ :: [], (uint64_t)x84)
)
Error in converting fiat_curve25519_mul to C:
Final bounds check failed on second (carry) return value of Z.add_with_get_carry; expected an unsigned 1-bit number (relaxed to uint1), but found a uint64.

Fatal error: exception Failure("Synthesis failed")

I think

expr_let x35 := x19₂ << 64 (* : [0x0 ~> 0x250000000000000000] *) in

is wrong, and there may be something off about the arithmetic template?

andres-erbsen commented 4 years ago

yeah, seems so.

I'd like to check the status of a long-time feature request of mine: what is currently the nicest way to iterate on the definition of an arithmetic function and get to see it's compilation with some arguments? My bar for actually jumping in and trying to fix that is being able to do this iteration within one coq file, I think. it's entirely plausible that what I'm asking for is already there and I'm just not aware where.

JasonGross commented 4 years ago

https://github.com/mit-plv/fiat-crypto/blob/f423db33f2e6453c6a44ff1ab66a05c8843159c8/src/SlowPrimeSynthesisExamples.v#L1817-L1877

Feel free to leave a record of your iteration in a module at the top of src/SlowPrimeSynthesisExamples.v (this is what I've been doing every time I debug the pipeline). If you want me to get you set up and leave a small self-contained module at the top for you to start with, I can do that.

JasonGross commented 4 years ago

@andres-erbsen I've added an example module for you: https://github.com/mit-plv/fiat-crypto/blob/a510d075f82a066d5a9ee2b6c2212dd5c7fefd23/src/SlowPrimeSynthesisExamples.v#L36-L98

andres-erbsen commented 4 years ago

Thanks!

JasonGross commented 4 years ago

And here's a version with pretty much all the definitions inlined in a single file, for convenience: https://github.com/mit-plv/fiat-crypto/blob/3f58bfab8d6cb098f0a75d3738e03e939478356c/src/SlowPrimeSynthesisExamples.v#L100-L304

andres-erbsen commented 4 years ago

Thanks for setting this up for me. Our pipeline has come a long way since we first pushed stuff through it, I had an actually decent experience working on this. :tada:

I managed to get it to produce plausibly working code at https://github.com/mit-plv/fiat-crypto/pull/777/files#diff-74f074bf7846fa0e0ffa02be193422c5 . x19₂ << 64 was produced because nreductions was one and it produced a high half of a partial product (whose one input was 38 and other was the highest word) with weight s.

There are three things that I think could be done next:

[ ] check that the final carry output is always zero, or change carrying if it isn't. @jadephilipoom might be able to just do it from memory?
[ ] do one add-with-carry pass per pair of input limbs adding up three words per output index (old output word at that index, high half of previous partial product, low half of this partial product) and threading through two carry bits. If output words are not kept in registers (which I think they won't fit into), this reduces the number of spill-reloads of output words.
[ ] generalize this beyond curve25519
[ ] then, proofs

andres-erbsen commented 4 years ago

... and now I'm wondering whether the dropping of carry outputs in intermediate flattening steps (use of fst) is actually correct either

JasonGross commented 4 years ago

Nice!

Thanks for setting this up for me. Our pipeline has come a long way since we first pushed stuff through it, I had an actually decent experience working on this. 🎉

:-). I'm curious what the worst remaining pain-points were. My simulation suggests that it might be the 5--10 second turn-around time, or the fact that the intermediate stages are opaque, or the output format, but I could also see it being none of those.

andres-erbsen commented 4 years ago

yeah, these. I think I just wanted partial evaluation without bounds most of the time, so I used sed to drop bounds and types from the output. Several times I was wondering what the value of an intermediate expression was that was eventually inlined into something else and disappeared; some print-debugging facility (e g generating comments in the output) could be helpful.

JasonGross commented 4 years ago

Mostly skipping bounds is pretty easy, just replace Some boundsn with None.

andres-erbsen commented 4 years ago

Yes, but with that most of the output is still range information.

JasonGross commented 4 years ago

x19₂ << 64 was produced because nreductions was one and it produced a high half of a partial product (whose one input was 38 and other was the highest word) with weight s.

I'm trying to figure out how to generalize this to other primes. The reasoning for it being 1 is: https://github.com/mit-plv/fiat-crypto/blob/d6acb3c966fe4e895a7f9171f3033647dc9a6dfa/src/PushButtonSynthesis/SaturatedSolinas.v#L85-L89 This will happen in general with any prime whose c has just a single term which is not a multiple of a positive power of 2. Is the correct generalization "we need twice as many reductions" or "we need one extra reduction at the very end to deal with a possible lingering high half" or something else?

JasonGross commented 4 years ago

check that the final carry output is always zero, or change carrying if it isn't.

You can see what bounds analysis picks up on for this by replacing None (* Should be: Some r[0~>0]%zrange, but bounds analysis is not good enough *) with Some r[0~>0]%zrange, and looking at the error message to see the computed bounds (right now, for the standard satmul with nreductions := 2, it computes the range 0 ~> 0xc, and for your adjusted satmul that drops intermediate carries, it computes the range 0 ~> 1; idk if this is a failure of the algorithm or a failure of bounds analysis (possibly related to https://github.com/mit-plv/fiat-crypto/issues/326#issuecomment-405033088))

JasonGross commented 4 years ago

By the way, the code for 2^32-1 on 32-bit machines might be instructive here:

Error (
 Computed bounds (Some [Some [0x0 ~> 0xffffffff]], (Some [0x0 ~> 0x1])) are not tight enough (expected bounds not looser than (Some [Some [0x0 ~> 0xffffffff]], (Some [0x0 ~> 0x0]))).
 The bounds [0x0 ~> 0x1] are looser than the expected bounds [0x0 ~> 0x0]
 When doing bounds analysis on the syntax tree:
 (λ x1 x2,
   expr_let x3 := Z.mul_split(([0x100000000 ~> 0x100000000])2^32, (uint32_t)x1[0], (uint32_t)x2[0]) (* : uint32_t, [0x0 ~> 0xfffffffe] *) in
   expr_let x4 := Z.add_get_carry(([0x100000000 ~> 0x100000000])2^32, (uint32_t)x3₁, ([0x0 ~> 0xfffffffe])x3₂) (* : uint32_t, uint1_t *) in
   ((uint32_t)x4₁ :: [], (uint1_t)x4₂)
 )

 which can be pretty-printed as:
 /*
  * Input Bounds:
  *   arg1: [[0x0 ~> 0xffffffff]]
  *   arg2: [[0x0 ~> 0xffffffff]]
  * Output Bounds:
  *   out1: None
  *   out2: None
  */
 void f(uint32_t out1[1], uint1* out2, const uint32_t arg1[1], const uint32_t arg2[1]) {
   uint32_t x1;
   uint32_t x2;
   mulx_u32(&x1, &x2, (arg1[0]), (arg2[0]));
   uint32_t x3;
   uint1 x4;
   addcarryx_u32(&x3, &x4, 0x0, x1, x2);
   out1[0] = x3;
   *out2 = x4;
 }

 with input bounds ((Some [Some [0x0 ~> 0xffffffff]]), Some [Some [0x0 ~> 0xffffffff]]).

)

Note here that even if we add the carry back at the bottom, we still run into the same the issue as in https://github.com/mit-plv/fiat-crypto/issues/326#issuecomment-404850162 : The bounds analysis machinery cannot see that the carry is 1 iff the non-carry part is <= 2^32-2

mit-plv / fiat-crypto

saturated 64-bit curve25519 tracker issue #776