RFC: Signed Fixed-width Integer Support

pnwamk commented 10 months ago

Proposal

Lean currently supports unsigned fixed-width integers (UInt8, UInt16, UInt23, UInt64).

This proposal is regarding similar support for signed fixed-width integers that provide similar efficiency benefits to the UIntN types (e.g., native C representations).

Design Points for discussion

Semantics for overflow and avoiding UB.
- Division by zero returning zero seems an obvious choice here like for existing UIntN types.
- Signed arithmetic overflow in C/C++ is undefined behavior. To avoid this, an implementation could i. simply behave more like Fin and BitVec in Lean4 today and provide predictable wraparound behavior on overflow (there is a slight bias for these semantics in this RFC at the moment), or ii. choose to detect and panic on such occurrences (perhaps similar to Rust in debug mode?)
C representation
- Using the expected intn_t types could make for the least surprising set of C definitions/FFI APIs
- A case for reusing uintn_t types could also be made as they are already used by the compiler for some types
Where would IntN definitions ideally live in the short and long term?
- a GitHub repo dedicated solely to IntN types or similar?
- https://github.com/leanprover/std4?
- https://github.com/leanprover/lean4?

Approaches

Approach 1: `UIntN` Subtypes

A UIntN is used to represent each IntN type, e.g. here is Int8:

structure Int8 where
  ofUInt8 ::
  toUInt8 : UInt8
  deriving DecidableEq, Inhabited, Hashable

Efficient Representation

This approach already provides an efficient representation in compiled code with no compiler extensions by being defined directly in terms of UInt8. I.e., all compiled definitions operating on these Int8s will use uint8_t directly. Note that this does mean that FFI code will need to manually track which generated C definitions are using uint8_t for Int8 and which are using uint8_t for some other type (e.g., UInt8, Bool, etc).

Operators

Arithmetic operations which are "sign-insensitive" can be defined directly in terms of their corresponding UInt8 operation, e.g. Int8.add:

def Int8.add (a b : Int8) : Int8 := ⟨UInt8.add a.toUInt8 b.toUInt8⟩

The remaining few sign-sensitive operations can be given the appropriate implementation in Lean (and backed with a more efficient C representation where needed), e.g., Int8.div:

@[extern "lean_int8_div"]
def Int8.div (a b : Int8) : Int8 :=
  match a.isNeg, b.isNeg with
  | true,  true => ⟨(-a).toUInt8 / (-b).toUInt8⟩
  | true,  false => -⟨(-a).toUInt8 / b.toUInt8⟩
  | false, true => -⟨a.toUInt8 / (-b).toUInt8⟩
  | false, false => ⟨a.toUInt8 / b.toUInt8⟩
instance : Div Int8 := ⟨Int8.div⟩

and the corresponding C definition lean_int8_div:

extern "C" LEAN_EXPORT uint8_t lean_int8_div(uint8_t a, uint8_t b) {
  if ((a == 0) || ((a == INT8_MIN) && (b == ((uint8_t)((int8_t)-1))))) {
    return a;
  } else {
    return ((int8_t) a) / ((int8_t) b);
  }
}

Here's an initial draft for Int8 for feedback here if we want to use this approach:

Potential pros

requires no compiler changes required for efficient representation
requires minimal new Lean code since many Int8 ops simply use the same UInt8 op
requires minimal new C code since UInt8 already provides many appropriate C definitions that are re-used

Potential cons

without some compiler additions, generated C functions operating on IntN types will all be in terms of uintn_t types and not intn_t types. The casts between a uintn_t and intn_t type are free and semantically correct. The book keeping to track which uintn_t means what in the FFI/C layer, however, could be painful and a common Lean4/FFI footgun.

Approach 2: `Fin` Subtype

A Fin is used to represent each IntN type, e.g. here is Int8:

/-- The size of type `Int8`, that is, `2^8 = 256`. -/
abbrev Int8.size : Nat := 256

/--
The type of signed 8-bit two's compliment integers.
-/
structure Int8 where
  val : Fin Int8.size

Efficient Representation

This approach would require a compiler extension like what exists for UIntN types to use the expected intn_t C type. This would obviously mean more up front work, but the resulting compiled C/FFI code would be in terms of the expected signed integer type, which seems likely to be the least surprising choice for a user to encounter.

Operators

Arithmetic operations which are "sign-insensitive" can be defined directly in terms of their corresponding Fin operation in Lean and given the expected efficient representations in C, e.g. Int8.add:

def Int8.add (a b : Int8) : Int8 := ⟨a.val + b.val⟩

The remaining few sign-sensitive operations can be given a more complex definition when appropriate in Lean (along with a more efficient C implementation where needed), e.g., Int8.div:

@[extern "lean_int8_div"]
def Int8.div (a b : Int8) : Int8 :=
  match a.isNeg, b.isNeg with
  | true,  true => ⟨(-a).val / (-b).val⟩
  | true,  false => -⟨(-a).val / b.val⟩
  | false, true => -⟨a.val / (-b).val⟩
  | false, false => ⟨a.val / b.val⟩
instance : Div Int8 := ⟨Int8.div⟩

and the corresponding C definition lean_int8_div:

extern "C" LEAN_EXPORT int8_t lean_int8_div(int8_t a, int8_t b) {
  if ((a == 0) || ((a == INT8_MIN) && (b == -1))) {
    return a;
  } else {
    return a / b;
  }
}

Here's an initial draft for Int8 using Fin for feedback here if we want to use this approach:

Potential pros

has the least surprising C definitions/FFI
defining signed fixed width integers in terms of Fin may be "more direct" or "more natural" than defining them in terms of unsigned fixed-width integers (debatable of course!)

Potential cons

requires some compiler additions to get the expected efficient C representation, likely similar in scope to what exists for UIntN

Approach 3: `Int` Subtypes

Each IntN type could also be defined as a subtype of Int similar to the following:

/--
`IntN n` is an integer `i` with the constraint that `-2^(n-1) ≤ i < 2^(n-1) - 1`.
-/
structure IntN (n : Nat)where
  /-- If `i : IntN n`, then `i.val : ℤ` is the described number. It can also be
  written as `i.1` or just `i` when the target type is known. -/
  val  : Int
  /-- If `i : IntN n`, then `i.2` is a proof that `-2^(n-1) ≥ i.1`. -/
  isGe : LE.le (2^(n-1)) val
  /-- If `i : Fin2C n`, then `i.3` is a proof that `i.1 < (n-1)`. -/
  isLt : LT.lt val (2^(n-1)-1)

Zulip discussion steered me away from this and towards the aforementioned UIntN approach, so less experimenting has been done with this approach.

Potential pros

generated C interfaces are in terms of the signed integer types, which could be more intuitive/less error prone when writing FFI-involved code.

Potential cons

Requires more Lean and C definitions per type than UIntN approach.
Requires compiler extension(s) to achieve a native representation.

Community Feedback

The lack of signed fixed-width integer types has come up in discussion with Lean FRO members and has been discussed some in the lean4 dev community. It was also noted as a pain point in the community when working with external code during Lean Together 2024.

Thank you to those who have helped in discussing this topic so far.

Impact

Add :+1: to issues you consider important. If others benefit from the changes in this proposal being added, please ask them to add :+1: to it.

Related Work

The following also involve supporting signed fixed-width integer types:

Std.Data.BitVec: supports similar functionality but for arbitrary bitwidths and wraps a Nat
SciLean.Data.Int64: Int64 impl using the same proposed design of wrapping an unsigned int (USize instead of than UIntN). The proposed Int8 impl is similar with a few slight differences (e.g., fewer bitwise operations, leverages a few C definitions for efficiency, etc)
Galois.Data.Bitvec: somewhat outdated but still a helpful reference, similar to Std.Data.BitVec

fgdorais commented 10 months ago

Are there considerations to support overflow detection?

pnwamk commented 10 months ago

Are there considerations to support overflow detection?

I'm not aware of specific considerations here as of yet. The tinkering so far has focused on efficiency, avoiding signed integer-related UB we may inherit from C, and just general correctness.

Does anyone know: is there any prior art in this space for the UIntN types? (If no, would such considerations equally apply to both UIntN and IntN types? And, do we want such considerations to be a part of an initial IntN addition/PR, or a part of a follow-up feature/enhancement?)

fgdorais commented 10 months ago

The usual semantics for unsigned types are to wrap around, without overflow. (However, add with carry/subtract with borrow/extended multiplication are often supported.) The overflow semantics for signed types are quite different; they are well supported in LLVM and as builtins in major C compilers. So support is available but might require compiler support.

pnwamk commented 10 months ago

The usual semantics for unsigned types are to wrap around, without overflow. (However, add with carry/subtract with borrow/extended multiplication are often supported.) The overflow semantics for signed types are quite different; they are well supported in LLVM and as builtins in major C compilers. So support is available but might require compiler support.

The proposed Int8 implementation linked in the RFC uses the corresponding unsigned operations where possible. This gives them predictable wrap around semantics and avoids any UB from signed integer overflow, e.g. (127 + 1) : Int8 = -128.

For the sign-sensitive operations, it specifically checks for and avoids the problematic cases which would trigger signed integer overflow (e.g., div and mod explicitly check the few cases where that can happen, so you get predictable albeit perhaps initially odd results like(-128 / -1) : Int8 = -128 instead of UB).

Perhaps I misunderstood what you meant by "support overflow detection". The proposed design explicitly avoids signed integer overflow in the C sense and values will wrap around, but it doesn't attempt to dynamically detect specific cases at runtime in order to warn/error for the user.

fgdorais commented 10 months ago

The runtime detection (with catch support?) is what I'm asking about.

Just to be clear: I'm just asking whether this is currently considered, I don't think this is important for basic signed integer support.

alexkeizer commented 10 months ago

This seems closely related to the Std.BitVec type, which implements a signless approach to bitvectors. For example, it has Std.BitVec.udiv for unsigned division and Std.BitVec.sdiv for signed division. Ultimately, though, the BitVec is just a wrapper around a Nat, just like the proposed IntN being a wrapper around UIntN.

Things like overflow detection and such might make more sense to implement on BitVec, and are in fact on my roadmap for them already. The proposed IntN types seem more suited for when you want to squeeze out every last bit of performance, and thus probably don't want to pay the cost of overflow detection.

pnwamk commented 10 months ago

This seems closely related to the Std.BitVec type

Yes, that code was helpful to reference =) I just added a Related Work section to the RFC I should have included initially acknowledging the related implementations I'm aware of.

Things like overflow detection and such might make more sense to implement on BitVec ... the proposed IntN types seem more suited for when you want to squeeze out every last bit of performance

I am increasingly curious if Lean could reasonably help a user avoid/detect overflow while using these sorts of "primitive" UIntN and IntN types... but I think that may be a topic to dive into separately (at least based on the fact that the UIntN types today also don't feature this). Happy to hear disagreement here though =)

fgdorais commented 10 months ago

C/C++ make signed integer overflow undefined but efficient tools are generally available.

LLVM support: https://llvm.org/docs/LangRef.html#llvm-sadd-with-overflow-intrinsics Clang support: https://clang.llvm.org/docs/LanguageExtensions.html#checked-arithmetic-builtins GCC support: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html

A generic operation would have to trap overflow by checking the inputs.

hargoniX commented 5 days ago

This is now implemented through #5790, #5885 and #5961. Additionally UIntX was refactored to use BitVec instead in #5323 so IntX can now make use of the BitVec primitives that we already put in a lot of work over the course of this year.

To the best of our knowledge we have addressed all UB concerns by both mimicking semantics of BitVec in C and enabling the -fwrapv option in the compiler such that overflow etc. is now defined behavior.

Additional things such as proper proof APIs for both UIntX and IntX as well as things like catching overflows are future work items that are not part of this RFC afaict so I'm closing this as completed.

omentic commented 4 days ago

Awesome!! This is very useful, thank you :smiley:

leanprover / lean4