nim-lang / RFCs

A repository for your Nim proposals.
136 stars 23 forks source link

lent/var/openArray inside objects/containers #178

Closed Araq closed 4 years ago

Araq commented 5 years ago

Nim currently has 3 types that are second-class citizens due to memory safety reasons: var T, lent T, openArray[T]. These can only be used in parameter lists as the underlying implementation is that of a borrowed pointer that must not outlive the callee. See https://nim-lang.org/docs/manual.html#procedures-var-return-type for details about how this is done for var T.

This proposal is about extending the capabilities of these types so that they are allowed inside other types like objects or tuples. A type containing a borrowed pointer (var T / lent T / openArray) becomes borrowed too and a value of such a type must not outlive the location it was borrowed from. This is enforced statically. How exactly this static enforcement will work remains to be seen, but the rules of the first implementation will be:

  1. A borrowed type cannot be used for (local or global or threadlocal) variables or consts. It must be used as a parameter or return type.
  2. If used as a return type, the location must borrow from the first parameter that is passed to the proc. This can later on extended to use a from declaration.

In other words, exactly the same rules apply that apply for https://nim-lang.org/docs/manual.html#procedures-var-return-type.

ToDo: Show what idioms this extension allows for. This RFC is in desparate need of practical examples.

zah commented 5 years ago

I'll give two examples from our codebase (out of many more):

If we have to generalise, let's see what's common between these examples - they both represent types that hide an underlying mutable pointer behind a safer high-level API. The features discussed here will enable you to define and instantiate such types on the stack in a way that will allow the compiler to ensure that memory-safety is preserved (something that our existing solutions don't solve).

The other major new capability will be using these types as return values. Iterators like strutils.split will now return openarrays instead of strings, which will make a whole lot of code much more efficient (less memory allocations, less copying).

And finally, C++ programmers will have less reasons to complain when they discover that Nim doesn't allow them to use the following familiar pattern for avoiding copies when using container accessors:

const SomeType& foo = someVector[i];

Work-arounds with templates or unsafe usages of addr will be reduced.

zah commented 5 years ago

A borrowed type cannot be used for (local or global or threadlocal) variables or consts. It must be used as a parameter or return type.

There is no need to require that borrowed types must not be used for local variables. In fact, most of the benefits will be gained when this becomes possible. Here is how you can lift this restriction:

There are only two ways a borrowed value can be created:

  1. It can be received as an input parameter (please note that besides var and openarray this includes the regular read-only non-ref parameters).

  2. It can be obtained as a return value from a function.

Let's focus on option 2. Consider code like the following:

proc someProc(borrowedFoo: Foo) =
  let borrowedBarSlice = getBarSlice(borrowedFoo)
  useSlice(borrowedBarSlice)

We can safely rewrite it like this:

proc someProc_continuation(barSlice: openarray[Bar]) =
  useSlice(barSlice)

proc someProc(borrowedFoo: Foo) =
  someProc_continuation(getBarSlice(borrowedFoo))

In other words, we can treat everything in the basic block that follows the borrowing of a return value as a new function taking the borrowed value as an extra input.


Where the danger lies is in invalidating the borrowed values. Consider the following classic example:

proc dangerous(x: var seq[Foo]) =
 let elementReference: lent Foo = x[10]
 x.setLen 0 # The memory backing the foo reference is destroyed
            # Let's assume that this will be devastating for the next line.
 echo elementReference.someField

Even this example can be rewritten to have the same hazards in the current version of Nim:

proc dangerous_continuation(elementReference: Foo, x: var seq[Foo]) =
  x.setLen 0 # The memory backing the foo reference is destroyed.
             # Let's assume that this will be devastating for the next line.
  echo elementReference.someField

proc dangerous(x: var seq[Foo]) =
  dangerous_continuation(x[10], x)   

This goes to show that our continuation transformation goes in both directions - any unsafe code can be expressed with or without the local variable restriction. The solution here is to introduce a more powerful borrow checker that will enforce that every mutable alias to a particular "backing resource" is unique. You can either have one mutable alias or multiple read-only ones. The compiler write tracking can help you enforce this on a more granular level. The notions of "backing resource" and a "pointer to it" can be generalized to allow the user to introduce custom user-defined types following the same rules.

Araq commented 5 years ago

Even this example can be rewritten to have the same hazards in the current version of Nim

This problem is actually covered in https://nim-lang.org/docs/manual_experimental.html#aliasing-restrictions-in-parameter-passing and can be solved differently: If the compiler cannot prove there is no dangerous aliasing involved, it can pass a real copy to the Foo value parameter.

mratsim commented 5 years ago

This would also allow to implement C++/D ranges without resorting to raw pointers and ensuring that they don't escape.

We could rewrite a couple of algorithms in strutils and sequtils in way that is ergonomic (functional-like), composable but without the overhead of intermediate allocation and without the need of the It templates like mapIt.

In the Nimbus codebase, everywhere we use this code https://github.com/status-im/nim-stew/blob/master/stew/ranges/memranges.nim was because we couldn't use openarray as values.

Obviously there are all the existing RFCs that must not be forgotten:

zah commented 5 years ago

This problem is actually covered in https://nim-lang.org/docs/manual_experimental.html#aliasing-restrictions-in-parameter-passing and can be solved differently: If the compiler cannot prove there is no dangerous aliasing involved, it can pass a real copy to the Foo value parameter.

Good to see that spec, but my example was attempting to demonstrate that there is equivalence between the problems. To enforce the referenced spec and to lift the "no local variables" restriction, you have to solve the same problem. The implicit copying work-around can be applied in both cases and changing my example to use var creates code that must be outlawed in both cases:

proc dangerous_continuation(elementReference: var Foo, x: var seq[Foo]) =
  x.setLen 0 # The memory backing the foo reference is destroyed.
             # Let's assume that this will be devastating for the next line.
  echo elementReference.someField

proc dangerous(x: var seq[Foo]) =
  dangerous_continuation(x[10], x)   
RSDuck commented 4 years ago

I have another example. A while ago I had a discussion with @krux02 about the interface code he wrote, which I later masacred to use gc refs, to make it safe. This change would make this uncessary (maybe not always), because it can be ensured that the interface object doesn't outlive the implementing object.

cooldome commented 4 years ago

I have one more. What this RFC should allow is for closures to capture lent T / var T variables by borrowing without copy. The same way as objects will have borrowed fields, the closures would have borrowed fields too and and it makes them borrowed as well.

Currently, lent T / var T can't be captured and users complain about lack of this functionality.

mratsim commented 4 years ago

Generic Big Integer suitable for embedded

A practical example of something that is not possible today:

I want to write a big integer library

Typical usecase: as a backend of a cryptographic library for example TLS, Blockchain or Zero-Knowledge Proofs. I expect zero-knowledge proofs to become as important as TLS in the future and there are only few libraries and very few languages are suitable to write those in.

The resulting BigInt type would be

type BigInt = object
  limbs: openarray[uint64]

And for cryptography/modular bigint the type would be

type BigInt = object
  bitSize: uint64
  limbs: openarray[uint64]

What can we do currently

Currently we can use static, proof-of-concept in Constantine: https://github.com/mratsim/constantine/blob/ff8b22e1d179844a8b73df252b19c2b7449210e1/constantine/bigints.nim#L42-L74

type Word* = uint64

const WordBitSize* = sizeof(Word) * 8 - 1
  ## Limbs are using only 63-bit to avoid carries

func wordsRequired(bits: int): int {.compileTime.}=
  (bits + WordBitSize - 1) div WordBitSize

type
  BigInt*[bits: static int] = object
    ## Fixed-precision big integer
    # The word size is 63-bit to avoid carries
    # "Limb-endianess" is little-endian (least significant limb at BigInt.limbs[0])
    limbs*: array[bits.wordsRequired, Word]

However due to the use of static, each operations like add, == will create duplicated code for each bits value in use. There is a tradeoff here. However is very probably that having the following data structure

type BigInt = object
  bitSize: uint64
  limbs: openarray[uint64]

which avoids code monomorphization at the expense of 1 (bitSize) or 2 words (openarray.len) will lead to less total size (library size + runtime size).

Side-note

For this use-case:

mratsim commented 4 years ago

Safe and efficient Range-like transformations

Range-like transformations similar to D-range or C++20-ranges

In particular this enables functional-style (map, filter, fold, reduce, take, drop, ...) including on infinite ranges without:

Closely related

Use-case

Note that with such ranges, we wouldn't have to build a state machine at compile-time like Rust or zero-functional. You would be able to create transformations at runtime as well, for example for a parser constructed from a json or protobuf definition.

Furthermore, we completely dodge the need of loop fusion or deforestation that seem so prevalent in functional languages because there is no intermediate values by design.

mratsim commented 4 years ago

Making strutils fast (and still composable)

Currently strutils is a huge performance trap. Nim beginners are using it, allocating all over the place thanks to the easy but allocation-heavy design. And then they come in the forum saying that Nim is slower than Python, someone says "you shouldn't do that, allocation is bad, etc do it this way or use npeg".

We need a way to easily compose string operations and allocate only when necessary. Having openarray as values would probably enable a library with a strutils like API but much higher performance so that string processing in Nim is not such a performance trap.

Araq commented 4 years ago

That is exactly the opposite of what I asked for, sorry. I asked for example that would actually really work with the outlined borrowing rules. Hint: Borrowing means you have fewer owners, not no owners whatsoever, so I cannot see your BigInt example working out.

mratsim commented 4 years ago

I don't get what you mean. There is an owner, the proc that allocates the backend buffers.

For example in pseudo-code:

type BigInt = object
  # Internal type -> openarray to guarantee lifetime
  bitSize: uint64
  limbs: openarray[uint64]

type Curve* = enum
  Curve25519
  BN254
  BLS12_384
  Secp256k1

func getNumLimbs(C: static Curve): int =
  discard "map curve -> # of bigint limbs"

func getSignatureSize(C: static Curve): int =
  discard "map curve -> signature size"

type FiniteField[N: static int, C: static Curve] = object
  ## Extension Field: all arithmetic operations are done modulo a constant prime / polynomial extension
  # Internal type -> openarray to guarantee lifetime
  coefs: array[N, BigInt] # Polynomial coordinates in the extension field

type ECCPointG2[C: static Curve] = object
  ## Elliptic Curve coordinates on G2 curve
  # Internal type -> openarray to guarantee lifetime
  x: FiniteField[2, C]
  y: FiniteField[2, C]

type Signature*[C: static Curve] = object
  ## Public type
  raw: array[getSignatureSize(C), byte]

proc hashToEllipticCurveG2(point: var ECCPointG2, msg: openarray[byte]) =
  discard "computation"

proc compressSignature(point: ECCPointG2): Signature =
  discard "compression"

proc signMsg(msg: openarray[byte], Curve: static enum): Signature[Curve] =
  var g2x_fp2x, g2x_fp2y, g2y_fp2x, g2y_fp2y: array[getNumLimbs(C), uint64]
  var point = ECCPointG2(
    x: FiniteField[2, C](coord: [g2x_fp2x, g2x_fp2y]),
    y: FiniteField[2, C](coord: [g2y_fp2x, g2y_fp2y])
  )
  # This proc is the owner of all computation buffers
  # We ensure that all computations are done without allocation
  # This includes bigints, modular bigints and elliptic curve maths
  # and that the temporary buffers don't escape
  point.hashToEllipticCurveG2(msg)
  result = point.compressSignature()
Araq commented 4 years ago

Ok, thank you. This example is precise enough that I can see my somewhat simplistic borrow rules would work for you.

mratsim commented 4 years ago

Other simple things that would be allowed having first-class openarrays:

For instance I would benefit from such for some cryptographic APIs:

func aggregateVerify*[T: char or byte](
        publicKeys: openarray[PublicKey],
        messages: openarray[openarray[T]],
        signature: Signature): bool =
  ## Check an aggregated signature over several (publickey, message) pairs
  ## returns `true` if the signature is valid, `false` otherwise.
  discard

func aggregateVerify*[T: char or byte](
        publicKey_msg_pairs: openarray[tuple[publicKey: PublicKey, message: openarray[T]]],
        signature: Signature): bool =
  ## Check an aggregated signature over several (publickey, message) pairs
  ## returns `true` if the signature is valid, `false` otherwise.
  discard
Araq commented 4 years ago

This RFC has been implemented. View-types have a reasonably precise specification. View types are far more flexible than what was outlined in this RFC. The implementation is reasonably stable, they are used successfully in the wild, see for example https://forum.nim-lang.org/t/6968#43685

The found memory-safety holes (that are affecting view types) are covered by the section "Aliasing restrictions in parameter passing" here. We hope to have an implementation for this soon, but this should be covered by a milestone of its own, as it's an existing, well-known problem for an Ada-like language. (With an elegant solution too, which I didn't outline in the experimental manual.)

snej commented 4 years ago

Is there documentation of the view types as implemented?

alaviss commented 4 years ago

Is there documentation of the view types as implemented?

It's in the experimental manual