ocaml / ocaml

The core OCaml system: compilers, runtime system, base libraries
https://ocaml.org
Other
5.19k stars 1.06k forks source link

Immutable arrays #13097

Open OlivierNicole opened 1 month ago

OlivierNicole commented 1 month ago

This PR proposes to add immutable arrays to OCaml. This is a feature originally by @antalsz and has been in use for a bit more than six months at Jane Street, in their fork of the compiler with extensions (below dubbed flambda-backend). Jane Street has asked for help from Tarides to port the feature onto OCaml 5.x.

Motivation

Immutable arrays represent an improvement over regular arrays in several situations:

Benefits on trunk

As an example use case, @dbuenzli’s uucp and uunf libraries store static Unicode data in some big arrays (among other data structures), themselves comprising sub-arrays. It has proved difficult in the past to compile this efficiently, and some array initialisation remains at runtime, since the optimizer is unable to verify that all the data is constant. Using immutable arrays, the optimizer can be extended to allocate it all statically in a read-only data section of the executable.

In addition, the presence of a standard module:

  1. publicises the possibility of using immutable arrays, encouraging more people to consider using them to write safe interfaces;
  2. avoids having several incompatible implementations of the same feature;
  3. leaves open the possibility to implement more optimisations dedicated to immutable arrays in the future.

Benefits on flambda-backend

In addition to the benefits above, immutable arrays work better with locality modes in use at Jane Street: mutable data structures cannot be safely allocated on the stack as it would allow local variables to escape, but immutable arrays and records can.

Design

The Lambda intermediate language already has immutable arrays; therefore, there is little to add. A mutability flag is added to the AST nodes Pexp_array and Texp_array, and a new predefined type 'a iarray is added.

Parser support is added for the form [: e1; e2; …; en :] in array literals and patterns:

let a : int iarray = [: 1;2;3;4;5 :]

let () =
  let open Iarray in
  print_int a.:(2)   (* prints 3 *)

let _ =
  match a with
  | [: :]           -> "empty"
  | [: 1;2;3;4;5 :] -> "1--5"
  | _               -> "who knows?"

A new module Iarray — and the corresponding IarrayLabels — is added to the Stdlib, containing all the functions of Array, excluding the mutating functions. The memory representation of immutable arrays is the same as that of mutable arrays, so Iarray reuses the runtime primitives for regular arrays.

Documentation is yet to be written: we thought it made more sense to first see what are the opinions about the design.

nojb commented 1 month ago

Documentation is yet to be written: we thought it made more sense to first see what are the opinions about the design.

My 2c after a quick look are below. Re-reading it I realize it sounds rather negative, but it is not my intention, please bear that in mind :)

The PR seems way too invasive for such a fringe freature. For example, there is a new syntax for immutable arrays, which we all know is a "big change" (eg we have no such syntax for floatarray). Also, we already have array and floatarray and the fact that they share most operations does not scale very well. Are we going to add a third "array" variant? Shouldn't we add immutable float arrays as well? How is that going to scale?

Also, why not do

module Iarray : sig
  (* signature without "set" functions *)
end = Array

and call it a day? I fail to see why this needs special support in the typechecker, pattern-matching engine, middle-end, compiler primitives, etc. Sure, once in a while having immutable arrays would be nice, but is it that common enough that it justifies adding this much code to the compiler?

  • publicises the possibility of using immutable arrays, encouraging more people to consider using them to write safe interfaces;

I think very few libraries would actually need immutable arrays as part of their public API. Do you have concrete examples?

  • avoids having several incompatible implementations of the same feature;

This is only a problem if libraries start using this data structure in their public APIs. Do you know of any doing so?

  • leaves open the possibility to implement more optimisations dedicated to immutable arrays in the future.

I don't think this alone would be reason enough to merge this PR.

Just my view at the moment, would love to hear other opinions! Cheers.

goldfirere commented 1 month ago

To me, there are two separable concerns here, which we may want to debate separately.

  1. Do we want immutable arrays as a first-class concept in the compiler? This means that the middle ends are aware of the immutability, and we expose primitives for conversion between mutable arrays and immutable ones. Furthermore, it means that we can say that iarray is covariant. (This is a big reason we can't just slap a more restrictive signature on top of Array.) To me, the argument here is not so hard: immutable arrays have some advantages of lists (immutability and covariance) and some advantages of arrays (tighter memory footprint and constant-time random access). I find achieving this combination to be somewhat convincing.

  2. Do we want first-class syntax for immutable arrays? This is harder, because it's permanently(*) consuming a finite resource (the number of one-character widgets we can put inside of brackets for different ordered collections). We explored the possibility of not doing this within Jane Street, and the pushback was around the usefulness of pattern-matching to extract array contents, something impossible to do without dedicated syntax.

() In anticipation of the debate on this PR (I'm working with @OlivierNicole on the Jane Street side), I've been worried about this consumption of syntax. But in the end, I think it's OK. If we want to revisit this decision later, it seems easy enough to add the ability to bind different array-like syntaxes to different implementations. For example, we could have a `module [: .. :] = struct ( the definitions needed to make and match against arrays ) end; when that module is in scope, the use of the syntax[: 1; 2; 3 :]` uses the definitions in that module. This is very much like the existing ability to define custom array-access operators. I am not* proposing such an addition now. Instead, I'm using this idea as a potential way, in the future, to claw back a decision to consume a little bit of syntax, if we so choose.

I do think the comments such as those from @nojb are very helpful here. We like this design within Jane Street, where I can find 158 jbuild files (our equivalent of dune files) that indicate a dependency on our iarray library, which is how we have exposed the functions in the Iarray stdlib module. The reason we've chosen to try to upstream this feature is that we'd like to start using it in our open-source library APIs. And this is all within the last 8 months or so. But the design constraints and desires of the open-source OCaml community are different than those within Jane Street, and so we welcome the debate about what the best overall design is.

OlivierNicole commented 1 month ago

The PR seems way too invasive for such a fringe freature. For example, there is a new syntax for immutable arrays, which we all know is a "big change" (eg we have no such syntax for floatarray).

I understand the concern for unwarranted big changes. Just to give some sense of the contents of the patch here, the few hundreds of lines of the change (I’m not counting bootstrap commits and added tests) consist of the new Iarray module—essentially copied from Array—, syntax-related boilerplate changes, and straightforward modifications to typechecking and Lambda generation to account for the notion of immutable arrays. In addition, it seems to me that the changes would incur close to no maintenance burden.

Also, we already have array and floatarray and the fact that they share most operations does not scale very well. Are we going to add a third "array" variant? Shouldn't we add immutable float arrays as well? How is that going to scale?

Given that arrays and immutable arrays have the same runtime representation, support for immutable arrays does not incur any new runtime primitives (except for unsafe conversion between the two). It’s true that immutable arrays have their own type, which is part of the feature, and thus their have their own module. But unlike float arrays, immutable array support happens at the type level.

Shouldn't we add immutable float arrays as well?

It could be argued that a new module Float.Iarray should match Float.Array (it is not the case in this PR). If it is deemed useful enough to add, it would result in writing some additional OCaml functions, essentially copied from Float.Array, using the same primitives as Float.Array.

goldfirere commented 1 month ago

For what it's worth, we've been tempted to write floatiarray, but just haven't gotten around to it. I agree that it make sense here.

johnwhitington commented 1 month ago

It's a small thing, but [::] is potentially confusing. It looks like a two pieces of list syntax, one inside the other. If gramatically possible, could it be changed to the two-token sequence [: :]?

OlivierNicole commented 1 month ago

In fact, that’s what it is, it just happens to be valid to leave no space between the two tokens. I agree that it is less confusing with a space.

fpottier commented 3 weeks ago

I believe that offering immutable arrays would be a useful feature. Documenting that an array is immutable helps clarify the code (and can help the compiler optimize it).

In fact, in Osiris, our ongoing program verification project for OCaml, we do intend to (eventually) distinguish immutable and mutable arrays. If this distinction existed in OCaml already, then we would not need to create it at the level of Osiris.

Regarding immutable array literals, I am less interested. Array literals are useful only when the length and content of the array are statically known, which in my experience is fairly rare, but perhaps this depends on the application area. It seems more common to me to 1- allocate a mutable array, 2- initialize it, 3- freeze it.

OlivierNicole commented 3 weeks ago

It is true that the syntax is technically not necessary to create immutable arrays, given our unsafe primitive %array_to_iarray. I see three drawbacks to not having a syntax:

fpottier commented 3 weeks ago
  • The unsafe primitive %array_to_iarray is currently not exposed

I imagine that this primitive operation will have to be exposed, in the same way that Bytes.unsafe_to_string is currently exposed. Otherwise, the construction of an immutable array would be quite inefficient; one would have to allocate a mutable array, initialize it, then copy it into a fresh immutable array.

OlivierNicole commented 2 weeks ago

To clarify, what I meant was that with immutable array literals and functions like Iarray.init, immutable arrays can be created without ever needing an unsafe conversion primitive, which is a benefit from the point of view of safety (from my point of view, the less one needs to use primitives that can crash the program, the better).

gasche commented 3 days ago

What are use-cases of immutable arrays in practice, can we discuss some examples? @OlivierNicole mentioned that this could be useful for constant tables, as used in Unicode libraries. What are other applications that would be enticing? (@fpottier, why are you interested in adding specific support for immutable arrays in Osiris?)

You mention exposing immutable arrays in APIs to write safe interfaces. I suppose that the ability to do this depends quite a bit on the conversion mechanisms that exist between mutable and immutable arrays. Can you discuss examples of using immutable arrays to write safe interfaces, and detail the conversion mechanisms?

Two counterpoints could be:

  1. Regarding the analogy with string/bytes: string is a datatype that many application domains are happy to use in an immutable way -- because they are willing to copy a lot of data around. Are arrays really similar?

  2. Regarding safe interfaces: one scenario I can easily picture is that I own a mutable array, I want to pass it to some auxiliary function, but I don't expect that function to mutate the array itself. I would be happy to have the guarantee that this function treats the array as immutable. But this requires being able to view a mutable array as an immutable array, which corresponds to a type of "arrays that I am not allowed to mutate (but someone else may)", rather that a type of "arrays that no one will ever change", and those are two different notions. Which one are you proposing? (Or maybe the distinction does not occur in the JS setting due to the judicious use of local modalities to blur the distinction?)

gasche commented 3 days ago

One example of immutable array would be the arrays that are used in the internal representation of Coq/Rocq terms (for reasons of memory compactness, I believe?): https://github.com/coq/coq/blob/d7d392191a367839b0f5a7772e48b8f24e9f1b3e/kernel/constr.ml#L105

dbuenzli commented 3 days ago

In graphical programming it can be more natural to have arrays of points rather than lists of points (a lot of algorithms fiddle with indices, you want to quickly find out how much of them you have etc.). But then when you pass things around you don't want people to fiddle with your arrays. If arrays were immutable I would perhaps expose the lists in this interface as arrays (and the underlying rings's point lists aswell).

But then arrays are much less flexible to construct, split and merge than lists. I would likely prefer an all round efficient persistent polymorphic vector on the line of RRB vectors in the stdlib (it could be based on immutable arrays, but then it could also be based on mutable arrays)

fpottier commented 3 days ago

What are use-cases of immutable arrays in practice, can we discuss some examples? @OlivierNicole mentioned that this could be useful for constant tables, as used in Unicode libraries. What are other applications that would be enticing? (@fpottier, why are you interested in adding specific support for immutable arrays in Osiris?)

Immutable arrays are just one particular implementation of the abstract concept of "immutable sequence", which arguably is fairly useful and common. The upside of immutable arrays is that they are compact and support efficient random access; their downside is that concatenation and extraction of a subsequence are expensive (in terms of time and space). In some applications, these downsides are tolerable. For instance, Menhir uses immutable arrays in many places to represent sequences of symbols.

In Osiris (a program verification system that we are developing), immutable values are much easier to reason about than mutable values. In short, immutable values "are" just values, whereas mutable values are addresses in memory, and one must explicitly use Separation Logic assertions to describe who is allowed to read or write these pieces of memory. Osiris will have specific support for reasoning about "pure" expressions (expressions that do not allocate, read, or write mutable memory). A program that manipulates (primitive) immutable arrays can be considered pure, whereas a program that manipulates mutable arrays cannot.

fpottier commented 3 days ago

I would likely prefer an all round efficient persistent polymorphic vector on the line of RRB vectors in the stdlib

The library Sek, developed by Arthur Charguéraud (@charguer) and myself, is intended to offer an efficient general-purpose implementation of both persistent and ephemeral sequences (that is, both immutable and mutable sequences). That said, we have received extremely little feedback about it, so I do not know whether it reaches its intended target. Also, it is probably too complex to be comfortably integrated in the standard library.

gasche commented 3 days ago

I could see how to use dynarrays as buffers to programmatically accumulate elements, and then "freeze" them as an immutable array. This might be a reasonable approach for @dbuenzli's GG example where the sequence data structure is used to represent polygons as sequences of rings. (And rings as sequences of points, etc.)

dbuenzli commented 3 days ago

Also, it is probably too complex to be comfortably integrated in the standard library.

Indeed I remember looking at it and finding it too complicated. I would rather like something along the lines of pvec (repo).

lthls commented 3 days ago

What are use-cases of immutable arrays in practice, can we discuss some examples? @OlivierNicole mentioned that this could be useful for constant tables, as used in Unicode libraries. What are other applications that would be enticing? (@fpottier, why are you interested in adding specific support for immutable arrays in Osiris?)

There are a few places in Flambda2 where we use either arrays or lists, but where immutable arrays would have been the best match. I think there are other places in the compiler where immutable arrays would make sense (a good proportion of the lists in the various compiler IRs would work as well as immutable arrays, and for some of them I expect it would be a better fit). It's possible, for instance, that using immutable arrays instead of lists in the typedtree would reduce a bit the size of the cmi files, leading to slightly shorter compilation times for very big applications (when demarshalling cmi files takes a noticeable amount of time).

gasche commented 3 days ago

(Thinking out loud: changing the .cmi format (to use arrays) is fairly painful due to bootstrappin, so I wouldn't do this for an experiment, but it might be possible to just traverse the type to count the total size of lists of interest, and deduce potetial space savings from that.)

charguer commented 2 days ago

Jumping into the discussion with a few comments.

type 'a iarray = IArray of 'a array

val array_to_iarray : 'a array -> 'a iarray

let my_iarray = IArray [| 1; 2 |]

let _ = 
   match my_iarray with
   | IArray [| a; b |] -> a

First, you would extend the typechecker to check that IArrayBlock is only applied to array literals. Second, you extend the compiler to eliminate all reference to the constructor IArrayBlock and instead refer to your custom representation. I speculate that this approach could be later generalized to support not just one but several possible immutable array representations, using not just one constructor IArray but a family of such constructors.

gasche commented 2 days ago

Another remark: Batteries has long exposed a design of "arrays with read/write capabilities" as the submodule BatArray.Cap (documentation). (As far as I know, there are few users of this feature; a quick search suggests one, two.) This design is polymorphic on the mutability, the type is ('a, 'mut) t. In the case of Batteries, 'mut is a polymorphic variant [< `Read | Write ], but this could also be done with GADTs today.

Having the mutability information as a parameter is strictly more expressive than the current proposal, in particular it makes it possible to express that a function will not mutate an argument that may otherwise be mutable, by having the type f : [< `Read | `Write ] t -> ... -- exposing that any subset of capabilities is accepted. This is different from requiring f : [ `Read ] t -> ..., which enforces that the input argument is read-only for everyone. (Constant literal arrays in read-only memory are still correct at this restricted type [ `Read ] t.)

dbuenzli commented 2 days ago

Having the mutability information as a parameter is strictly more expressive than the current proposal

In my experience these kind of phantom types are also strictly more annoying to work with in practice.

bluddy commented 2 days ago

Having the mutability information as a parameter is strictly more expressive than the current proposal

In my experience these kind of phantom types are also strictly more annoying to work with in practice.

And yet, if you're working with mutable data structures, this kind of thing is very useful in giving you guarantees that approach those of immutable data structures. This is similar to C++'s const mechanism and I believe Rust's concept of mutability is built entirely on this concept. Without using phantom types for this purpose, OCaml's support for mutable data structures is quite inferior to C++'s.

dbuenzli commented 2 days ago

And yet, if you're working with mutable data structures, this kind of thing is very useful in giving you guarantees that approach those of immutable data structures.

Not really. Personally I'm more interested in knowing that an array is immutable than if I'm allowed to mutate it. These are different things.

fpottier commented 2 days ago
  • The notion of immutable arrays is useful to enable compiler optimizations

Indeed, and in particular, CSE: if the array a is immutable, then multiple occurrences of a.(i) can be shared, even if there are function calls in between.

fpottier commented 2 days ago
  • The implementation choice for immutable arrays is critical. Of course, you do need as core building block an implementation of an array as a single block of contiguous cells.

  • Regarding syntax, I would also suggest avoiding introducing new constructors in the parse tree.

I believe that we should separate the discussion of primitive immutable arrays, implemented as a contiguous block of memory, and the discussion of various kinds of "immutable sequence" data structures, which can be packaged as separate libraries.

That said, regarding syntax, you are right: we should strive to avoid the proliferation of array access syntaxes, and we should allow libraries to define their own array access syntaxes, if possible.

fpottier commented 2 days ago

Regarding the distinction between "truly immutable arrays" (immutable for everyone) and "temporarily immutable arrays" (immutable for me, but possibly mutable for someone else; or immutable now, but possibly mutable later), I believe it would be a good thing to discuss just the simpler proposal, namely having a primitive type of truly immutable arrays.

In my opinion, the example of string versus bytes shows that this simple approach can work quite well in practice. Is there any evidence that users have complained about the strict separation between string and bytes?

gasche commented 2 days ago

I'm making the distinction because I can easily think of use-cases for "temporarily immutable arrays", but I have a harder time thinking of use-cases for "truly immutable arrays". A few have been mentioned here.

(I forgot to make a humorous point. @fpottier mentions that they are considering immutable arrays in Osiris because they are much easier to specify and use in proofs. This reminds me of the joke about searching for keys under the street light.)

fpottier commented 2 days ago

This reminds me of the joke about searching for keys under the street light.

Point taken, but there is something to be said in favor of searching under the street light...

dbuenzli commented 2 days ago

but I have a harder time thinking of use-cases for "truly immutable arrays".

Just think about your uses cases for lists :-)

Is there any evidence that users have complained about the strict separation between string and bytes?

One of the problem in the current state of affairs is that there is no good way to turn a bytes into a string without incurring a copy or using unsafe functions.

Of course an abstraction could be provided for that it just doesn't exist for now[^1] and I'm slightly uncomfortable each time I allocate a bytes value, modify it and then return it as a string using Bytes.unsafe_to_string – even though the safety is easy to assert that's not what OCaml was supposed to be to me.

Other than that mismatches between what you have and what an API asks for can be annoying. Something which R/W modalities do solve, but phantom types tend not to be very usable (unwieldy error message, modality propagation in your definitions, less trodden path of variant subtyping syntax), I'm wondering if that's something that wouldn't be better solved by something built-in the language rather than encoded. Note however that this particular problem is not limited to bytes and string, bigarrays of bytes also comme into play. So many byte sequences to choose from.

[^1]: And I'm not sure it's worth adding one. Perhaps Buffer could be cajoled into that, if we stop presenting it as an append-only structure and provide a few guarantees. Same for iarrays, perhaps Dynarray could provide enough guarantees, to offer this (make n has a backing array of n, if I you don't exceed n then Dynarray.to_iarray gives me a zero-copy iarray, and resets its backing array to an empty array).

fpottier commented 2 days ago
  1. Same for iarrays, perhaps Dynarray could provide enough guarantees, to offer this (make n has a backing array of n, if I you don't exceed n then Dynarray.to_iarray gives me a zero-copy iarray, and resets its backing array to an empty array)

Good suggestion! This seems easy to implement inside Dynarray (using an unsafe conversion of a mutable array to an immutable array) and safe for the end user.

c-cube commented 2 days ago

Without using phantom types for this purpose, OCaml's support for mutable data structures is quite inferior to C++'s.

Yes, OCaml is not as good an imperative language as C++, there's nothing new there :slightly_smiling_face: . I agree with @dbuenzli that phantom types are not worth the annoyance, and I regret their use in CCVector. Having an immutable flavor of array also seems like a lot of change for almost no benefit. The only thing you gain compared to doing it in a library, is literals and the ability to match on them; but how often do you match on arrays of a fixed size? I'd rather see a HAMT-style implementation of vectors and maps in the stdlib, to catch up with the SOTA of clojure in 2007.

As of string vs bytes: there's a lot more string literals than array literals, in my experience. Immutable strings are worth it as it's such a fundamental data type. Immutable arrays are a curiosity.

fpottier commented 2 days ago

The only thing you gain compared to doing it in a library, is literals and the ability to match on them

I believe that you can also hope to gain some compiler optimizations (e.g., elimination of redundant loads).

I agree that literal array expressions and literal array patterns do not seem critical.

I would like to have immutable arrays in the language, but I do not really care whether they are primitive or implemented as a library; this probably does not make much difference to the end user.

goldfirere commented 18 hours ago

Thanks for bringing the conversation back to full immutable arrays; temporarily-immutable arrays are interesting, but indeed separate. I also think that considering grander structures on top of immutable arrays, as suggested by @charguer, is an interesting topic, but orthogonal to this proposal. (I believe all the ideas in that post are equally applicable to mutable arrays.)

Although I started out wondering about the necessity of having the syntax, I've become more attached to the syntax as we've been discussing. I tend to see immutable arrays as intending to be an easy replacement for lists, when the extensibility of lists is not needed. That is, if I have a collection of stuff that I do not expect to mutate, I would hardly ever reach for an array, as using an array opens me up to all kinds of questions about mutation. But an immutable array -- especially if the syntax is easy -- is pretty great. It's true that an immutable array won't have the :: syntax used in writing a recursive function, but I don't often do that with lists -- I use List.fold and List.map, easy to replace with Iarray.fold and Iarray.map.

For what its worth, we at Jane Street already have 874 uses of immutable arrays within our codebase. There's been no particular encouragement for adoption -- just some advertising saying that the feature is available.

Of course, all of this is future-compatible with syntax to support arbitrary mutable and immutable sequences. I don't think imagining those other features should stop us from taking this one.