ponylang / rfcs

RFCs for changes to Pony
https://ponylang.io/
61 stars 48 forks source link

Make Array iterator classes private to hide implementation from client #193

Open pmetras opened 2 years ago

pmetras commented 2 years ago

Proposal to make Array iterator classes ArrayKeys, ArrayValues and ArrayPairs private. Instead Array function keys, values and pairs will return general Iterator interface like other stdlib classes that implement theses functions. Hiding the implementation of these iterators follow good design principles.

A new interface Rewindable has been added for when an iterator can be rewinded.

These changes have been implemented in my repository of the Pony compiler and all tests pass.

EDIT: Rendered RFC: https://github.com/pmetras/rfcs/blob/rfc-array/text/0000-array-private-classes.md

SeanTAllen commented 2 years ago

If this is discussed at a sync that I'm not present at before I have a chance to comment more fully, I would like it noted that I am not in favor of this change.

I want to note this note as I probably won't have time to put together a full write up within the next few days.

jemc commented 2 years ago

Apart from the naming suggestion I gave above, I am in favor of this change :+1:

pmetras commented 2 years ago

A bit of context for this RFC.

I'm presently learning Pony, reading all documentation and particularly the one of the stdlib. Looking at the Array class, I was surprised that the API was unveiling array iterators implementations, while traditional API design promotes encapsulation. Looking at other classes in the stdlib, I saw that there were other classes that make public their internal implementation while others did not.

 $ find . -type f -name '*.pony' -exec grep "fun values" {} \; -print
  fun values(): Iterator[this->ByteSeq box]
./packages/builtin/std_stream.pony
  fun values(): (Iterator[this->A] & Rewindable[this->A]) =>
./packages/builtin/array.pony
  fun values(): Iterator[this->A]^
./packages/builtin/read_seq.pony
  fun values(): StringBytes^ =>
./packages/builtin/string.pony
  fun values(): Iterator[this->A]^
./packages/builtin/seq.pony
  fun values(): ListValues[A, this->ListNode[A]]^ =>
./packages/collections/list.pony
  fun values(): MapValues[K, V, H, this->HashMap[K, V, H]]^ =>
./packages/collections/map.pony
  fun values(): Iterator[val->A]^ =>
  fun values(): Iterator[val->A]^ =>
./packages/collections/persistent/list.pony
  fun values(): Iterator[A]^ =>
./packages/collections/persistent/set.pony
  fun values(): SetValues[A, H, this->HashSet[A, H]]^ =>
./packages/collections/set.pony
  fun values(): Iterator[this->A]^ =>
./packages/collections/heap.pony
  fun values(): Iterator[this->String]^ => strings.values()
./packages/cli/command_spec.pony

I couldn't find the logic but the fact that some results concern interfaces and some concrete classes. When designing classes, exposing internals is often considered bad design. But it could be related to some Pony constraints that I don't know about yet. So I decided to ask the question on Zulip regarding the class Array. Sean T Allen listed his arguments but I did not understood them as they were neither in favour or against such API design decision. And he decided that he wouldn't answer my questions on the subject but if I open an RFC.

A few days passed and I was still thinking about that API decision, and because I think that there must be some rational in any decision, I thought that it could be because of a technical constraint imposed by the language. So I decided to test it by myself to see if there was something in Pony that prevented it. I did the changes in the Array class in the compiler and ran all the tests, and they passed without problem. So that was probably about something else I don't understand...

Here is the link to my branch of Pony compiler that implement the changes: https://github.com/pmetras/ponyc/tree/rfc-array

Following Sean T Allen's recommendation that the discussion should continue with a RFC, that's what I did.

One of Sean's argument is that naming things is important: "I personally find naming the thing that implements the interface to be a good thing. I have a thing I can refer to. A name that can be reused."

Thinking about this argument, from a type or object perspectives is not important, in my opinion, when client code call values or keys functions. The client only needs to know that she is getting an opaque object of an unknown type to her (that's the private class), that respect the Iterator* return type of the functions... I thought that this could be different in the situation of using composition, instead of inheritance, to extend this opaque object. But first, that's the Array data structure that a client generally wants to extend and not ArrayValues, and secondly composition extends through redefinition of functions in the new types. So hidden classes still remain hidden. And like I've shown in the RFC example of code using ArrayValues, client have no reasons to use these types.

All my attempts to find a rational failed. I must say I don't understand Sean's argument. I agree that naming thing for public types are important, but I don't see why you need to name things that clients of your class don't need to know (private implementation classes). That's against information hiding/encapsulation. I don't think that Pony syntax or stdlib promotes not enforcing this design principle.

pmetras commented 2 years ago

And I forgot to say that Array is only one situation where this occurs. If one would want to simplify the stdlib API, this change could be applied in other places like List, Map, Set, etc.

ergl commented 2 years ago

I don't have a strong opinion on this RFC, but if Array already returned an Iterator interface, I'd probably vote against an RFC that wanted to change them to return concrete types as it does now.

My only concern about this RFC is adding the overhead of virtual calls for these iterators. I assume iterating over arrays is very common, so these specific classes are used a lot, even if users don't name them.

jemc commented 2 years ago

@ergl - I had the same concern about virtual calls, though I think in practice most iterator-using use cases are direct enough (e.g. passing straight into a for loop) that LLVM "should" be able to inline enough of the code to devirtualize the calls by seeing the concrete type descriptor being written in the same context where the otherwise-virtual calls are being made.

This is a hypothesis we should probably test prior to accepting the RFC. I can do some testing of that if it looks like the RFC has enough support to be accepted.

jemc commented 2 years ago

@SeanTAllen mentioned in the sync call that he'd like to see one of two things happen with this RFC:

He wants to hone in on one of those two directions so that he can better focus his energy in stating his objections, either to the general case (:one:) or to the builtin-specific reasoning (:two:)

I think this a reasonable request, and agree.

pmetras commented 2 years ago

My initial concern was only to get an answer to the question in the title of this RFP that I could understand. It could have been "It's that way because I decided that it is", without any later discussion... But I was intrigued by the fact that the design decision is against the Sanctus encapsulation/information hiding principle of computer science. And I decided to open this RFC to get a definite answer.

Now, if I had to select between direction 1️⃣ or 2️⃣, I think that applying a design principle only on part of a library would be a bad decision by creating an exception (the less exceptions, the better the language). The remaining paths are either correct everything ( 2️⃣) or do nothing (cancel and close the RFC).

Personally, I think that 2️⃣ is the best move, but I don't know yet what could be the impact on existing code or other consequences of such change. But I'm ready to investigate the impacts and update the text of the RFC in that direction if others think it's worth my effort.

jemc commented 2 years ago

@pmetras - I'm confused by your use of numbering - can you please check the numbers in the above comment I gave to make sure they match the use of your numbers - I think you meant :one: when you said :two: - can you confirm?

We discussed this on today's sync call, and Sean suggested you go ahead and try to update the RFC to make your arguments in favor of the approach for either :one: or :two: - that's not a guarantee the RFC will be accepted but it's to ensure he (and others) fully understand the points being argued so that he (and others) can voice any relevant objections clearly.

pmetras commented 2 years ago

I'm confused by your use of numbering - can you please check the numbers in the above comment I gave to make sure they match the use of your numbers - I think you meant when you said - can you confirm?

Effectively, I mixed the numbers. I meant [one], that is applying the design change to all stdlib.

SeanTAllen commented 2 years ago

This is very hard to read in GitHub due to lack of line breaks. Can you add hard line breaks around column 80 or so to make it easier to read in the GitHub web UI?

SeanTAllen commented 2 years ago

I think this RFC doesn't consider the cost of breaking user code enough. Even if I was to be convinced of the value of this change, I think we need to give more weight to considering that this will break people's code and a good justification beyond "design principles" needs to be given for breaking code.

Regardless of how I feel about this specific RFC, I would make the same point for any RFC. I would like to see the breaking change taking more seriously and consideration given to justifying that breaking change.

RFCs set precedent for future changes and the breaking change in here feels somewhat cavalierly presented.

SeanTAllen commented 2 years ago

@pmetras thank you for the linebreaks. its much easier to read now.

jemc commented 2 years ago

@SeanTAllen asked me to weigh in on this RFC to talk a bit about my reasoning, since he knows I support this RFC.

I believe that minimizing API surface is an important component in an overall strategy of stabilizing an API surface.

To be clear about my terms, and to further explain:

When I talk about "stabilizing an API surface", I mean the process of carefully designing an API surface in such a way that the need for future breaking changes is prevented to the greatest extent that is reasonable. That design process includes being clear and explicit about which API guarantees you are making to the API consumers.

It also includes carefully considering which API guarantees are actually helpful to the API consumers, so that guarantees which are not helpful to API consumers can be excluded. And that's what I mean when I talk about "minimizing an API surface". An ideal stable API will minimize the set of guarantees it makes to include only the most helpful set (above some particular threshold of "helpfulness"), because every guarantee made by a stable API is a fetter that may weigh you down during the process of introducing future improvements.

To quote Antoine de Saint-Exupéry:

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.

And that's the simplest reason why I am in favor of this RFC. Currently the API makes a guarantee that the return value of each of these functions is a specific concrete class with a specific public name. That particular guarantee does not actually serve any practical needs of users (except, arguably, the virtual calls caveat that was mentioned above). If it is not helpful to users in practice, it can be removed. And if it can be removed, then it should be removed, in the name of taking another step toward stabilizing the Pony standard library's API.

That's my basic argument.

It's worth mentioning that if this change was being debated in reverse (the current standard library had minimal interfaces for the return values, and somebody proposed to make them public concrete classes instead, because they found some tangible benefit in doing so), then it would not be a breaking change at all, because the new proposed type would be a subtype of the existing return types.

It'll always be easier to get more specific with these return types than more general. So we should default to offering the most general return type that is still useful to users, and ratchet down to more specific subtypes as needed (as tangible use cases for more specific return types are discovered). Ratcheting down will always be possible without a breaking change, but ratcheting up will be a breaking change each time. It's best to make such "ratchet up" breaking changes like this RFC before we reach Pony 1.0.0 as part of a general effort to make Pony and its standard library stable.

SeanTAllen commented 2 years ago

@jemc and I discussed this at length on Friday and came up with a version of this RFC that we can both agree upon. One of us will be adding information about what agreed upon as that we could both support.

SeanTAllen commented 2 years ago

Things Joe and I discussed and agreed upon (@jemc please add any additional points I missed adding):

Given all of the above points, we are in favor of this RFC if it was to be rewritten to take all of the above points and justifications into account thereby codifying for future decision making, why this RFC was accepted and what impact if any it has on precedence for making future decisions.

SeanTAllen commented 2 years ago

The only thought I've had so far for the name of the iterator that collections iterators would conform to is CollectionIterator.

jemc commented 2 years ago

I have not yet thought of a better name than that.

pmetras commented 2 years ago

Sorry for the delay in responding, I'm quite busy these days. I'll try to complete the open points before next week-end.

I've seen that you want to limit the changes to the collections module, and that's the place where the Iterator interface is used. And the code changes are effectively limited to that module.

But I'm questioning about the use of interfaces in Pony when compared with other languages (where I'm more accustomed to call them traits). For me, a Pony interface is used to present a type view (I don't have a better term) of an object to a function that expects that this objects implements some functions. The receiving function can be assured, from that parameter object that implements the type view, that it'll be able to call the functions defined by the interface. That's a contract between the receiving function and the parameter. The receiving function is not concerned by other type views (other Pony interfaces) that the parameter would implement, nor is it concerned if the return types from the interface functions are subtypes of the types as defined in the interface. These are minimal requirements for the receiving function when using interfaces.

So I was surprised when I saw that Pony developers wanted to make the return types explicit, as they were subtypes of the minimum types, as this was particularly flagrant with the Array iterators functions. For this reason, I want to audit the whole stdlib source and check if this development pattern is used elsewhere with other interfaces. This is not really part of this RFC and won't be included, but for trying to understand what could justify it. Why would one need to unveil implementation classes, as it it still a question without explanations in my mind?

pmetras commented 2 years ago

A way to solve the CollectionIterator vs RewindableIterator naming issue is to reconsider what I suggested in a previous discussion with @jemc but that we put aside: using functional traits composition and create interface Rewindable. But at that time I did an error in the way I wrote it, as it was dependant on Iterator.

Instead, write:

interface Iterator[A]
  """Something we can iterate on"""
  fun ref has_next(): Bool
  fun ref next(): A ?

interface Rewindable[A]
  """Something that can be rewound"""
  fun ref rewind(): A

Now you can create rewindable iterators for Array and non-rewindable ones for other collections with something like the following untested code:

type RewindableIterator[A] is (Iterator[A] & Rewindable[Iterator[A]])

class _ArrayValues[A: A, B: Array[A] #read] is RewindableIterator[B->A] ref

That way, one can really build functional interfaces/traits from composition of basic interfaces/traits.

SeanTAllen commented 2 years ago

We dont want to introduce a number of not needed interfaces into builtin @pmetras. We want a single interface for collectins iterators because we expect them to change in lockstep.

pmetras commented 2 years ago

Yes, but you can't have a single interface to describe two different functional behaviours: rewindable vs non-rewindable iterators.

The way I explained is the way to limit the number of specialized interfaces. Imagine that later there's a new kind of iterator where you can ask for the number of remaining elements in the iteration. The way to solve it is to create a RemainingCountable interface an use composition to match with the different types, not creating all the variations RemainingCountableIterator, RemainingCountableRewindableIterator, RewindableIterator, etc.

SeanTAllen commented 2 years ago

We want 1 interface for all collections because we want them to move in lockstep. All collections iterators should be rewindable.

SeanTAllen commented 2 years ago

When discussed during the sync call on Tuesday, consensus amongst attendees was that agreement with the points and suggestions put forth in my summation comment of the "Sean and Joe believe comment".

Attendees were @sgebbie, @kulibali, @redvers, @ergl, @jemc, and @SeanTAllen. I don't believe all attendees voiced an opinion but those who did supported the plan as laid out in the aforementioned comment.

Audio of the call for context is available here.

pmetras commented 2 years ago

Sorry, I've not much time to work on it in the coming days. I'm not sure I'll be able to complete it. Though here are some thoughts I had while browsing the stdlib source. I've numbered them for easier reference, but the sort order is not important.

  1. First, here is how I understand the change in lockstep thing as the given explanations are quite terse. In a collection data structure, you want that all iterator functions that exist (usually keys, values or pairs but some collections have more of them or less) return a CollectionIterator and not an Iterator. If a new function is added later in CollectionIterator, for instance infinite that returns if the iterator is bound or infinite, the goal is for all the collections to benefit from this addition. Like I wrote, changing the contract of an interface by adding new function will break client code (i.e. all clients using CollectionIterator). So usually this is something that is not done nor desired by stdlib authors. I don't understand what justifies this change in lockstep.

  2. What would be the rule for a developer to decide using Iterator vs CollectionIterator? Because she will see signatures for values, for instance, that are sometimes CollectionIterator (in collections or builtin) and other times Iterator (elsewhere in stdlib). It'll be more complex both for the API stdlib user but also the stdlib author to decide which type to use without a simple rule.

  3. Though Range and Reverse are part of collections, are they Iterator or CollectionIterator? If they are CollectionIterator, their rewind function signature must be changed. What will be the impact on existing client code?

  4. Array is not part of collections, but does it need to be considered as a collection? It means that builtin will also knows about CollectionIterator, and probably Pony compiler...

  5. Do we need to consider the builtin interfaces Seq and ReadSeq as Iterator or CollectionIterator? If you consider that a sequence is a collection of items, like Array, how do you consider StringBytes or StringRunes because one would want to be able to rewind iteration on these types? And what about ByteSeqIter as used in streams, where we can't rewind when the stream has been flushed?

  6. Many iterators in collection are built with the following destructive pattern:

    • create: Create a data structure, frequently Array where all the collection items are put.
    • has_next: Test if the data structure is empty.
    • next: Remove an element from the data structure and return it

To include rewind in these iterators, as far as I've understood from a previous comment from Sean, you don't want to rebuild the data structure but use existing content, meaning that the content will have to stay in the data structure. It requires to rewrite the whole iterator code. What is the impact on data structure elements capabilities? Do we need to declare them tag or val?

Even if some of you have a definite opinion on the best solution to implement this RFC, I'm still thinking on the various options, balancing impact on existing and future code.

SeanTAllen commented 2 years ago

Like I wrote, changing the contract of an interface by adding new function will break client code (i.e. all clients using CollectionIterator). So usually this is something that is not done nor desired by stdlib authors. I don't understand what justifies this change in lockstep.

Adding a method to an interface will only break implementers and the intention is that the only implementers are in the standard library. Any user code that accepts a CollectionIterator will not break.

SeanTAllen commented 2 years ago

Array is not part of collections, but does it need to be considered as a collection? It means that builtin will also knows about CollectionIterator, and probably Pony compiler...

For the sake of this conversation when Joe and I say "all collections", that includes Array. A full enumeration of exactly what uses CollectionIterator and what doesn't should be included in the RFC. Anything that is unclear should be in unresolved to highlight that it would need to be addressed.

I might not have stated explicitly and I apologize for that CollectionIterator would be part of the builtin package.

SeanTAllen commented 2 years ago

Many iterators in collection are built with the following destructive pattern:

create: Create a data structure, frequently Array where all the collection items are put. has_next: Test if the data structure is empty. next: Remove an element from the data structure and return it

Can you provide an example of this pattern? I didn't see it upon a quick review.

SeanTAllen commented 2 years ago

Do we need to consider the builtin interfaces Seq and ReadSeq as Iterator or CollectionIterator?

Seq and ReadSeq are interesting and limited cases as they only provide access to values() as it is the bare minimum interface. I would imagine they to be Iterator. They are interfaces not concrete collections like we would be applying CollectionIterator to. ReadSeq and Seq are used outside of the standard library and the intent of CollectionIterator is that it is "standard library only"

SeanTAllen commented 2 years ago

Though Range and Reverse are part of collections, are they Iterator or CollectionIterator? If they are CollectionIterator, their rewind function signature must be changed. What will be the impact on existing client code?

I believe those are CollectionIterator but that would require more thought.

The only change I see to the signature would be "should this return a this like ArrayValues does. Given that can do method chaining with ".>", it seems like rewind on ArrayValues is using the pre-method chaining approach to returning this which wouldn't be needed. I think the key here is to decide on the appropriate signature for rewind and implementing accordingly.

If some existing rewinds changed to have a return type, that isn't going to break code, but if we remove the return value (say from ArrayValues) then it would be a breaking change.

Looking at ArrayValues, that looks like we missed changing the return value when we introduced method chaining. I personally would be in favor of changing the ArrayValues return type and accepting the breakage from that given that we already have accepted breakage to make this change.

pmetras commented 2 years ago
  1. Adding a method to an interface will only break implementers and the intention is that the only implementers are in the standard library. Any user code that accepts a CollectionIterator will not break.

Any existing client code that presently accept a *Key or *Value or *Pairs iterator will have to be recompiled to accept an CollectionIterator instead, but this can be done usually without code change.

But if I implemented a BitSet data structure or other type of collection, that is not part of the stdlib, with a values function, I'll have to adapt my code if I want to offer a collection-compatible interface to clients, meaning that I'll have to implement the new rewind function.

pmetras commented 2 years ago
  1. Can you provide an example of this pattern? I didn't see it upon a quick review.

Here are examples I've found tonight doing a search on shift:

pmetras commented 2 years ago

No one explained the advantages or disadvantages of creating a new type CollectionIterator instead of updating the existing interface Iterator. If CollectionIterator were private to the collection package, I could understand that its definition evolves like the classes it operates on and that changes to that type of iterator would have limited impact (I think that's what Sean T. Allen calls lockstep), but it does not seem to be what's desired. I don't see a major advantage in creating a new type of iterator for collections only, when both iterators are builtin. Iterator could be updated to define rewind with a default implementation, and neither client's nor implementer's code won't be impacted at all...

Please explain why adding CollectionIterator is a better solution than updating Iterator so these arguments can be added to the RFC.

pmetras commented 2 years ago

Also, the iterator contract does not specify if a collection can be updated while iterated. This is a frequent source of bugs in other languages, even sequential ones. I haven't thought how to deal with this situation with Pony and if capabilities can help there. But suppose that we allow some collections to be updated by other actors that we had a new function updated to notify an iterator when the iterated collection has been changed (new elements added, removed or updated) and let the iterator implementer decide what to do. In which iterator interface would you add this updated function?

If it were added by combining different types, for instance like (Iterator[A] & Updatable[A]), implementors would have the flexibility to define precise types, compared with the suggested CollectionIterator interface.

Here, Updatable is something like

interface Updatable[A]
  fun updated(): A
    """Notifies that the type `A` has been updated and returns the new content."""

Please explain why adding CollectionIterator is a better solution than combining types so these arguments can be added to the RFC.

ergl commented 2 years ago

@pmetras This part:

I don't see a major advantage in creating a new type of iterator for collections only, when both iterators are builtin. Iterator could be updated to define rewind with a default implementation, and neither client's nor implementer's code won't be impacted at all...

would not work for iterators that define a streaming sequence. You could, potentially, define an iterator over an infinite stream of events that can't possibly be rewinded, unless you kept all previous elements in memory. We don't want to constraint future uses of the Iterator interface so that all implementations have to be rewindable.

pmetras commented 2 years ago

@ergl , perhaps I did not express correctly my concern. The important word is for collection only. I have previously suggested naming it RewindableIterator or better, in my opinion, having a Rewindable interface.

If Iterator is defined as an interface with a default implementation of rewind that raises an error or does nothing or create a new iterator (behaviour to be defined later), this type could be used even by streams or infinite sequenses that can't be rewound. For instance:

interface Iterator[A}
  fun has_next(): Bool
  fun next(): A ?
  fun rewind() ? => error
    """By default, iterators can't be reset"""

I don't think that such interface is the best solution, but at least it prevents having two types with one that is difficult to limit to collections usage.

SeanTAllen commented 2 years ago

@pmetras This RFC should not modify Iterator[A]. No one is suggesting anything that would result in a change to Iterator.

pmetras commented 2 years ago

I've added the rendered RFC at https://pmetras.github.io/rfcs/ and updated the text.

Now, I've just to find some time to see how to implement it and measure the impact.

rhagenson commented 2 years ago

@pmetras I can only speak for myself, but I am not going to accept an RFC which breaks from our RFC procedure. There is no need to add HTML commits as GitHub will render the RFC's Markdown directly. You can link to that rendering like I did in #190

pmetras commented 2 years ago

@rhagenson , please explain how to have it rendered and how to get a rendered URL. I wasn't able to find it automatically rendered as I wanted to be able to read it and I would have save a few hours today. So I used pandoc and followed Github documentation...

rhagenson commented 2 years ago

@pmetras Thank you! Will you update your first comment with the link to the GitHub rendered Markdown?

SeanTAllen commented 2 years ago

I've repeated numerous times that I am against setting precedent based on a design principle. I didn't read the entire new version of this RFC. I read up to the 3rd mention of this being done because of a design principle and stopped.

I'm against this RFC and at this point, I'm not going to invest more time in it. I can't support the motivation given, do not support it being used to set precedent going forward and have stated that more than once. I have many other Pony related things to do and so I am voicing my opposition to this RFC. I would welcome an RFC that meets the reasoning stated here: https://github.com/ponylang/rfcs/pull/193#issuecomment-1021509909.

As is, it seems clear to me this RFC is not going to be written taking that motivation into account.

pmetras commented 2 years ago

I won't update this text anymore. It seems my English vocabulary and wording are not adequate to describe the objective of the RFC. Anybody who wants to support this proposal can take over and update this text.

rhagenson commented 2 years ago

@pmetras Thank you for the work you have put into this RFC.

I am in support of the core of the RFC so I am still hoping to see it through to the end. However, I am already working on another RFC (#200) which I expect will draw most of my attention. Is there anybody else in the Pony community with the bandwidth needed to take over this RFC?