nim-lang / RFCs

A repository for your Nim proposals.
136 stars 26 forks source link

Optional {}-Keyed/[]-positional access syntax for B-Trees #207

Closed c-blake closed 1 year ago

c-blake commented 4 years ago

So, over at https://github.com/nim-lang/Nim/pull/13890 I raised this question that may have broader interest and so warrants broader exposure. This executive summary is that binary trees/B-Trees/even a sorted seq support multi-path access. There are a couple of ways we could try to support this nicely in Nim .

What I think looks nicest is x{key} to reference an element by key with x[i] to reference it by position/rank. If the sequence is not sorted or the tree is "keyless"/"unkeyed" then only [] works. If it is keyed, both work, but they are only different for sorted seq/B-trees which support them. Hash tables with no well-defined positional access (or at least ambiguously defined - the Nth data[] element or the the Nth occupied data element?) would just have both {} and [] mean the same thing. I guess we could deprecate [] for hash tables for many years or something, but that might be too unfriendly to legacy code/programmer brains. I'm not advocating that, but it is a possibility.

A layered approach might be similar to the way [^1] works with new type(s) to distinguish the kind of access. This would also probably need some kind of helper operator like ^ to be easy to use as [Rank[i]] is maybe a bit verbose. (EDITED for clarity.)

There is also a capability for search trees to be augmented to support "interval queries" (https://en.wikipedia.org/wiki/Interval_tree), but the use cases for that are vastly more specialized than just a sorted seq which is kind of a bread&butter programming situation and interval augmentation makes less sense to be "on by default" the way rank augmentation may. So, while that [Rank[i]] idea may be extensible in this direction, that extension may not make as much sense in the stdlib { which might become more like "nuclear reactor included" than "batteries included" :-) }.

There are also questions about the nicest way generic-parameter-wise to make rank augmentation optional (more or less mathematically "dual" to how to make "keep it sorted" behavior optional for a fancier-than-we-have seq type). Some kind of [void|int] generic type parameter might make the most sense and might even generalize to [void|int|interval] for that aforementioned interval augmentation.

Anyway, it would be nice to hear feedback/what people think before going too far down any particular road. (Although having {} be ^1-like syntactic sugar for [Rank[i]] may be enough to be all things to all Nimmers.)

c-blake commented 4 years ago

Oops - I misspoke/typed in that final ()s. We would want {} to be ^1-like for [Keyed[]] not [Rank[]]. That's actually kind of the core mismatch the whole RFC is about since there would be no [Keyed[]] need for hash tables since they cannot do both [Keyed[]] and [Rank[]]. Sorry.

c-blake commented 4 years ago

Also, it would probably add value to just have a true SortedSeq in the stdlib. That is actually A) the core impl of the simplest B-tree Node, B) useful in its own right { its absence leads some toward poor solutions (as in R with Srunmed.c as I've previously mentioned]}, and C) almost trivial with the binary searches already in algorithm, but most relevantly here D) has all the same "interface issues" being raised in this RFC. We could hammer them all out for SortedSeq and then make the B-tree match.

c-blake commented 4 years ago

(A sortedSeq can probably also provide as much Interval interface assistance. I am not a game developer, but my impression was that interval search trees can be useful in that realm as well as in scheduling software. Anyway, I doubt those cases are common enough to warrant special operator syntactic sugar, but they might inform relations between types.)

c-blake commented 4 years ago

Tagging @stefansalewski since I know he did an RTree (a https://en.m.wikipedia.org/wiki/Grid_file with possibly B-tree-based edge indices is another option there) and instead of just [Rank[]] and [Keyed[]] we may also want a multidimensional [Box[]] instead of [Interval[]]. @mratsim may also have useful feedback along those multidimensional lines.

andreaferretti commented 4 years ago

Wouldn't it be much simpler to leave [] as it is, and use a function like mySortedSeq.getByRank(12)? This is not the most common operation, I am not sure it deserves its own operator - let alone, take an existing operator, repurpose it and introduce a new operator for standard access

c-blake commented 4 years ago

The fundamental tension here is definitely between consistency across contexts and simplicity, and it's definitely off the well worn path due to poor developer awareness. That's why I opened an RFC to solicit opinions and/or better justify the envisioned approach. So, definitely thanks for the feedback!

Perhaps I should have been more explicit that there are 3 states all realizable from the same generic instantiation (at least possibly): 1) positional only - insert/delete/add type access 2) keyed only (no augmentation) - hash like access 3) simultaneously 1 & 2.

How common they all are depends upon context (not unlike BackwardsIndex behind [^1]). Unlike BackwardsIndex the calculation being replaced varies significantly in scalability. I agree in a 3) context, rank/positional access is not so common outside (often underappreciated) order statistics and a getByRank would do the trick. However, in an unkeyed context like 1), it is all you can do, as with seq.

What you seem to be suggesting is changing the meaning of [] between keyed and positional under such instantiation switches or else maybe having no [] at all. That also seems complex, confusing, and/or unpopular. If we used no operators anywhere and had getByKey that would also be clear, but like writing out x[BackwardsIndex(1)] which I'd think unpopular.

Another way to analyze it is "code evolvability". Someone writes a bunch of code using [] and seq and then realizes that dynamically many mutations can happen in the middle. Or they write a bunch of hash code (maybe smartly with this new {} operator instead of legacy []) and realize they want it automatically sorted with maybe ranks. The distinct operators let you just change the type constructor and enjoy the better scalability across your new, expanded set of operations. It's honestly a lot like overloaded/type dispatched syntax questions in general.

c-blake commented 4 years ago

Put another way, [] is "left alone" for seq and the new B-tree. [] for hashes should maybe always have been different and needs an alias to comport with the new thing, and that duplication/similarity to other languages with more rigid grammars can create code evolution headaches. There's a real symmetry between keyed & positional access with ranked trees. You can have either or both.

andreaferretti commented 4 years ago

I am not sure what is a data structure where you only have access by rank

c-blake commented 4 years ago

Well, "rank" is a loaded word implying some kind of key order which is why I keep saying "rank"/"positional". When you have no defined key type, you can still augment the tree - every proc is identical except the seek that fills a path. All inserts/deletes/rebalancing just work off of that path. Access by a rank-like integer/position is then still fine, but that integer is notionally now more just a "position" because there are no keys to sort/rank. So, "becoming sorted" once a key type is available is a semantic switch, but kind of an in-the-user's face one since they're either providing a key type or not.

In more concrete terms, an augmented tree can be like: 1) a seq 2) a Set/Table, but sorted by a key 3) both where the integer access now means rank by the key.

(These all apply to all binary search trees as well as B-trees, but the latter are more efficient in today's memory systems.)

andreaferretti commented 4 years ago

Ok, but I am not sure why you want to expose the path to the user. Many programming languages use data structures based on trees (e.g. Scala, Clojure) without compromising on the interface that they present to the user, which consists of the usual suspects: sequences, ordered sequences, dictionaries and so on

c-blake commented 4 years ago

The path might well be internal/unexposed as part of some rawGet family. I mention it to try to explain the symmetries involved since you seemed to not understand.

In terms of cross language comparisons, type names are the lowest common denominator. Many programming languages don't have Nim's define-new-operators to reflect distinct semantics with distinct syntax and so this question literally cannot arise except at the initial language design stage where the language designer probably has bigger problems to worry about. I haven't looked into whether Scala or Clojure even provide augmented trees - even as a type triplet or user-operators. They may well not. Many prog.lang runtimes neglect "completing the square" in their trees because the rebalancing is a PITA to get right in the first place. Database people do not neglect it, but they have different interfaces entirely.

In terms of bigger things to worry about, though, having {} for Sets and Tables also can emphasize that they cost more/cost differently than [] for seq. As with most things, they are "similar but also different" operations. How alike they should look is a judgement call. The situation is very much like += for string concatenation vs. Nim's &. I would say +=/+ on strings is another vestige from "fixed syntax/flexible semantics" of the 1980s C++/Python.

c-blake commented 4 years ago

One little performance example is that in Python every '.' is a dict lookup. So, "counting periods(.)s" and doing early binding is one optimization sometimes recommended. Similarly, in a less "freakin' everywhere" sense, counting braces {} in Nim would make sense under my proposal. Making expense explicit often helps people write more efficient code.

I'm not advocating getting rid of [] for Table, just providing the alias for our triple personality B-tree. Uncaring users/refugees from other prog.langs could basically go on as before only making their code harder to evolve to a new type and occasionally wondering "why the alias in others' code?". Users that either want to highlight associativeness of lookup or to make code compatible with either tree|Table backing could use {}.

We could vary meanings of operators across a type triplet which I think is your counter proposal. I think that's also confusing. Look, I get it. This flexibility of trees is poorly disseminated and anything not well known is confusing at first. So, it is true that client coders may have to learn something they didn't get taught well/at all to fully understand, but A) the access distinction also has multiple motivations, and B) the learning can be mostly taught with a sortedSeq which everyone does learn. (A sortedSeq combines what I called modes 2&3, though, but that's not the most confusing part.)

c-blake commented 4 years ago

I also would also say this isn't a "compromise" on the user interface. I think that attitude is equivalent to "rigid syntax/flexible semantics of C++/Python is uncompromised" or "all prog.langs should be the same as much as possible" or something. I disagree with both of those and I'm not sure how else to rationally call it a "compromise".

It is as little (or as much) a compromise as & for seq/string concatenation. The entire situation is almost exactly analogous to that. Other langs use + and type names because they must. Concat & add are sort of the same and sort of different. Sameness-wise, we even call the append operation "add". Differentness-wise, arithmetic is not allocating. EDIT: And multi-personality-wise, people want + to be available for vector arithmetic on seq.

c-blake commented 4 years ago

{} are used for initializer syntax of dict in Python and in Nim we have that for associative pair lists and set which is implicitly keyed. It's pretty coherent with both those notations and with how [] works for both array/seq initializers and access to have this distinction available.

To me, at least, x{key} seems very likely to be interpreted correctly even by someone who's never seen it before. They may simply think it's about consistency with distinct initializations not either efficiency nor code evolvability. So, initializer consistency would be a third motivation.

andreaferretti commented 4 years ago

We have types to distinguish the [] access in seqs from the [] access in tables. You are writing a lot, not all of which I am sure I follow. Can you you write a simple example of a data structure where you think your proposal would improve the syntax? The only example I got were ordered seqs - which I don't think warrant such a big change - and trees, which could overload [] with a path object.

Nim is pretty uniform in how it treats access: x[i] means you call the [] appropriate function overload with respect to type(x) and type(i).

If x is such that it has more than one pattern of access, one can use different types for i (as it happens for access in reverse order), or just use a different name for the operation. For ordered sequences, I would be happy with either

myOrderedSeq.ranked(i)
myOrderedSeq[R(i)]

where R constructs an object of type Rank which can be used to select the correct overload.

We could vary meanings of operators across a type triplet

We are not varying meaning. x[i] means access item i on x - it is just that what this means varies based on the data structure that x represents. Using a different notation for random access vs. positional access seems an implementation detail - not to mention the headache that it would bring in updating all existing libraries, tutorials and more

c-blake commented 4 years ago

The path is what is the same across access styles..one type of path for various kinds of seeking that can populate it. Hence my [Keyed()], [Position()] kind of examples earlier on.

I wrote a lot to try to explain it many different ways. You keep "casting it back" to simpler scenarios that remove the question. I was proposing overloads like you mention, but with an additional {} operator to make things nicer, like [^1] does for BackwardsIndex.

c-blake commented 4 years ago

To do it your way, with no extra operator and not varying operating meaning across instances, people would have to always say x[Keyed(key)] to get keyed acess to a B-tree (since it could have int key types and so that alone cannot distinguish). Or they would have to say x[Position(i)] for the converse situation, depending on whether you want the tree interface to bias towards a seq substitute or a Table substitute. (And, as I mentioned, there are at least two other arguments independent of trees for the syntax addition, but maybe 3 arguments with trees in there is critical mass?)

What is fundamentally going on is that there are two different things positional and associative access "named the same", namely [], but A) they are already named differently in initializer contexts, and B) some things support both/either, [ and C) depending on type have very different performance characteristics ]. But really all this is just repeating myself. You may want to read from the top with the understanding gleaned from this follow-on discussion and reconsider your thumbs down. :-)

c-blake commented 4 years ago

Oh, and I use SortedSeq because it is the simplest thing that exhibits the problem. This is an interface RFC more or less only about whether x{key} and x{key}= are a nicer interface and that's the whole code example. If the answer is "no" we can just do x[TreeKey(key)] or x[TreePos(i)].

Tutorials that do not want to worry about consistency with initializers, operation efficiency highlighting, or code re-targetability do not need updating. "But wait, you say, those all sound like good things!" Exactly, I say. So, why not add the {} and {}=? :-)

Araq commented 4 years ago

@c-blake I personally like [] vs {} but I see little chance of getting it accepted. Most people don't like new notations and prefer longer names. In the end only Leibniz should be allowed to introduce new useful notation. So using ∫ for integrals is fine even though I never deal with integrals but JSON which I deal with all day long shouldn't get any kind of shortcuts because Java/Python cannot do that. It's a lost battle. ;-)

c-blake commented 4 years ago

Well, you made me laugh out loud with the Leibniz reference, but it bears mentioning that any new programming language is a new notation. So, this falls into the "don't use & or ^1 or ..< because it's not like other prog.langs" rationale. If we really don't vary the meaning and keep one generic type then people will have to always do x[TreeKey(key)] or x[TreePos(i)] to select access. (Just "Tree" since we (or someone) may well add binary trees that have similar flexibility someday and it has to be agreed to not be too frivilous about name multiplication).

Unsure what vote threshold/weighting counts for "acceptance". Let's see if I have persuaded anyone else first before declaring it a lost battle. :-)

c-blake commented 4 years ago

TLDR - All hail @Araq - the new Liebniz! ;-) Doing for programming what he did for calculus. While the Brits muck about with [VerboseType()], the Continentals zoom along with {}. ;-)

alehander92 commented 4 years ago

but this is a valid rationale :D you can only differ by other languages in a limited amount of things, before some people start getting frustrated (and trust me, i am a big fan of "reinventing something with a new API")

c-blake commented 4 years ago

[] still working on the more vanilla Table should limit frustration. What people try first will work. Something else will, too, that something else is nicer in several ways.

c-blake commented 4 years ago

And fully functional all-3-modes trees in stdlibs are rare enough people should have near zero prior expectations.

c-blake commented 4 years ago

(but rarity doesn't make them un-useful!)

andreaferretti commented 4 years ago

To do it your way, with no extra operator and not varying operating meaning across instances, people would have to always say x[Keyed(key)] to get keyed acess to a B-tree (since it could have int key types and so that alone cannot distinguish).

I think I finally understood your point. I didn't think of exposing the B-Tree at all! Of course, one should have types BTreeSeq and BTreeTable (maybe with better names) each with their own semantics, even if they use a B-tree under the hood. The B-Tree may be exposed, with an interface which reflects the natural operations on it, but this has nothing to do with data structures that use that under the hood.

In other words, the semantics should be that of the relevant data structure, not that of the implementation underlying it. Seq and Table in Scala also use trees, but you never really need to know or care about that.

Of course, if you want to exposes B-trees and treat them as sequences or tables according to the needs of the moment, that is going to cause a lot of pain. I had assumed there shoud be wrappers available. You don't see the seq backing a table right now, it is an implementation detail, as it should be

c-blake commented 4 years ago

The "natural" operations flow from the data structure which you half-acknowledge, half because you limit yourself to only seq or only table. I think if you expand your consciousness just a tiny bit you may see the value. You seem so close.

Even if hidden and with a type triplet, we have three closely related types (probably all implemented by the same template or a maybe hidden generic). BTreeSeq would use [] for position because A) there is no other choice and B) users would complain if we make them say[TreePos(i)]. BTreeSortedTab would still need to decide does [] mean position or by key or else always use [TreeKey(key)] and [TreePos(i)]. BTreeSortedTabSeq would be that case where you stick with whatever the BTreeSortedTab decision was and then have only a verbose way to reference the alternative. If you want all these closely related types to "work/look" similar you are back to [KindOfAccess(kindOfAccessor)] or varying semantics of [] for closely related types.

Anyway, it's all pretty clunky/awkward (to me) if you try to do everything by the LHS and RHS of one [] operator and types, but very clean and crystal clear if you have a pair of indexing operators, [] and {} always for position and key. (Plus there are the initializer and performance motivations and Nim-precedent like & for breaking with tradition a little in the name of clarity.) Plus, as is implicit in all of this, positional and associative access are the two main kinds of access people reach for with terse syntax. I doubt operators would be wanted for Interval/Box/etc. any more than we would want special initializer syntax for them. The initializer notation for sets/alists already makes this hardly even a Leibniz worthy "innovation" as much as rounding out/following through on that very same notation. It's less of a novelty than &, really.

(The seq backing a Table is for both safety, so users cannot break table invariants, and for abstraction, and we may well want to not provide []= for my case 3), only allowing {}=..I.e. block a user from entering a value at a position in violation of key ordering.)

c-blake commented 4 years ago

I looked into it very briefly and I cannot find any fully functional seq-only Scala B-tree or red black tree. Just the kind of case 3 hybrid and that only in the RB tree. It also seems like Scala does not let people define their own operators either. "Lessons from Scala" may not apply here much at all.

c-blake commented 4 years ago

In Python, https://pypi.org/project/blist/ does some of these things.

andreaferretti commented 4 years ago

I cannot find any fully functional seq-only Scala B-tree or red black tree

Vectors (the default linear collections) are backed by a Trie, HashMaps by a hash trie, TreeSet, TreeMap and SortedSet by a Red Black tree. All of this is explained in https://docs.scala-lang.org/overviews/collections/concrete-immutable-collection-classes.html

This document explains the collections hierarchy. It is somewhat overengineered, but it separates well the interface from the implementation

It also seems like Scala does not let people define their own operators either

https://docs.scala-lang.org/tour/operators.html

I think if you expand your consciousness just a tiny bit you may see the value. You seem so close.

I think that there is no need for this

c-blake commented 4 years ago

This is about "nice" not "need". We do not need {key:val,...} since we have @[(key, val),...].

c-blake commented 4 years ago

Also, {} and {}= do already work fine (as @Araq probably knew :-) ). So, this is really just about whether we use them/what loose convention we attach to them (like the loose add, del conventions).

andreaferretti commented 4 years ago

I was not talking about programming languages. I was talking about civil conversation. The complete sentence would be: I think there is no need for such sarcasm, it was uncalled for

c-blake commented 4 years ago

I didn't mean to be either un-civil or sarcastic, and I still do not think I was. I earnestly thought (though I may have been wrong) that you were close to seeing value with a slightly expanded viewpoint. I'm very sorry to hear that you took that the wrong way. No personal slight was intended.

cooldome commented 4 years ago

I am for one of:

myOrderedSeq.ranked(i)
myOrderedSeq[R(i)]
c-blake commented 4 years ago

The separate proc doesn't address the triplet problem as discussed, but we surely could have hyper terse type names. Those also seem unpopular. I've yet to hear someone complain how horrid {:} is when we have @[(,)]. Maybe Guido vanR is another "Liebniz"? ;-)

github-actions[bot] commented 1 year ago

This RFC is stale because it has been open for 1095 days with no activity. Contribute a fix or comment on the issue, or it will be closed in 7 days.