Introduce a Kollection object with functions that operate on all types of items that can be containers of unlimited number of "members"

dnovatchev commented 11 months ago

The base for this issue is the email sent by @ndw to public-xslt-40@w3.org on Dec. 13th 2023, fully quoted below:

Hello all,

After a couple of weeks of discussion[1][2] about naming things, there seem to be some quite different perspectives on the problem.

As background, let’s remember that we have a language (or a set of languages) that evolved over time. We couldn’t anticipate in version 1.0 what we would have in 4.0. We added new features in 2.0 and 3.0 that weren’t anticipated in previous versions either.

We live with decisions (some the result of long and hard battles within the working group(s)) like the fact that sequences don’t nest so all individual items are also sequences of length one.

The context for each addition to the language has been roughly: how can we add new, useful features with a minimum of backwards incompatibility.

It’s a natural consequence of this sort of evolution that there are rough edges. Why does fn:count return the number of items in a sequence but always return 1 if the argument is an array? Because an array is an item and an item is a sequence of length one.
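The rule just described can be modelled in a few lines. This is only an illustrative Python sketch, not XDM itself: `XArray`, `sequence` and `count` are hypothetical stand-ins, with a Python list playing the role of a sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XArray:
    """Hypothetical stand-in for an XDM array: one item that holds members."""
    members: tuple


def sequence(*parts):
    """Build a flat sequence: nested sequences (Python lists) are spliced in,
    while an XArray stays intact because it is a single item."""
    flat = []
    for part in parts:
        if isinstance(part, list):
            flat.extend(sequence(*part))
        else:
            flat.append(part)
    return flat


def count(value):
    """Mimic the fn:count behaviour described above: a sequence reports its
    length, and any lone item is treated as a sequence of length one."""
    return len(value) if isinstance(value, list) else 1
```

In this model `count(sequence(1, [2, 3], 4))` is 4 because the inner sequence is flattened, while `count(XArray((1, 2, 3)))` is 1 because the array is a single item.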

(It doesn’t help that the vision of what the X* languages should be has changed over time. What started out envisioned as a tool for transforming documents from one format to another for presentation on the web or in print has grown into something that at least some members of the group view as first class, functional programming languages. That’s not bad, but it puts entirely different stresses on the design, I think.)

As we add new functions (specifically, in the case of recent discussions, but I expect the same perspectives apply more generally), I think one perspective is roughly this:

How can we name and organize the functions so that users are least likely to be surprised and most likely to be able to figure out how to solve a particular problem?

Taken to an extreme, this perspective isn’t about changing the semantics of the functions at all, it’s “just” about naming them. Is fn:get() better (easier to understand, less confusing) than fn:items-at?

I think another perspective is roughly this:

We have a messy design. It would be better if we could refactor the design so that it was more harmonious and logical. We don’t need four different, closely related functions to get items out of different sorts of data structures, we need a set of abstractions that make it obvious that only one function is necessary.

Taken to an extreme, this perspective is about reshaping the whole language so that a single, obvious set of function names emerges naturally from the carefully constructed abstractions.

I don’t think anyone holds exactly one perspective (discussions about renaming often involve some level of discussion about semantics, for example) and I’m attempting to polarize the perspectives a little bit in an effort to shine light on a larger problem, not to be divisive.

With my chair’s hat on, the main problem I see with the first perspective is that naming is hard, often personal and emotional, and will never be wholly logical (so there will always be more to discuss, so the “problem” is never resolved). It’s not quite fair to say it’s a distraction from the “bigger” issues we need to resolve, but it does take a lot of time.

I see the appeal of the second perspective. If we had a green field, we’d do things differently. I think we might all agree that, ideally, fn:count should return the number of items in a sequence, the number of items in an array, and the number of key-value pairs in a map. But it doesn’t and it can’t without fundamentally breaking things. I don’t think we’d get agreement to break fn:count, so what can we do?

A proposal to fundamentally redesign the data model would be a tough sell, I think.

One thing we could do is define a new namespace “gn” with functions that work more logically, that treat sequences, arrays, and maps, as collections and operate on them uniformly.

I suppose we could reconstruct the whole set of functions in this new namespace and focus our efforts there, perhaps going so far as to deprecate the current fn: namespace in favor of this new one. But could we get consensus to do that? Would users thank us?

I dunno. Innovations welcome.
                                    Be seeing you,
                                      norm

This issue addresses the 2nd alternative formulated briefly by Norm as:

One thing we could do is define a new namespace “gn” with functions that work more logically, that treat sequences, arrays, and maps, as collections and operate on them uniformly.

I suppose we could reconstruct the whole set of functions in this new namespace and focus our efforts there, perhaps going so far as to deprecate the current fn: namespace in favor of this new one. But could we get consensus to do that? Would users thank us?

Here are some of the obvious advantages of having a uniform Kollection concept that covers arrays, sequences, maps, ... and possibly future new, specific, collection-like datatypes such as sets:

1. Uniform definition and understanding of a single data type - the Kollection.
2. O(N) functions only, compared to O(M * N) at present. Here N is the number of functions needed for each of the current collection-like data types (Arrays, Sequences and Maps) and M is the number of collection-like data types (currently 3).
3. The users will need to know about and understand just the single Kollection data type and its functions, not 3 or more similar collection-like data types and 3 or more sets of similar (but different) functions. Reducing by a factor of 3 the amount of factual knowledge that a user needs is something HUGE and extremely positive.
4. Allowing users to say "Good Bye" to the unclear and treacherous flat-sequence concepts we have as legacy from XPath 1.0.
5. Staying aligned to the examples of other modern programming languages such as C# with its IEnumerable interface. It is good to know that this has already been done in other shining programming languages, thus a nay-sayer will not be able to argue that this is not doable or, if done, would be negative to the language and its users.
6. Freeing enormous resources and time for the members of the Community Group so that they can spend this on more valuable avenues than trying to find similar and best names for M similar functions, each defined for one of the M current collection-like data types.

Now, to dispel some plausible myths before they start circulating here:

Myth 1: This will break backwards-compatibility? No, as proposed by Norm, all the functions operating on the generalized collection data type can be in a separate, new namespace and thus no existing user-code is affected.
Myth 2: If a sequence containing a single Kollection still has count() of 1, then what is the use of the Kollection data type? Actually, as proposed by Norm, the Kollection data type and its functions reside in their own namespace. Doing things using only functions from this new namespace eliminates the possibility of using fn:count as it resides in the different, currently existing standard function namespace.
Myth 3: This will be too complex for the users and the users will not embrace it, so let us not waste time designing it. Wow, there were such prophets saying exactly the same about LINQ in 2005. As it often happens, the future proved them wrong. Users clearly and overwhelmingly "voted with their code" incorporating LINQ in almost all everyday applications and code repositories.
Myth 4: Banning the current functions operating on sequences, arrays and maps would be a huge burden to the users, and would interfere with their programming. In fact, nobody would be banning any of the existing functions. Users can continue to use them forever. The acceptance of the uniform and generalized Kollection data type can happen gradually with time, as was the case with the addition of LINQ to C#.

michaelhkay commented 11 months ago

Sadly, I think this is one of those cases where any attempt to simplify runs the risk of adding another layer of complexity. Instead of three sets of collection functions, the danger is that we end up with four.

I think the acid test of whether this works and is truly a simplification is if we can define functions like get(), append(), and fold-left() without the specification resorting to an enumeration of cases: if the input is a sequence, return X, if it is an array, return Y, if it is a map, return Z. I think the characteristics of sequences, arrays, and maps (at the object model level) are sufficiently different that this will be hard to achieve. But if someone wants to prove me wrong, by all means come up with a detailed proposal.

Regarding point 5, I think the reason Java and C# have been able to come up with a common API for handling different kinds of collection is that they have started by building a level of uniformity into the data model which XDM lacks. Because the XDM data model is essentially a composite of XML and JSON, rather than something designed ab initio, it lacks that conceptual simplicity.

ChristianGruen commented 11 months ago

I think the acid test of whether this works and is truly a simplification is if we can define functions like get(), append(), and fold-left() without the specification resorting to an enumeration of cases: if the input is a sequence, return X, if it is an array, return Y, if it is a map, return Z. I think the characteristics of sequences, arrays, and maps (at the object model level) are sufficiently different that this will be hard to achieve. But if someone wants to prove me wrong, by all means come up with a detailed proposal.

I agree. To start with the most basic function, what would be a reasonable function signature for gn:get, and what would be reasonable results for the following queries?

gn:get( 1, 1 )
gn:get( 1, 2 )
gn:get( [1], 1 )
gn:get( [1], 2 )
gn:get( [1, 2], 2 )
gn:get( ([1, 2], 2), 2 )

dnovatchev commented 11 months ago

Sadly, I think this is one of those cases where any attempt to simplify runs the risk of adding another layer of complexity. Instead of three sets of collection functions, the danger is that we end up with four.

Thanks for providing the contents for Myth 5!

Actually, the people who use Kollection would rarely, if at all, need to use functions on sequences, arrays and maps. People who haven't yet started using Kollection will continue to use the 3 * N functions they have been using all along. It is expected that in the long run the majority of users will get to be comfortable with using just one set of functions - those of the Kollection data type.

I think the acid test of whether this works and is truly a simplification is if we can define functions like get(), append(), and fold-left() without the specification resorting to an enumeration of cases: if the input is a sequence, return X, if it is an array, return Y, if it is a map, return Z. I think the characteristics of sequences, arrays, and maps (at the object model level) are sufficiently different that this will be hard to achieve. But if someone wants to prove me wrong, by all means come up with a detailed proposal.

As already said, when one uses Kollection, they use only its functions. Every object handled by the Kollection functions is either an atomic item or a Kollection itself. Thus, no arrays, sequences or maps.

One way to achieve this is by providing 3 Kollection constructors that take a sequence, an array, or a map, and produce a Kollection containing exactly all the members of the initial argument. This will automatically turn any deeper-level arrays, sequences or maps, into collections themselves.

We will also have 3 similar constructors that create out of a given Kollection an array, a sequence or a map. It would be a completely explicit (not automated) decision that the user specifies about what kind of constructor to use in either of the directions: inward into a Kollection or outward into one of the 3 present-day collection types. Thus we do not have to prescribe different function behavior in any such case.
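To make the two directions concrete, here is a minimal Python sketch of the idea. Nothing in it is proposed syntax: the class, the `from_*` / `to_*` names and the `size` function are hypothetical, and Python lists, tuples and dicts stand in for sequences, arrays and maps. It also assumes that a map entry becomes a two-member Kollection (key, value), which the proposal does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kollection:
    """Hypothetical uniform container: members are atomic values or Kollections."""
    members: tuple

    # Inward constructors: the user explicitly picks the one matching the input.
    @staticmethod
    def from_sequence(seq: list) -> "Kollection":
        return Kollection(tuple(_wrap(m) for m in seq))

    @staticmethod
    def from_array(arr: tuple) -> "Kollection":
        return Kollection(tuple(_wrap(m) for m in arr))

    @staticmethod
    def from_map(mapping: dict) -> "Kollection":
        # Assumption: each entry becomes a two-member Kollection (key, value).
        return Kollection(tuple(Kollection((k, _wrap(v))) for k, v in mapping.items()))

    # Outward constructors: nested Kollections are left as they are.
    def to_sequence(self) -> list:
        return list(self.members)

    def to_array(self) -> tuple:
        return tuple(self.members)

    def to_map(self) -> dict:
        result = {}
        for entry in self.members:
            if not (isinstance(entry, Kollection) and len(entry.members) == 2):
                raise TypeError("to_map expects two-member (key, value) Kollections")
            key, value = entry.members
            result[key] = value
        return result


def _wrap(value):
    """Turn deeper-level sequences, arrays and maps into Kollections."""
    if isinstance(value, list):
        return Kollection.from_sequence(value)
    if isinstance(value, tuple):
        return Kollection.from_array(value)
    if isinstance(value, dict):
        return Kollection.from_map(value)
    return value  # atomic value or already a Kollection


def size(k):
    """A Kollection-only function: anything else is rejected with a type error."""
    if not isinstance(k, Kollection):
        raise TypeError("size expects a Kollection")
    return len(k.members)
```

With these definitions, `size(Kollection.from_sequence([1, (2, 3), {"a": [4, 5]}]))` is 3, and the nested tuple and dict have themselves become Kollections, so `size` needs no per-type case analysis.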

Regarding point 5, I think the reason Java and C# have been able to come up with a common API for handling different kinds of collection is that they have started by building a level of uniformity into the data model which XDM lacks. Because the XDM data model is essentially a composite of XML and JSON, rather than something designed ab initio, it lacks that conceptual simplicity.

Let us turn back to history. In 2005 very few people believed in the success of LINQ. They regarded it as a curiosity and a toy for "playing functional programming" by a few, very special people. Of course, time proved them wrong.

dnovatchev commented 11 months ago

I think the acid test of whether this works and is truly a simplification is if we can define functions like get(), append(), and fold-left() without the specification resorting to an enumeration of cases: if the input is a sequence, return X, if it is an array, return Y, if it is a map, return Z. I think the characteristics of sequences, arrays, and maps (at the object model level) are sufficiently different that this will be hard to achieve. But if someone wants to prove me wrong, by all means come up with a detailed proposal.

I agree. To start with the most basic function, what would be a reasonable function signature for gn:get, and what would be reasonable results for the following queries?

gn:get( 1, 1 )
gn:get( 1, 2 )
gn:get( [1], 1 )
gn:get( [1], 2 )
gn:get( [1, 2], 2 )
gn:get( ([1, 2], 2), 2 )

@ChristianGruen I am not aware of any function get thus I don't know what the above means. We are at the very starting point where we have to decide what functions should be provided (or not) for processing Kollections.

If there were such a hypothetical function, all of the examples 1 to 6 above would produce a type-error because they were provided non-Kollection arguments.

ChristianGruen commented 11 months ago

@ChristianGruen I am not aware of any function get thus I don't know what the above means. We are at the very starting point where we have to decide what functions should be provided (or not) for processing Kollections.

Thanks. I’ll be more patient, looking forward to your suggestions.

michaelhkay commented 11 months ago

One way to achieve this is by providing 3 Kollection constructors that take a sequence, an array, or a map, and produce a Kollection containing exactly all the members of the initial argument. This will automatically turn any deeper-level arrays, sequences or maps, into collections themselves.

You are now suggesting an extension to the data model to introduce a fundamental new kind of object. Past experience suggests that's a two-year project (100 x 1 hour meetings plus half a dozen week-long face-to-face meetings). There's no evidence to suggest it might be easier this time around. I don't believe we should attempt something on this scale.

dnovatchev commented 11 months ago

One way to achieve this is by providing 3 Kollection constructors that take a sequence, an array, or a map, and produce a Kollection containing exactly all the members of the initial argument. This will automatically turn any deeper-level arrays, sequences or maps, into collections themselves.

You are now suggesting an extension to the data model to introduce a fundamental new kind of object. Past experience suggests that's a two-year project (100 x 1 hour meetings plus half a dozen week-long face-to-face meetings). There's no evidence to suggest it might be easier this time around. I don't believe we should attempt something on this scale.

I know this might seem "too much" at a first glance, but this is time much better spent and still significantly less than continuous and indefinite efforts to invent non-conflicting names for 3 "different" functions that all do "almost the same" - such effort is doomed to continue forever and consume more and more of our time as we add new functions to the old types of collections.

Also, in our case this is way easier to do, as we already have the successful example of C# LINQ. Thus we could even decide for the initial iteration not to "invent" anything but copy almost mechanically what the IEnumerable interface has to offer at present.

Not being the first but the 2nd in a new field provides almost a smooth way to at least accomplish what the first inventors did already with success, and doing all this while having much more information and already-known solutions compared to what otherwise would have been totally unexpected problems with no certainty of success.

Arithmeticus commented 11 months ago

I am skeptical. I fear this proposal would introduce chaos or confusion. I am happy to be proved wrong, but for that to happen, I would like to see proponents provide a summary list of where the specs would need to be changed. (And the follow-up question, of course, is who is willing to do that work.)

If I understand correctly, the proposal is concerned primarily with issues raised in #843, so I'll make my own countersuggestion in that thread.

dnovatchev commented 11 months ago

I am skeptical. I fear this proposal would introduce chaos or confusion. I am happy to be proved wrong, but for that to happen, I would like to see proponents provide a summary list of where the specs would need to be changed. (And the follow-up question, of course, is who is willing to do that work.)

If I understand correctly, the proposal is concerned primarily with issues raised in #843, so I'll make my own countersuggestion in that thread.

Actually, no confusion can arise, as the set of functions in the new namespace has an empty intersection with each of the fn:, array: and map: function sets:

A user may choose to ignore the new data type completely - they continue to work as at the present moment - thus no confusion.
Or, a user could choose to work only with Kollection instances and only if necessary convert the final results to one of the 3 present-day collection types - again no confusion.

Arithmeticus commented 11 months ago

That doesn't answer my point at all, which was for a list of places where the specs would need to be changed, if this proposal were to be realized.

dnovatchev commented 11 months ago

That doesn't answer my point at all, which was for a list of places where the specs would need to be changed, if this proposal were to be realized.

This of course needs to be determined precisely and will most likely be an iterative process.

What can be said immediately is that none of the existing function descriptions will need to be changed -- only the functions in the new namespace will be added.

MarkNicholls commented 11 months ago

@dnovatchev

I'm a little unclear what the proposal is, can you make it real? (even in a crude way that you would immediately want to amend or dismiss)

You talk about Linq as an analogy, or as something more concrete?

If more than an analogy, XQuery sits on top of the same CS theory as Linq, i.e. this abstraction already exists (broadly), it's just that XQuery is an "instance" of it? are you proposing other "instances"?

dnovatchev commented 11 months ago

@MarkNicholls

Here is a good example from C# - the IEnumerable interface.

If you click on the link above, you will get to the official Microsoft page about the IEnumerable interface. Then, if you click on More... under Derived, you will see hundreds of classes implementing the IEnumerable interface.

In a nutshell, this is an interface that almost all .NET classes in the System.Collections namespace implement.

A C# programmer knows that regardless of what type of a collection he is working with, all the methods and properties that the IEnumerable interface exposes are there to use.

The idea here is to have something similar -- an abstraction that lies in the base of all known and future XPath collection types, so that the user will not have to learn and use M * N sets of (N) similar functions (each for one of the M existing collection datatypes) but just a single set of these functions, that is provided by the Kollection abstraction (namespace).

We do not have the same concepts of inheritance here, but we can have Kollection constructors that turn an instance of any of the M existing collection datatypes into a Kollection. And also the reverse constructors, that turn a given Kollection into one of the desired present-day collection datatypes.

At present M = 3 and the mentioned collection datatypes are: Arrays, Sequences and Maps.

MarkNicholls commented 11 months ago

@dnovatchev

ah ok, that helps a bit, but there's several layers of stuff going on in dotnet.

i) There's IEnumerable, it's just a factory method for an IEnumerator.
ii) IEnumerator has 3 methods, current/movenext/reset.
iii) Then there's the monadic constructs i.e. the LINQ combinators themselves and the syntactic sugar that makes them accessible.
iv) then there's a set of library functions (rather than methods) that are built to be used with LINQ (but are quite useful in general), not that many.

You can steal i) and ii) (some languages dont do "i") and they're a bit imperative, "tryHead" and "tail" would work too, I don't think you need anything more than that.

I don't think it really gets you anywhere interesting until you get to iii)...but iii) is very very closely related to XQuery, so you have it already? You can generalise it (like LINQ (amongst others) does) and make it applicable to other constructs.

Is this the sort of thing you are suggesting?...or are you just saying you want a nicely defined set of functions i.e. "iv" built on top of "ii"? (which sounds perfectly sensible and quite achievable)

dnovatchev commented 11 months ago

@dnovatchev

ah ok, that helps a bit, but there's several layers of stuff going on in dotnet.

i) There's IEnumerable, it's just a factory method for an IEnumerator. ii) IEnumerator has 3 methods, current/movenext/reset.

We already have a proposal for something similar, called "generator", which combines the concept of iterator and generator, but doesn't provide any syntactic sugar for the latter.

iii) Then there's the monadic constructs i.e. the LINQ combinators themselves and the syntactic sugar that makes them accessible. iv) then there's a set of library functions (rather than methods) that are built to be used with LINQ (but are quite useful in general), not that many.

Could you please specify more precisely what exactly you are talking about here? If you mean methods like Select, Where, SelectMany, Any, Count, First, etc., yes, we already have these, but again - in tripled quantity, and we want to have to use just one with the Kollection abstraction.

If you mean something else, then please, refine/explain.

You can steal i) and ii) (some languages dont do "i") and they're a bit imperative, "tryHead" and "tail" would work too, I don't think you need anything more than that.

I don't think it really gets you anywhere interesting until you get to iii)...but iii) is very very closely related to XQuery, so you have it already? You can generalise it (like LINQ (amongst others) does) and make it applicable to other constructs.

Once again, could you be more specific about exactly which features/components of LINQ you are speaking about?

Is this the sort of thing you are suggesting?...or are you just saying you want a nicely defined set of functions i.e. "iv" built on top of "ii"? (which sounds perfectly sensible and quite achievable)

We definitely are talking here about both "ii" and "iv", and probably also about "iii" if you explain what "iii" actually means.

MarkNicholls commented 11 months ago

ok, so I'll assume i) and ii).

iv) is just about defining the various 'standard' functions you want to use across the various instances of your iterator/generator.

iii) differs from iv) in the sense its just syntax (that requires some specific functions to be defined), though iv) contains many more functions that aren't part of the sugar, they're just useful across types that support IEnumerable.

So Sum(IEnumerable<int>), for example, can be just a useful function which you could make applicable to Array, Map (?) and Sequence in a uniform way (so for me part of just section "iv").

but Where/Select/SelectMany are 'special' and have 2 forms, one without sugar and one with (so are part of the sugar "iii").

without is...

var allLegColours = cows.SelectMany(cow => cow.Legs().Select(leg => leg.Colour()));

and with is

var allLegColours =
   from cow in cows
   from leg in cow.Legs()
   select leg.Colour();

this latter format is effectively XQuery, and there would seem little reason to reinvent the wheel; it may be interesting to put it in different places or apply it to different collections, but it's already there.
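For readers who do not use C#, the same contrast between a combinator form and a sugared form can be sketched in Python. This is only an analogy under assumed data: the `cows` list and the `select_many` helper are made up for illustration, and a nested comprehension plays the role of the from/from/select syntax.

```python
from itertools import chain

# Hypothetical sample data standing in for the cows and their legs.
cows = [
    {"name": "Daisy", "legs": [{"colour": "brown"}, {"colour": "white"}]},
    {"name": "Bella", "legs": [{"colour": "black"}]},
]


def select_many(source, selector):
    """Rough analogue of SelectMany: map each item to a sub-collection and flatten."""
    return list(chain.from_iterable(selector(item) for item in source))


# Without sugar: explicit combinators, as in SelectMany(... Select(...)).
without_sugar = select_many(
    cows, lambda cow: map(lambda leg: leg["colour"], cow["legs"])
)

# With sugar: a nested comprehension, read much like the from/from/select form.
with_sugar = [leg["colour"] for cow in cows for leg in cow["legs"]]
```

Both expressions produce the same flat list of leg colours; only the surface syntax differs.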

dnovatchev commented 11 months ago

ok, so I'll assume i) and ii).

iv) is just about defining the various 'standard' functions you want to use across the various instances of your iterator/generator.

iii) differs from iv) in the sense its just syntax (that requires some specific functions to be defined), though iv) contains many more functions that aren't part of the sugar, they're just useful across types that support IEnumerable.

So Sum(IEnumerable<int>), for example, can be just a useful function which you could make applicable to Array, Map (?) and Sequence in a uniform way (so for me part of just section "iv").

but Where/Select/SelectMany are 'special' and have 2 forms, one without sugar and one with (so are part of the sugar "iii").

without is...

var allLegColours = cows.SelectMany(cow => cow.Legs().Select(leg => leg.Colour()));

and with is

var allLegColours =
   from cow in cows
   from leg in cow.Legs()
   select leg.Colour();

this latter format is effectively XQuery, and there would seem little reason to reinvent the wheel; it may be interesting to put it in different places or apply it to different collections, but it's already there.

(There are languages that allow you to extend the sugar, but I'm not a fan, in a similar way to our conversation about magic strings, allowing people to invent languages inside languages in multiple syntaxes at some point just hurts your head).

We are talking only about functions here, not about adding any syntactic sugar, as this would put additional (and mostly unneeded) strain on the implementors.

MarkNicholls commented 11 months ago

ok, its now clear to me what the proposal is.

MarkNicholls commented 11 months ago

actually having let the cogs turn I'm not 100%.

if you are saying i, ii, iv, then are you saying that there is a standard single function e.g. sort(), that can be applied to any collection and utilises the standard interfaces defined in i & ii to do the sort?

In which case, I would ask, what is the benefit of that over having function for each collection, named identically (for simplicity and substitution) but implemented in an optimal manner for that collection (for efficiency)?

or are you suggesting the latter?

So I'm still not 100% clear.

dnovatchev commented 11 months ago

@dnovatchev

1st things 1st, where's the proposed iterator generator?

(I may be being very naive, but I don't see any significant problem here...except the risk of getting it horribly wrong).

The word "generator" in my earlier reply is a hyperlink. Just click on that link 😄

dnovatchev commented 11 months ago

actually having let the cogs turn I'm not 100%.

I have the feeling that we are moving in a closed circle with no progress 😢

if you are saying i, ii, iv, then are you saying that there is a standard single function e.g. sort(), that can be applied to any collection and utilises the standard interfaces defined in i & ii to do the sort?

I never said such a thing. The Kollection concept has only just been proposed; there are no functions in it yet. If there were a sort function in the Kollection namespace, which seems quite probable, it would sort only instances of Kollection - not instances of "any collection".

The idea is to work only with instances of Kollection. This is not a proposal that any present-day collection type should implement the "Kollection interfaces", but that the user will construct instances of Kollection and work only with these instances until the wanted results are produced. At that moment, and if necessary, the user can explicitly convert the Kollection result to any present-day collection.

In which case, I would ask, what is the benefit of that over having function for each collection, named identically (for simplicity and substitution) but implemented in an optimal manner for that collection (for efficiency)?

You are asking what is the benefit of having N functions rather than M * N functions? Seriously? Then let us stop here.

or are you suggesting the latter?

So I'm still not 100% clear.

MarkNicholls commented 11 months ago

You are asking what is the benefit of having N functions rather than M * N functions? Seriously? Then let us stop here.

absolutely, rarely are there free lunches, and I think you are proposing this, just that N = 5ish.

OK, looking at the generator proposal makes it clearer, sorry for not clicking the link earlier, LINQ and N * M was confusing me.

If you have N signatures in your generator and M data types, then you will have N * M implementations. For IEnumerable N is small (3) and this isn't a problem.

For the functions that you build on top, yes, there will be 1 implementation, and these functions will be hamstrung to operate only via your generator signature definition (sometimes a problem), and if you add the signatures to your generator, then N <- N+1, and you may have another problem.
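The counting argument above can be sketched in Python. Everything here is hypothetical: the two primitive signatures `try_head` and `tail` are borrowed from the earlier "tryHead"/"tail" remark rather than from the generator proposal itself, and only M = 2 backing types are shown for brevity, so there are N * M = 4 primitive implementations but a single implementation of each function built on top.

```python
from typing import Any, Optional, Protocol, Tuple


class Generator(Protocol):
    """The N = 2 primitive signatures each collection type must implement."""

    def try_head(self) -> Optional[Tuple[Any]]: ...
    def tail(self) -> "Generator": ...


class SequenceGen:
    """Primitives implemented over a Python list (stand-in for a sequence)."""

    def __init__(self, items, start=0):
        self._items, self._start = items, start

    def try_head(self):
        # A one-element tuple distinguishes "item is None" from "no item".
        if self._start < len(self._items):
            return (self._items[self._start],)
        return None

    def tail(self):
        return SequenceGen(self._items, self._start + 1)


class MapGen:
    """Primitives implemented over a dict; members are (key, value) pairs."""

    def __init__(self, mapping, _entries=None, _start=0):
        self._entries = list(mapping.items()) if _entries is None else _entries
        self._start = _start

    def try_head(self):
        if self._start < len(self._entries):
            return (self._entries[self._start],)
        return None

    def tail(self):
        return MapGen({}, self._entries, self._start + 1)


def fold_left(gen, zero, step):
    """Written once, using only the primitives, so it works for every type."""
    acc, head = zero, gen.try_head()
    while head is not None:
        acc = step(acc, head[0])
        gen = gen.tail()
        head = gen.try_head()
    return acc


def count(gen):
    return fold_left(gen, 0, lambda acc, _item: acc + 1)


def to_list(gen):
    return fold_left(gen, [], lambda acc, item: acc + [item])
```

The trade-off mentioned above is visible here: `count` and `to_list` can only walk the members one by one through the primitives, even when the backing type could answer faster, and adding a third primitive would mean touching every backing class.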

Is this a fair summary of the proposal?

If so, it seems sensible.

P.S. I find it misleading to use LINQ as an analogy. "Language INtegrated Query" has little to do with what you are talking about; IEnumerable just happens to be one of many datatypes that can be used in this way, and predates LINQ by many many years (in a non-generic form), but so are Option<>, Task<>, Future<> etc etc. If you were suggesting allowing XQuery to query a "Kollection" then the analogy stands just for that specific integration.

dnovatchev commented 10 months ago

Is this a fair summary of the proposal?

If so, it seems sensible.

I am glad that we have established a common understanding here.

qt4cg / qtspecs

Introduce a Kollection object with functions that operate on all types of items that can be containers of unlimited number of "members" #910