Proposal to introduce the set datatype in XPath 4

dnovatchev commented 3 years ago

It is high time that we come up with a set type in XPath. We actually have to deal all the time with sets (not just node-sets, but sets of any-type values), and it is painful to read in the spec how two maps are compared for equality when explaing fn:deep-equal():

"If $i1 and $i2 are both ·maps·, the result is true if and only if all the following conditions apply:

Both maps have the same number of entries.

For every entry in the first map, there is an entry in the second map that:

has the ·same key· (note that the collation is not used when comparing keys), and

has the same associated value (compared using the fn:deep-equal function, under the collation supplied in the original call to fn:deep-equal)."

When if we had the set type the above would simply say:

"If $i1 and $i2 are both ·maps·, the result is true if and only if the sets of their keys are equal, and the corresponding values for each key in the two maps are deep-equal."

I propose that starting with XPath 4.0 we introduce the set type and define set equality, the union ( | ), intersection (intersect) and set difference (except) not only for node-sets but for sets of any-typed values.

Then we can have a function: *`to-set($collection as item()) as set`**, which would produce a set (of the distinct values) of any collection-typed argument supplied to it: sequence(its distinct values) , map (a set of its entries), array (a set of its members).

This makes fn:distinct-values() almost unnecessary.

We will no longer have to explain in a "Remarks" section that the result of a function is "unordered" or that its order is "implementation-defined" -- just by making this function return a set.

How can almost all major programming languages (not even speaking of SQL), such as C#, Python and Java have a set data type / interface, but even in XPath version 4 we still have to describe it in a free language narrative?

Thanks, Dimitre

rhdunn commented 3 years ago

I support being able to perform the set operations on atomic types -- it makes expressions like (1, 2) union (1, 3) possible.

Having the rules explicitly defined makes it clear what the semantics of the function/expression are without constraining a given implementation. The "implementation-defined" ordering allows an implementation to use a set data structure, but would also allow a database (like MarkLogic and eXist-db) to make use of indices and database tables; saying it returns a set (in which the semantics would need defining) would prevent the use of indices, or make it difficult to do that in a standards-conforming way.

I would prefer adding a fn:distinct-nodes function that operated like fn:distinct-values does, but for nodes. That way, existing code will still work and the names are similar. (There is a functx:distinct-nodes function that does this.)

It may also be useful to define/support a new set(*)/set(ItemType) SequenceType that accepts node or atomic type ItemTypes. The type of that variable/parameter would then be a sequence with the restriction that the values are unique (duplicates are removed using node/atomic type identity) in an implementation-defined order. -- This would be equivalent to passing the sequence/array values assigned to the variable/parameter through the fn:distinct-nodes or fn:distinct-values function depending on the ItemType associated the set.

michaelhkay commented 3 years ago

If we want to support operations on sets of atomic values, I don't think that overloading the existing operators is a good idea. Without strong typing in the language, overloading existing operators can mean that constructs that were previously statically optimised can no longer be optimised in the same way: for example, you can no longer infer that the result of "union" consists entirely of nodes. I would prefer to do it with a collection of functions that operate only on sets of atomic values (set:of, set:union, set:intersect, set:difference, set:contains etc).

I wouldn't want to mix sets of atomic values with sets of nodes, because the equality operator is so different in the two cases.

If sets of atomic values are represented as maps, then implementation of functions such as (set:of, set:union, set:intersect, set:difference, set:contains) is very straightforward and arguably doesn't need any language support. By contrast, introducing a new fundamental type to the type system is very costly in terms of increased complexity, and shouldn't be attempted without very good reason. This is largely because unlike object-oriented languages like Java, Python, etc, the type system isn't extensible in the same way.

dnovatchev commented 3 years ago

@michaelhkay,

I wouldn't want to mix sets of atomic values with sets of nodes, because the equality operator is so different in the two cases.

We need to recognize the fact that sets of nodes already exist in the language, these are the node-sets from XPath 1.

Although with added ordering and thus disguised as sequences, the node-sets don't magically stop being sets and the operators on them don't stop being set operators. Even the proposed addition of mathematical symbols: ∪, ∩, ∖,∃, ∀ misses the fact that all these are used as set operators in Math. Just starting to use them on sequences would confuse and mislead most people, who know that these actually work on sets.

We need to recognize and clarify this fact, and treat sets as what they are: sets.

If we want to make a sequence from a set, we can give the set a particular ordering -- this would make a good function. We can give as many orderings as we want to a set and make it possible to produce different sequences from this set, just by using one or another ordering it has been provided with.

As you say in your comment, adding support for sets of items of other types (such as atomic) is easy.

The value for the users of the language is the same as the value sets have in other programming languages.

The lack of sets as a concept results in the current situation: user questions about these strange "sequences with de-duplicated values", and also the big volume of free language narratives in the specifications, describing the properties of what otherwise would use just one, strictly defined word: set.

I strongly support the recognition of sets in the language and defining the set data type. Any excuses that this "would require much work" sound strange - aren't we here to do work? If we try not to do work, then why are we here at all?

dnovatchev commented 1 year ago

Here is one actual use-case, taken from a request for help from Martin Holmes in the "general" channel of the xml.com Slack:

"I have a collection of sequences of strings (ids, actually) stored in a map; some of these may be identical, although in different orders within the sequence. I'd like to generate the list of distinct sequences. The obvious way is to turn each sequence into an alphabetically-ordered string using string-join and a delimiter, then run distinct-values on those, and then retokenize them. Is there a more elegant way?"

This would be trivial to do, had we the set datatype in the language, but caused obvious blocking problems to the OP and required extensive problem solving until we came up with the following non-trivial XPath solution, which invokes repeatedly 7 different standard functions:

let $seq1 := (1, 3, 5, 6, 8),
    $seq2 := (2, 4, 6, 9, 11),
    $seq3 := (5, 6, 3, 8, 1),
    $makeSet := -> ($seq) { map:merge( $seq ! -> {map:entry(., 1)} (.) ) },
   $makeSetFromKeys := function($m as map(*)) { map {sort(map:keys($m)) => string-join('!!!') : 1}},

   $distinctSets := [$seq1, $seq2, $seq3] => array:for-each($makeSet) 
                                          => array:for-each($makeSetFromKeys) => array:flatten()

 return
    map:keys(map:merge($distinctSets))

michaelhkay commented 1 year ago

I definitely consider this a step too far. There are use cases, but they are not nearly strong enough to justify the degree of disruption this data model change would cause. It's this kind of change that leads to new releases of the specs taking 7-10 years to create.

ChristianGruen commented 1 year ago

We could add an array:distinct-members function for Martin Holmes’ use case. The members would be compared via deep-equal.

dnovatchev commented 1 year ago

I definitely consider this a step too far. There are use cases, but they are not nearly strong enough to justify the degree of disruption this data model change would cause. It's this kind of change that leads to new releases of the specs taking 7-10 years to create.

A definition for "strong enough", please?

If we really wanted to support diversity (btw this is an open action item of the CG), then we would be more tolerant and accepting of new ideas, especially when there is actual demand for them, based on the expressed needs of community members.

Good that scientific knowledge didn't have to evolve under so strict compatibility norms (similar to those in the X* languages) or otherwise we would still say that the Earth was flat and maybe Australia wouldn't exist.

dnovatchev commented 1 year ago

We could add an array:distinct-members function for Martin Holmes’ use case. The members would be compared via deep-equal.

Definitely, though this alone wouldn't be enough. Martin Holmes needed to transform each sequence to a normal (sorted) form, which actually is to convert and use it as a set, before doing any comparisons.

ChristianGruen commented 1 year ago

maybe Australia wouldn't exist.

The aboriginals would have got along without us, I guess.

dnovatchev commented 1 year ago

maybe Australia wouldn't exist.

The aboriginals would have got along without us, I guess.

😃**😃**😃

ChristianGruen commented 1 year ago

@dnovatchev In #479, you indicated a counter could be added to sets. How would an example for that look like?

And how would you like to represent sets in general?

michaelhkay commented 1 year ago

@dnovatchev Please feel free to put forward a proposal for adding sets to the data model. Such a proposal should include answers to the following questions, with rationale:

Are these sets of items or sets of values/sequences? Are there restrictions on what they can contain (e.g. functions)
Is a set an item? Can you have sets of sets?
What ItemType / SequenceType syntax is to be used for sets?
What are the subsumption rules for set types (and their relation to other types)
What comparison function is to be used for comparing the items (or values) in sets to see if they are duplicates? Can this be parameterised and if so how? For example, are nodes compared by identity or by deep-equality, or is this a user option?
What new functions and operators need to be provided for sets?
How are existing functions and operators impacted if sets are supplied as arguments? Are the changes backwards compatible?
How are sets to be converted to/from other types such as sequences and arrays? Is there implicit conversion for example in the coercion rules?

Until we have a detailed proposal on the table that provides full answers to these questions, we cannot take this any further.

I repeat my initial view that the complexity added to the specifications is likely to exceed the benefits, but until I see the full technical detail, I will keep an open mind. So far all you have said is that you think this feature would be a good idea; you haven't given any technical specifications of how it would work.

dnovatchev commented 1 year ago

@dnovatchev Please feel free to put forward a proposal for adding sets to the data model. Such a proposal should include answers to the following questions, with rationale:

Are these sets of items or sets of values/sequences? Are there restrictions on what they can contain (e.g. functions)

Is a set an item? Can you have sets of sets?

What ItemType / SequenceType syntax is to be used for sets?

What are the subsumption rules for set types (and their relation to other types)

What comparison function is to be used for comparing the items (or values) in sets to see if they are duplicates? Can this be parameterised and if so how? For example, are nodes compared by identity or by deep-equality, or is this a user option?

What new functions and operators need to be provided for sets?

How are existing functions and operators impacted if sets are supplied as arguments? Are the changes backwards compatible?

How are sets to be converted to/from other types such as sequences and arrays? Is there implicit conversion for example in the coercion rules?

Until we have a detailed proposal on the table that provides full answers to these questions, we cannot take this any further.

I repeat my initial view that the complexity added to the specifications is likely to exceed the benefits, but until I see the full technical detail, I will keep an open mind. So far all you have said is that you think this feature would be a good idea; you haven't given any technical specifications of how it would work.

@michaelhkay . @ChristianGruen

Gentlemen,

Thank you for your encouragement and support.

It will really be much better if we all try to tackle this big and important problem together, will it not?

The first, and most important stage would be brain-storming, where any idea, no matter how fantastic, can be thrown on the table.

My idea of a set is the keys of a map, after we extend the concept of map to allow having not only atomic keys, but also keys that are themselves maps or arrays. As @michaelhkay recently showed to us, even some "atomic" items do have their own set of properties, thus it wouldn't be going too-far to allow composite items as the keys of a map. And given that fn:deep-equal can be used for comparison of such composite keys, this is really possible.

Plus any set that contains at least one function item, that is not a map, array or a constant - in this case we will not represent this function item as a key of a map, but will know that this set isn't equal to any other set.

We could try in general to assign to every item a Guid property, which could be used to allow considering 2 function items with the same Guid value equal, but this may come later and I don't believe it is so important at the very early stages of getting this crystalized.

Let us all start addressing the 8 bullets above, and I believe that in this process we will be able to arrive at a reasonable solution.

I hope that this is just the beginning of us reasoning about the definition and rules governing the concept of set.

Let's do it!

ChristianGruen commented 1 year ago

@dnovatchev We currently have >160 open issues, for which pull requests need to be created, which need to be individually reviewed and accepted, and for which test cases need to be written. I understand that you'd like to get more support for doing this work on sets. I assume everyone hopes to get as much feedback as possible for one's own ideas, and I assume we all do our best. To get things done, it gets more and more clear that we need a concrete proposal for each topic that already gives answers to as many questions as possible, which can then be discussed together and completed. If that's too ambitious for sets, we need at least a basic draft (a solid base) that can be enhanced step by step.

dnovatchev commented 1 year ago

What are the subsumption rules for set types (and their relation to other types)

@michaelhkay

Thanks for the questions you summarized in the 8 bullets. Could you, please, paraphrase this one (above) so that I (a non-native English speaker) can be sure I understand its meaning?

michaelhkay commented 1 year ago

Subsumption = the rules for deciding whether one type is a subtype of another, that is, if we introduce a new type or family of types, how does this affect the is-subtype() relation given in XPath section 3.7? (The specs do not actually use the term subsumption, sorry!)

dnovatchev commented 1 year ago

@michaelhkay, I intentionally haven't gone deeper into this issue because I believe it is logically connected to our work and acceptance of another related issue, which, despite being well-defined, has not been allowed to progress for almost 2 years :

Allow a map's key value to be any sequence

Could we, please, be more constructive and do something about this?

michaelhkay commented 1 year ago

I agree these issues are closely related. Both boil down to the same problem: how do you define the equality semantics, among a set of arbitrary values, and more particularly, how do you allow the user to define the equality semantics? I haven't found an adequate answer to that question, which is why maps in 3.1 use a system-defined, context-free, transitive, error-free comparator for keys.

michaelhkay commented 1 year ago

So one way forward would be to say that a set is simply a map in which all the values are (). That reuses all the infrastructure for maps, which considerably reduces the complexity. We can add a few convenience functions like set:contains() and set:add() but they are cosmetic.

It then leaves the problem of how to allow maps with user-defined equality semantics for comparing keys.

If we adopt the idea in #334 that any XDM value can have additional properties, then we could allow a map to have two function-valued properties, defining equals() and hashcode() behaviour. Functions that modify a map, like map:put() and map:filter(), would retain the properties of the input map in the output. Functions that create or combine maps would need to have some mechanism to supply these properties, either explicitly or by default.

But of course this leaves many unanswered (and difficult) questions. How do you compare two maps that have different equality functions? What does it even mean for two equality functions to be different? Do we need to introduce function identity?

We could of course extend the domain for keys in maps without allowing user-defined equality comparisons. For example we could allow nodes as keys and insist on using node identity as the comparator. But I'm not convinced this would be useful.

ndw commented 1 year ago

Function identity seems difficult, perhaps impossible to achieve in the general case. But we might finesse that by saying that functions are different unless the implementation knowns they're the same. If I use the same global function, for example, it doesn't strike me as too much bookkeeping for the implementation to know they're the same. If I use two anonymous inline functions, well, that was just asking for trouble, wasn't it?

It strikes me that there are three plausible ways to compare maps with different equality functions:

They are the same if and only if they are the same with both sets of functions
They are the same if and only if they are the same according to at least one set of functions
It is an error to attempt to compare maps that use different functions

I lean towards 3, which has the advantage that we can always define it later and remove the error in a backwards-compatible way. But I confess I have no useful intuition about whether an application is like to have 1, 2, or 87 different sets of equality functions.

dnovatchev commented 1 year ago

So one way forward would be to say that a set is simply a map in which all the values are ().

This is exactly the idea. One can even imagine constructing a multi-set (set of items where more than one item can have the same value) as a map in which the values are positive integers (occurrence count, aka the multiplicity).

That reuses all the infrastructure for maps, which considerably reduces the complexity. We can add a few convenience functions like set:contains() and set:add() but they are cosmetic.

It then leaves the problem of how to allow maps with user-defined equality semantics for comparing keys.

It seems logical to use the map's comparison function when checking for a membership or adding a new entry to the map:

4 ∈ $X

Here the comparison function of $X will be used

The same comparison function of $X will be used for:

4 ∉ $X

If we adopt the idea in #334 that any XDM value can have additional properties, then we could allow a map to have two function-valued properties, defining equals() and hashcode() behaviour. Functions that modify a map, like map:put() and map:filter(), would retain the properties of the input map in the output. Functions that create or combine maps would need to have some mechanism to supply these properties, either explicitly or by default.

I somehow missed this new proposal, and on first reading it seems very complicated and maybe not necessary in this case.

But of course this leaves many unanswered (and difficult) questions. How do you compare two maps that have different equality functions?

There isn't a single comparison operator for two sets. There is set inclusion:

$Y ⊆ $X

by definition this is:

every $y in $Y satisfies $y ∈ $X

and as said above, in this case the comparison function of $X must be used.

If we want to check:

$X ⊆ $Y , this means:

every $x in $X satisfies $x ∈ $Y

so, the comparison function of $Y must be used.

Finally, set equality: Two sets $X and $Y are equal if both of the following are true:

$X ⊆ $Y

and

$Y ⊆ $X

We can still have $X and $Y to have their own unique comparison functions, and use $Y 's comparator in evaluating the first expression and $X 's comparator in evaluating the second expression. I don't see any problems with this.

Or, if we decide, we can define that in the case when $X and $Y have different comparators, then these two sets are not equal.

Similarly,

$X ∪ $Y is {$z \ $z ∈ $X or $z ∈ $Y}

and

$X ∩ $Y is {$z \ $z ∈ $X and $z ∈ $Y}

Then if we denote the comparison function of a set $X as Comp($x), We will have (informally):

Comp($X ∪ $Y) is Comp($x) or Comp($Y)

and

Comp($X ∩ $Y) is Comp($x) and Comp($Y)

What does it even mean for two equality functions to be different? Do we need to introduce function identity?

We could of course extend the domain for keys in maps without allowing user-defined equality comparisons. For example we could allow nodes as keys and insist on using node identity as the comparator. But I'm not convinced this would be useful.

Yes, and why do we need different maps to have different comparators, in the first place? I am not at all convinced that this is necessary. Two nodes that have different identity are obviously not equal.

Why should we not support just this simple and most natural equality?

liamquin commented 1 year ago

Sets should behave in ways that do not surprise people who use union, intersect and is operators. So they should work on node identity, atomic value comparison, and maybe an extended deep equal that uses function identity. There was (i recall) some discussion about allowing maps to have an identity, so that they could be persisted e.g. in collections. Although i was never convinced that things need identity to be persisted, i remember there were some problems around it. But if we accept the idea that two inline functions are not is-equal, just as two element nodes with the same element name, type annotation, string-value are deep-equal but not is-equal ($a is $b returns false) sets could be based on "is" semantics, and then we don't need equality functions. Probably i'm missing something.

PieterLamers commented 1 year ago

A few cents: I personally like sets and regularly test against sequences where I am really testing against set membership. Also I sometimes convert my sequence to a map to get faster results. Yet I understand that it may be difficult to add them to the landscape, for reasons that Mike and others stipulated. I don't think the XPath 1 'nodeset' was intended to be a set, and XPath 2 'everything is a sequence' seems to have disambiguated that. This fits the nature of XML documents, where the nodes are always in a particular order. As we have been using XPath and XQuery increasingly for data as well as documents, I understand the desire for tools that fit data types other than sequences. I am thankful for both @dnovatchev addressing the desire to have sets and for @michaelhkay pointing out that it's not that easy to fit them in nicely. I totally expect @rhdunn to come up with a good solution :-)

dnovatchev commented 1 year ago

I understand the desire for tools that fit data types other than sequences. I am thankful for both @dnovatchev addressing the desire to have sets and for @michaelhkay pointing out that it's not that easy to fit them in nicely. I totally expect @rhdunn to come up with a good solution

Very good thoughts, @PieterLamers,

Working as a team adds more than the mere constituents of the group.

dnovatchev commented 1 year ago

Sets should behave in ways that do not surprise people who use union, intersect and is operators. So they should work on node identity, atomic value comparison, and maybe an extended deep equal that uses function identity. There was (i recall) some discussion about allowing maps to have an identity, so that they could be persisted e.g. in collections. Although i was never convinced that things need identity to be persisted, i remember there were some problems around it. But if we accept the idea that two inline functions are not is-equal, just as two element nodes with the same element name, type annotation, string-value are deep-equal but not is-equal ($a is $b returns false) sets could be based on "is" semantics, and then we don't need equality functions. Probably i'm missing something.

I completely agree with @liamquin here.

We need just one comparison function, for atomic types it can be deep-equal (not raising errors) and for nodes, the op("is") function.

@michaelhkay Do you agree that this solves the problem and we can now extend the types for map-keys to be any identifiable item (not just atomic one)?

Both @liamquin and I think this kind of equality is what we need when doing set-operations.

michaelhkay commented 1 year ago

Yes, I think there is a way forward here.

A lot of the problems go away if we impose a system-defined equality function rather than allowing it to be user-defined. That's a restriction, of course, but we can still deliver a lot of functionality within that constraint.

The equality function should be context-free, transitive, and error-free, and ideally it should work on arbitrary sequences. This rules out deep-equal. Let's call it fn:identical(), with the following sketch as a spec:

Two sequences are identical if the items are pairwise identical
Two atomic values are identical if fn:atomic-equal() returns true (we can now drop this function)
Two nodes A and B are identical if A is B
Two maps are identical if their keys and values are identical
Two arrays are identical if their members are pairwise identical
Two functions are identical... A lot of discussion of this in issue #333 but we never completely cracked it. Let's assume it's doable, though it's certainly a stumbling block.

With this, we can extend maps to allow any value as the key.

A map then becomes usable to represent a set. Without introducing a new fundamental data type, we could introduce constructs that facilitate the use of maps to represent sets. Let's say that by convention, a set is represented as a map in which the value part of every entry is (). Then we can introduce ItemType syntax set(X) as a synonym for map(X, empty-sequence()) (by making it equivalent to an existing type, we get all the subsumption rules for free). The map:contains() function tests for set membership. We can introduce a function set:of(SEQ) where SEQ is any sequence, and the result is a set (map) that eliminates duplicates from SEQ. And we can introduce functions that form the union, intersection, or difference of two sets, or that test whether one set is a subset of another.

dnovatchev commented 1 year ago

Yes, I think there is a way forward here.

A lot of the problems go away if we impose a system-defined equality function rather than allowing it to be user-defined. That's a restriction, of course, but we can still deliver a lot of functionality within that constraint.

Exactly. There is just one notion of identity when dealing with sets. Any "separate interpretations" actually compare different projections of the set members.

Two set members are not identical if not all of their properties are the same, it is that simple. For nodes, if just one such property, say the value of fn:generate-id is different, then these two nodes are not identical.

The equality function should be context-free, transitive, and error-free, and ideally it should work on arbitrary sequences. This rules out deep-equal. Let's call it fn:identical(), with the following sketch as a spec:

Two sequences are identical if the items are pairwise identical

Two atomic values are identical if fn:atomic-equal() returns true (we can now drop this function)

Two nodes A and B are identical if A is B

Two maps are identical if their keys and values are identical

Two arrays are identical if their members are pairwise identical

Two functions are identical... A lot of discussion of this in issue Equality of function items #333 but we never completely cracked it. Let's assume it's doable, though it's certainly a stumbling block.

With this, we can extend maps to allow any value as the key.

A map then becomes usable to represent a set. Without introducing a new fundamental data type, we could introduce constructs that facilitate the use of maps to represent sets. Let's say that by convention, a set is represented as a map in which the value part of every entry is (). Then we can introduce ItemType syntax set(X) as a synonym for map(X, empty-sequence()) (by making it equivalent to an existing type, we get all the subsumption rules for free). The map:contains() function tests for set membership. We can introduce a function set:of(SEQ) where SEQ is any sequence, and the result is a set (map) that eliminates duplicates from SEQ. And we can introduce functions that form the union, intersection, or difference of two sets, or that test whether one set is a subset of another.

I only want to propose adding this:

For reasons discussed at several moments in several issues, we want to be able to maintain system-level-properties of any item (of any type). We can do this nicely if a special key-value is preserved "for system use" and not allowed to be set/updated in any XPath expression, except by calling system functions.

For example, one such special value may be the sequence: (xs:double("INF"), -xs:double("INF"))

Under this key we may store any desired "system properties" for any item.

There are no compatibility issues, because having sequences as keys was never a feature in the past versions.

michaelhkay commented 1 year ago

Two set members are not identical if not all of their properties are the same, it is that simple.

No, it's not that simple, it never is! This competes with the substitutability principle, which says that xs:integer(3) is always substitutable for xs:decimal(3) - the type annotation of an atomic value should be allowed to differ.

michaelhkay commented 1 year ago

I propose that we introduce sets as a specialization of maps. This is related but orthogonal to issue #119, which attempts to generalize what can be used as a key in a map: any improvements we make to maps will automatically apply to sets.

We introduce new ItemType syntax set(T) which is deemed equivalent to map(T, empty-sequence()). This implicitly defines all the type subsumption rules.

We define a new suite of functions to operate on sets, in a new namespace.

set:of($keys)
set:contains($set, $key)
set:put($set, $key)
set:remove($set, $keys)
set:keys($set)
set:filter($set, $predicate)
set:for-each($set, $action)
set:union($set1, $set2)
set:intersection($set1, $set2)
set:difference($set1, $set2)
set:contains-all($set1, $set2)

map:key-set($map) returns the set of keys in the map.

Because a set is actually a map, all the map functions are available.

Any operation that constructs an instance of map(T, ()) is creating something that can be used as a set, for example map{'a':(), 'b':()}.

Perhaps we provide the syntax set{expr, expr, ...} which is equivalent to map{expr:(), expr:(), ...}. Except that duplicates should be ignored (eliminated) rather than causing an error.

benibela commented 1 year ago

set:of($keys) set:contains($set, $key) set:put($set, $key) set:remove($set, $keys) set:keys($set) set:filter($set, $predicate) set:for-each($set, $action) set:union($set1, $set2) set:intersection($set1, $set2) set:difference($set1, $set2) set:contains-all($set1, $set2)

that could be a 3rd party library. It does not have to be put in the spec

dnovatchev commented 1 year ago

set:of($keys) set:contains($set, $key) set:put($set, $key) set:remove($set, $keys) set:keys($set) set:filter($set, $predicate) set:for-each($set, $action) set:union($set1, $set2) set:intersection($set1, $set2) set:difference($set1, $set2) set:contains-all($set1, $set2)

that could be a 3rd party library. It does not have to be put in the spec

In principle, every feature can be implemented in a 3rd party library.

But actually the standard functions are here to provide the most fundamental features.

And the Set datatype is probably the most fundamental feature of Math, Science and will be such in anything XPath-based.

Having the notion of Set in our languages leads to multitude of improvements - not least of them is the opportunity to significantly reduce the size of our specification documents, replacing long and winding paragraphs, telling us that something has "implementation-defined order", by just simply saying that this is a set.

dnovatchev commented 10 months ago

I propose that we introduce sets as a specialization of maps. This is related but orthogonal to issue #119, which attempts to generalize what can be used as a key in a map: any improvements we make to maps will automatically apply to sets.

We introduce new ItemType syntax set(T) which is deemed equivalent to map(T, empty-sequence()). This implicitly defines all the type subsumption rules.

We define a new suite of functions to operate on sets, in a new namespace.
set:of($keys)
set:contains($set, $key)
set:put($set, $key)
set:remove($set, $keys)
set:keys($set)
set:filter($set, $predicate)
set:for-each($set, $action)
set:union($set1, $set2)
set:intersection($set1, $set2)
set:difference($set1, $set2)
set:contains-all($set1, $set2)
map:key-set($map) returns the set of keys in the map.

Because a set is actually a map, all the map functions are available.

Any operation that constructs an instance of map(T, ()) is creating something that can be used as a set, for example map{'a':(), 'b':()}.

Perhaps we provide the syntax set{expr, expr, ...} which is equivalent to map{expr:(), expr:(), ...}. Except that duplicates should be ignored (eliminated) rather than causing an error.

I definitely like this and propose to move forward with it.

I am also thinking that the way forward would be to introduce the hashable boolean qualifier (which we may decide to be true() by default for any atomic type and any other type except functions or collections that contain a function with no hash-value).

For any hashable type we want the user to be able to provide a specific hashing-function. This hashing function will be used to produce the hash-value of any instance of that type - explicitly upon construction, or on demand whenever the hash-value of any instance of that type is needed.

If no hashing-function for a hashable type is explicitly provided, then an implementation-defined hashing algorithm is used to produce the hash-value for any item of this type.

The user may also decide to provide a unique hash-value for any specific function-item.

We need this to be possible to express in XPath, and may want to further discuss the best syntax.

qt4cg / qtspecs

Proposal to introduce the set datatype in XPath 4 #34