qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/
Other
28 stars 15 forks source link

Get the type of a value #148

Open line-o opened 1 year ago

line-o commented 1 year ago

This could be a language construct like type of $a similar to $a instance of <type> or a function fn:type-of(item()*).

It returns the string representation of the type of a value.

type of "", (: yields "xs:string" :)
type of [], (: yields "array(*)" :)
type of [map{"a": true(), "b": false()}, map{"c", false()}], (: yields "array(map(xs:string, xs:boolean))" :)
type of ([], map{}, function () {}]), (: yields "function(*)" :)
type of [map{"a": true(), "b": "false"}, map{"c", false()}] (: yields "array(map(xs:string, xs:anyAtomic))" or "array(map(xs:string, *))" :)

I wrote an implementation in XQuery as a PoC utilizing typeswitch:

https://github.com/line-o/xbow/commit/ca6e593f869c15b1fb372d24653715abfbda5cf8

There is an implementation in baseX inspect:type. inspect:type#2 allowing to set additional options which should be considered.

ChristianGruen commented 1 year ago

Maybe a simple function is sufficient, and we could add the occurrence indicator. In BaseX, we have inspect:type.

line-o commented 1 year ago

I do like the functions the inspection module provides, @ChristianGruen. So yes, a function fn:type($var) or fn:inspect-type($var) would suffice. And it would make sense to also include the options map. That way we could limit the depth of the traversal. I would even propose to standardise fn:inspect-function($function as function(*)) as element(). I did not see a proposal for this yet.

line-o commented 1 year ago

The inspect:type function with the options "mode" and "item" as implemented in baseX are very well thought out. @ChristianGruen what do you think about the "depth"-option in addition to "mode": "computed"?

michaelhkay commented 1 year ago

This has been rejected in the past.

Firstly, there is no such thing as a "most specific type". The type system is a lattice.

See also https://github.com/w3c/qtspecs/issues/14.

For example, what would you return for an empty map? It's an instance of both map(xs:anyURI, element(banana)) and map(xs:duration, array(s:integer)) - is either of these more specific than the other?

Secondly, it allows you to discover details of the implementation that should be hidden. For example, abs() applied to an xs:integer might return an xs:non-negative-integer or perhaps even a saxon:64-bit-unsigned-integer. Users shouldn't be encouraged to go discovering such internal details.

dnovatchev commented 1 year ago

Secondly, it allows you to discover details of the implementation that should be hidden. For example, abs() applied to an xs:integer might return an xs:non-negative-integer or perhaps even a saxon:64-bit-unsigned-integer. Users shouldn't be encouraged to go discovering such internal details.

Then why many languages provide such functions to the users:

For example, what would you return for an empty map? It's an instance of both map(xs:anyURI, element(banana)) and map(xs:duration, array(s:integer)) - is either of these more specific than the other?

If a generic object has no value (is empty) then it can have all possible generic types, which means its type is their intersection (most common super-type), and this will result in the type:

*`map()`**

Returning this is still much more useful than not providing the functionality at all.

Related to this: We do need a type object in XPath. This will give us a proper answer to the question:

What is the right-hand-side (2nd argument) of the operator instance of ?

ChristianGruen commented 1 year ago

@line-o Could you share an example for the proposed depth option?

michaelhkay commented 1 year ago

Firstly, object oriented languages define classes as objects.

Secondly, they have a type system where every object belongs uniquely to one particular class. It's not always predictable, because methods can return a subtype of the declared type, but that's manageable because you can explore the type hierarchy programmatically to get the information you want. Without that ability, discovering that a function like abs() has returned an instance of saxon:fairly-small-positive-integer is useless. Especially if it returns something different the next day.

The XPath type system isn't like that. Parts of it are, for example atomic values are "labelled" with a specific type. But sequences aren't, and maps aren't.

To be honest, the type system in XDM is rather a mess, and it's very easy for proposals like this to make things worse rather than better.

But I'd suggest if you want this then you start by producing a complete, fully articulated proposal that we can critique. Providing use cases showing how the feature can be used to solve real problems would also enable validation that it will actually be usable.

(Saxon, incidentally, does have functions like this: see https://www.saxonica.com/documentation11/index.html#!functions/saxon/type and https://www.saxonica.com/documentation11/index.html#!functions/saxon/type-annotation . The specification of these extensions falls well short of what would be needed in a standard, yet it's already pretty complex, and it would be hard to make it interoperable. The library exposes type information in the form of functions. These days I would be quite inclined to expose it in the form of maps; except that the maps would contain cyclic references to other maps, and cyclic maps are not something that XDM has ever envisaged (JSON serialization, for example, would be non-terminating).

ChristianGruen commented 1 year ago

Secondly, it allows you to discover details of the implementation that should be hidden.

@michaelhkay I can imagine that users may also be tempted to stumble upon implementation details when playing around with instance of. If we wanted to avoid that, we’d probably need to revise functions like fn:abs and enforce unique return types. Instead of saxon:64-bit-unsigned-integer, we should certainly limit the output to types that are defined in the specifications.

But it will certainly take some time to define reasonable string representations for function items. For empty maps, I agree with @dnovatchev that map(*) is the output most users would expect.

line-o commented 1 year ago

Let me just add that I added

stop traversal as early as possible, when the first mismatching type is encountered

so as to return with generic types like map(*), array(*) and function(*) as quickly as possible. But I see the point that it needs to be said explicitly. My naive implementation in XQuery does exactly that.

line-o commented 1 year ago

@ChristianGruen for the depth-option: this controls the for how deep to traverse structures like sequences, arrays, maps and probably also records and sets when they are added.

let $struct := [ [ [ map {}, map{} ] ] ]
return (
  fn:type-of($struct, map{"depth": 1}),
  fn:type-of($struct, map{"depth": 2}),
  fn:type-of($struct)
)
"array(*)", "array(array())", "array(array(array(map(*))))"
michaelhkay commented 1 year ago

I don't understand the "traversal" process. It seems to be assuming navigation of some kind of type hierarchy. Even for nodes, there is no such hierarchy: element(a:b, xs:int) has multiple supertypes including element(*, xs:int) and element(a:b, xs:long). We need a spec, please.

michaelhkay commented 1 year ago

A viable spec would be something like:

For atomic values, return the local name of the most specific built-in (XSD) type of which the value is an instance.

For nodes, return node-kind(name) where node-kind is for example "element" or "attribute" and name (for nodes that have a name) is the node name in Q{uri}local notation.

For maps return map(*), for arrays return array(*)

For other functions return a string representation of the function signature that (a) conforms to XPath SequenceType syntax, and (b) uses EQName notation for qualified names.

line-o commented 1 year ago

I edited the description to reflect the input from @michaelhkay and @ChristianGruen

ChristianGruen commented 1 year ago

For arrays and maps with members and entries the implementation can attempt to find the most specific shared type

Does “can” imply that the behavior of the function is supposed to be implementation-defined?

The inspection should stop as early as possible, when the first mismatching type is encountered.

Could you elaborate what “mismatching type” means in this context?

michaelhkay commented 1 year ago

The phrase "most specific shared type" has no meaning. It suggests that given two types T and U, neither of which is a subtype of the other, and given the set of types S consisting of all those types of which both T and U are subtypes, then there is one type in S that is a subtype of every other type in S. This is simply not true. For example, given T=schema-element(E), and U=schema-element(F), where E and F are both in the substitution group G and also in the substitution group H, then S includes both schema-element(G) and schema-element(H), neither of which is necessarily a subtype of the other.

The word "inspection" seems to be referring to a process or algorithm that can "stop". This algorithm is nowhere described.

line-o commented 1 year ago

I will try to describe the algorithm that I implemented in xbow:get-type#1 (see link in description).

michaelhkay commented 1 year ago

I would suggest breaking this up into a number of useful simpler functions.

fn:node-kind($node as node()) returns one of the strings "document", "element", "attribute", "text", "comment", "processing-instruction", or "namespace".

fn:type-annotation($item as item()) returns information about the type annotation of an atomic value or node (see XDM for definition) in the form of a map. (If $item is not an atomic value or a node it returns an empty sequence.) The map includes entries:

-- name - a QName, the name of the type, present if the type is not anonymous

-- base-type() - a function returning the same information for the base type

-- variety - "simple" or "complex"

etc.

fn:distinct-types($input as item()*) - return the result of applying fn:type-annotation(.) to each item in $input, with redundant entries removed. An entry is redundant if there is another entry for the same type or for a supertype.

dnovatchev commented 1 year ago

Just FYI: Here is how FXSL's function f:type , implemented almost 20 years ago, gives us what is actually the most derived type of an instance.

This function was very useful and instrumental at least withing FXSL.

michaelhkay commented 1 year ago

Yes, getting the most specific built-in atomic type for a supplied atomic value is straightforward. The only complication is that the result might not be interoperable, for example 2+2 is required to return an xs:integer, but it is permitted to return an xs:int or xs:positiveInteger. This creates the danger that people will write type(2+2) = QName('xs:integer') and wrongly assume this is always the same as saying (2+2) instance of xs:integer.

[Saxon rarely takes advantage of this freedom. But there are occasions when it does so, for example abs($x) returns $x unchanged if it is non-negative.]

dnovatchev commented 1 year ago

@michaelhkay

2+2 is required to return an xs:integer, but it is permitted to return an xs:int or xs:positiveInteger

This statement seems to be logically inconsistent. If this is part of any spec document, this is clearly a bug in that spec.

If some specific result is r e q u i r e d to be returned, then how can something else be returned and not cause an error?

michaelhkay commented 1 year ago

If it returns an xs:int then it satisfies the requirement to return anxs:integer, because every xs:int is an xs:integer.

dnovatchev commented 1 year ago

If it returns an xs:int then it satisfies the requirement to return anxs:integer, because every xs:int is an xs:integer.

Then the stated requirement must be: "xs:integer or any sub-type of it", or in other words: "any node of the subtree of types, rooted in xs:integer"

michaelhkay commented 1 year ago

The actual language of the spec (F+O §1.4) is "The return-type, also in italics, specifies the static type of the value returned by the function. The dynamic type of the value returned by the function is the same as its static type or derived from the static type."

I'm afraid the language of this sentence falls into the common trap of suggesting that we can speak of "the dynamic type" of a value, which is inaccurate. We also shouldn't talk of one type being derived from another, we should use the "is subtype of" relation defined in XPath §3.7.

If the required type is xs:integer+, then a result of (xs:int(2), xs:long(3), xs:positiveInteger(5)) matches this required type, but there is nothing that tells us what "the dynamic type" of this sequence is. I've been working on correcting some of these references. We should never say "the [dynamic] type of the value V is T", we should always say "V is an instance of [or 'matches'] T".

michaelhkay commented 1 year ago

I've been thinking about this and wondering if (at least for items rather than values) we could find a way of defining that every item belongs to exactly one "essential" type plus any number of "accidental" types [1].

For atomic values it's easy enough: the essential type is the one identified by the type annotation, the accidental types are its supertypes in the atomic type hierarchy and all union types that include any of these as a member type.

For nodes, perhaps the essential type is the combination of node kind, name, and type annotation, for example element(N, T). But perhaps the nilled property comes into it as well. Certainly membership of types schema-element(E) or schema-attribute(A) could be considered accidental. Note: it's not the case that the accidental types are always supertypes of the essential type.

For function items, the essential type is defined by the function signature. Function annotations presumably are accidental.

For arrays and maps, I think the essential type is simply array(*) or map(*). Anything more specific, such as array(union(xs:int, xs:string)) is accidental.

This approach might lead us to a position where we can indeed say that every item has an unambiguously-defined "essential type" which we could regard as a property of the item. The question then becomes, if we want to return the "essential type" as the result of a function, how to we represent types as XDM values?

We do in fact in Saxon have a map-based XDM representation which can faithfully represent every SequenceType. It wasn't designed for public consumption, but we could use that as a starting point. However, it's a rather shallow representation in that it (like the SequenceType syntax) references schema types by name, and it's not clear how useful that is unless you can interrogate the properties of schema types - which would mean having an XDM representation of the entire Schema Component Model.

[1] https://plato.stanford.edu/entries/essential-accidental/

dnovatchev commented 1 year ago

Perhaps looking at the C# Type class could be helpful.

Not saying we must copy verbatim, but there are many useful things that might give us good ideas, like:

dnovatchev commented 11 months ago

There was a recent discussion on the Xml.com chat and @line-o mentioned this issue.

The fact is that users continue to express their need for a type object (as this slack discussion showed) and we need to do something to respond to those users' needs.

Things have also changed since the date when this issue was initially submitted. For example, we have now the record type, and record-type tests in the language. And a record can be self-referential, thus it seems at least some of the difficulties cited above by @michaelhkay have been overcome.

As a user I need to be able to refer to a type of a value, like type-of(someXdmValue).

We need finally to be able to have: $something instance of $someType and we also need to be able to define a variable as:

$myVar as type-of($someObject)

Isn't it high time to stop picking only the low-hanging fruit for our users, while leaving the real treasures perhaps as "a bunch of sour grapes that are not worth gaping for" ?

https://read.gov/aesop/media/the-fox-and-the-grapes_Resources/the-fox-and-the-grapes.mp4

michaelhkay commented 11 months ago

Fundamental changes to the type system are going to be extremely hard to achieve. Saying that users need something is easy; defining something that meets that need (while retaining compatibility) is much harder.

dnovatchev commented 11 months ago

Fundamental changes to the type system are going to be extremely hard to achieve. Saying that users need something is easy; defining something that meets that need (while retaining compatibility) is much harder.

Maybe a series of small steps would be more realistic?

This is just a brief sketch of the process of trying to solve the problem.

Any unsurmountable difficulties?

michaelhkay commented 11 months ago

Well, there are several difficulties there.

Given <xsl:variable name="x" as="xs:anyAtomicType" select="42"/> I would expect type-of($X) to be the same as type-of(42) (by basic substitutability principles).

It's a fallacy to speak of "the common base type". The type system is not a hierarchy (in fact, it's not even a lattice). We spent a lot of time eliminating this bug, which appeared in a number of places in the 2.0 specs.

dnovatchev commented 11 months ago

Well, there are several difficulties there.

Given <xsl:variable name="x" as="xs:anyAtomicType" select="42"/> I would expect type-of($X) to be the same as type-of(42) (by basic substitutability principles).

It's a fallacy to speak of "the common base type". The type system is not a hierarchy (in fact, it's not even a lattice). We spent a lot of time eliminating this bug, which appeared in a number of places in the 2.0 specs.

We probably can agree and establish a main type in such cases, for example that the main type of 42 is xs:integer.

This gives us at least a specific type to work with, and is better than not having any type at all,

dnovatchev commented 11 months ago

Well, there are several difficulties there.

Given <xsl:variable name="x" as="xs:anyAtomicType" select="42"/> I would expect type-of($X) to be the same as type-of(42) (by basic substitutability principles).

It's a fallacy to speak of "the common base type". The type system is not a hierarchy (in fact, it's not even a lattice). We spent a lot of time eliminating this bug, which appeared in a number of places in the 2.0 specs.

This was not a problem in C#. Given this code snippet:

var nnnn = 4;

var t = nnnn.GetType();

How does the C# compiler decide that the type of nnnn is int and not one of the other 7 types that 4 is instance of?

image

Maybe we could learn from the solution, and try to do the same here?

In particular they are using this specific rule:

The type of an integer literal is determined by its suffix as follows:

michaelhkay commented 11 months ago

The type system of C#, like other object-oriented languages, is completely different from that of XDM.

rhdunn commented 11 months ago

@dnovatchev Given a C# statement like:

var x = 42

the C# language assigns that variable a type of int. The equivalent in XPath is:

let $x := 42

where 42 has the associated type of xs:integer due to the https://www.w3.org/TR/xpath-31/#id-literals section.

Where C# and XPath differ is that in C# the variable has a specific type, such that:

x is Integer // true
x is Short // false

However, in XPath, the instance of rules use subtype matching:

$x instance of xs:integer (: true :)
$x instance of xs:short (: true -- $x is in the range of the short value space :)

This means that if you have $y with a value of 32768 then:

$y instance of xs:integer (: true :)
$y instance of xs:short (: false -- $y is larger than the max short value :)

This is similar to the duck typing approach in Python, where if a type has the correct methods, etc. then passing it to a function will work.

XPath has well-defined instance of (subtype) and cast (casting) rules for working with variables and values, but those operate on the value space of the instance not the underlying representation of that value. That is what makes these rules complex.

When a variable has a declared type, you need to think about the value space that type represents, not the type itself. Given an xs:short value, you don't know if the value is an xs:byte (i.e. if it is in the range -127 to 128), but you do know that it is an xs:long (as the value space of xs:short is contained in the value space of xs:long).

dnovatchev commented 11 months ago

The type system of C#, like other object-oriented languages, is completely different from that of XDM.

Seems that what I wanted to say was not understood. To quote the C# rules for determining the type of an integer literal:

So, there is a convention about which type to assign to the variable, out of many possible types. Thus, although 42 is small enough to be of type int16 or even byte, these are not even considered in making the type inference.

When we say that an XPath literal value can be assigned many types, can't we follow the same solution that the designers of C# have employed? Thus we will order the desired types in a predefined sequence, and choose the first type of this sequence, of which the literal value is an instance of.

See also the C# language Specification

rhdunn commented 11 months ago

https://qt4cg.org/specifications/xquery-40/xpath-40.html#id-literals does what C# does:

The value of a numeric literal containing no "." and no e or E character is an atomic value of type xs:integer.

Therefore, for let $x := 42 the variable x has the type xs:integer by those rules like var x = 42 has the type int from C#'s literal value rules. Similar for other numeric and string constants.

Where C# and XPath differ is in the subtyping and casting rules. It's the subtyping rules which create the type lattice/graph. XPath works similar to C/C++'s integer promotion and automatic type conversion.

The following sections are useful for understanding types in XPath:

  1. https://qt4cg.org/specifications/xpath-datamodel-40/Overview.html#types-hierarchy -- note that xs:integer is both an xs:anyType and an item(), and xs:IDS is both an xs:IDS and xs:ID*
  2. https://qt4cg.org/specifications/xquery-40/xpath-40.html#id-sequencetype-matching -- the derives-from function for atomic type matching
  3. https://qt4cg.org/specifications/xquery-40/xpath-40.html#id-sequencetype-subtype -- the subtype rules used by instance of, etc.
  4. https://qt4cg.org/specifications/xpath-functions-40/Overview.html#casting -- the rules for casting from one type to another
  5. https://qt4cg.org/specifications/xquery-40/xpath-40.html#id-coercion-rules -- the rules for converting from a given value to a type, e.g. when passed to a function parameter with a specified type

Any proposal needs to work with the above rules and specification language.

dnovatchev commented 11 months ago

https://qt4cg.org/specifications/xquery-40/xpath-40.html#id-literals does what C# does:

The value of a numeric literal containing no "." and no e or E character is an atomic value of type xs:integer.

Therefore, for let $x := 42 the variable x has the type xs:integer by those rules like var x = 42 has the type int from C#'s literal value rules. Similar for other numeric and string constants.

Where C# and XPath differ is in the subtyping and casting rules. It's the subtyping rules which create the type lattice/graph. XPath works similar to C/C++'s integer promotion and automatic type conversion.

@rhdunn I am confused. Are you saying that there is no problem in XPath to give a specific type to any value? And that the difficulties are only in the subtyping rules?

rhdunn commented 11 months ago

I was answering your comment around the link and C# spec text you referenced, saying that XPath already has the language in the specification you are asking for.

There is static type of a variable (as specified in the static context), and a dynamic type (as specified in the dynamic context).

The dynamic type definition states:

Every value matches one or more sequence types. A value is said to have a dynamic type T if it matches (or is an instance of) the sequence type T.

Because of that, a variable can have multiple dynamic types. Because a type-of($x) would work on the dynamic type of the variable/value, that would not return a single type. -- This is the issue that Mike is trying to explain.

michaelhkay commented 11 months ago

We already determine the type of a literal value from its lexical form, for example 1 is an xs:integer, 1.0 is an xs:decimal, 1e6 is an xs:double. That's not the issue here. The issue is that (unlike C# or other OO languages) there are many values, of which the most obvious example is an empty map, that do not have a unique "most specific" type.

dnovatchev commented 11 months ago

We already determine the type of a literal value from its lexical form, for example 1 is an xs:integer, 1.0 is an xs:decimal, 1e6 is an xs:double. That's not the issue here. The issue is that (unlike C# or other OO languages) there are many values, of which the most obvious example is an empty map, that do not have a unique "most specific" type.

We already have discussed the nothing() type that is suitable for things like empty collections.

And for any other value that has N possible types, the solution that works for C# will also be applicable here: Put all such types in an ordered structure and select the first type from that ordered structure (when traversed following its order) that happens to be a matching type for the value.

For every type we have a corresponding constructor function, and using it we also solve another problem: create a new value that has the same type as a given, specific value.

One could argue that the ordering of the types would be rather arbitrary and in any case not well justified, but choosing a seemingly good ordering would be a step in the right direction.

michaelhkay commented 11 months ago

Put all such types in an ordered structure

We're going round in circles. The type system is a directed acyclic graph. There is no ordering.

dnovatchev commented 11 months ago

Put all such types in an ordered structure

We're going round in circles. The type system is a directed acyclic graph. There is no ordering.

I meant: impose an ordering of our own choosing.

And here is what Wikipedia says about this:

"_A topological ordering is possible if and only if the graph has no directed cycles, that is, if it is a directed acyclic graph (DAG). Any DAG has at least one topological ordering, and algorithms are known for constructing a topological ordering of any DAG in linear time_."

michaelhkay commented 11 months ago

A DAG has many topological orderings, and the one you will get is not in general "one of our own choosing", in general it is arbitrary. A type-of() function that returns one of the types an item belongs to, selected arbitrarily, is not going to be useful to anyone.

As for performance, the time to compute an ordering of the entire type system is irrelevant when there's an infinite number of types involved. That's not actually what's needed here; what's needed is an algorithm that selects a type T that an item belongs to, such that no other type U that the item belongs to is a subtype of T.

In practice, the case that's most difficult arises with functions. Consider an array of functions with the same arity but different signatures. What should be the type of such an array? The easy pragmatic solution is to return something like array(function(*)). But do you also do that if the functions all have the same signature?

dnovatchev commented 11 months ago

A DAG has many topological orderings, and the one you will get is not in general "one of our own choosing", in general it is arbitrary. A type-of() function that returns one of the types an item belongs to, selected arbitrarily, is not going to be useful to anyone.

Yes, and maybe we could examine these, possibly many, orderings and come up with one "preferred ordering"?

If no one has attempted to do this until now, it is not justified just to claim in advance that "trying to do this is meaningless and doomed to failure from the very start".

ChristianGruen commented 11 months ago

I think there are two different (though related) goals in this discussion:

  1. Provide occasional users with an easy way to get a string representation for the most common types (similar: typeof in JavaScript, getClass().getName() in Java).
  2. Create a result that does full justice to the sophisticated XDM type system, and for which the actual data may need to be analyzed in more depth.

Objective 1 seems easy enough, good answers have already been given in the initial thread and in https://github.com/qt4cg/qtspecs/issues/148#issuecomment-1439271691. I think that most people would already be happy to get xs:integer, element(), map(*) and function(*), etc. or the corresponding super types if we support sequences (xs:anyAtomicType, node(), item(), etc.).

Objective 2 is much more difficult, and the generated result may not necessarily be helpful to everyone, as it would require everyone to fully understand the implications of the type system (who wants that in practice…). Maybe we can start with 1) and provide additional options in a separate step.