qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/

Simulating Objects: Performance #961

Closed ChristianGruen closed 8 months ago

ChristianGruen commented 10 months ago

Related to #953, #917 and #916, I wonder whether we are aware enough of the essential differences when we think of objects in a functional language:

This thread is not about premature optimization; I just want to be sure we think about the obstacles when using maps for objects. Maybe the solutions are already on the horizon; maybe we could tackle some of the concerns with the definition of default values…

declare record person(
  name   as xs:string,
  title  := (),
  full   := fn { string-join((?title, ?name), ' ') }
);

…and maps with type annotations. If we don’t materialize defaults, the embedded annotation would indeed need to affect functions like map:get, as questioned by Michael in https://github.com/qt4cg/qtspecs/pull/953#issuecomment-1896078605.

michaelhkay commented 10 months ago

Immutable data structures need to be fully copied if a single value changes.

No, that's not true. Immutable/persistent maps and lists are slower than the mutable variety, but they still allow updates (i.e. creation of new versions) in constant or logarithmic time, not in time proportional to the size of the structure. (In SaxonJ we use Michael Froh's ImmutableHashTrieMap; in SaxonCS we use the System.Collections.Immutable.Dictionary class that comes with the .NET platform.)
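
(For illustration, a minimal XQuery 3.1 sketch of the user-visible side of this: map:put returns a new map and leaves the original untouched, so a persistent implementation can share all unchanged entries between the two versions instead of copying them.)

let $m1 := map { 'name': 'Angelopoulos', 'title': 'Dr.' }
let $m2 := map:put($m1, 'title', 'Prof.')
return (
  $m1?title,  (: 'Dr.', the original map is unchanged :)
  $m2?title   (: 'Prof.', the "update" produced a new version :)
)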

It's also possible that one could define a custom map structure that puts functions in one substructure and other fields in another, in the expectation that the functions are unlikely to be modified.

Generally, I think most specification decisions that tried to take performance into account have proved to be a mistake. If the specification is clean, implementors will usually find a way to optimize it. (But there are exceptions of course. I've tried very hard to implement "virtual" copying of XDM nodes without physically copying the whole tree, and except in a few special cases I haven't been very successful. But copying of maps and arrays, where we don't have the problem of node identity and parentage, is much easier and more efficient.)

ChristianGruen commented 10 months ago

No, that's not true. immutable/persistent maps and lists are slower than the mutable variety, but they still allow updates (i.e. creation of new versions) in constant or logarithmic time, not in time proportional to the size of the structure.

Sorry for being inaccurate. With the unfortunate “fully copied”, I wanted to point out that a new reference is created and additional memory is required every time an update takes place, even if the data structure needn’t be duplicated. – We use our own HAMT implementation (which may be similar to Michael Froh’s implementation). It’s impressively efficient (also thanks to the excellent job of garbage collectors), but of course it cannot compete with low-level updates of fixed memory slots.

If the specification is clean, implementors will usually find a way to optimize it. […]

My experience is that the variety of feasible optimizations depends a lot on the specification: Even though it’s probably fair to say that the XQuery 1.0 specification is a clean one, implementors and users had to wait until version 3.1 to be able to reach a new level of performance with maps. With FLWOR expressions (due to their flexibility), it’s extremely easy to write O(n²) queries that will be slow with any available processor, and it will be up to the user to speed them up, e.g. by creating auxiliary data structures like maps to achieve a quasilinear runtime.
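
(For illustration, a hypothetical join, assuming $orders and $customers are sequences of maps; local:join-naive and local:join-indexed are made-up names. The first version rescans all customers for every order, the second builds an auxiliary map once and then performs cheap lookups.)

declare variable $customers external;  (: assumed: maps with an 'id' entry :)
declare variable $orders    external;  (: assumed: maps with a 'customer-id' entry :)

(: quadratic: the predicate scans all customers for every order :)
declare function local:join-naive() {
  for $order in $orders
  return $customers[?id = $order?customer-id]
};

(: quasi-linear: build an auxiliary map once, then look customers up directly :)
declare function local:join-indexed() {
  let $by-id := map:merge($customers ! map:entry(?id, .))
  for $order in $orders
  return $by-id($order?customer-id)
};

local:join-indexed()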

In other words, the question may be when we want to call our specification clean. I think there are still a few steps to be taken before we reach that stage, and the most important step might be to improve type safety.

The other general question may be how fast we actually want to be or get. X languages won’t be the first choice anyway if performance is a top priority, so perhaps we shouldn’t be too ambitious, and be satisfied with the assumption that we’ll probably be “fast enough” (after I had written machine code, Java felt like a slow duck to me, but obviously no XQuery implementation exists today that outperforms Java… and so on).

dnovatchev commented 10 months ago
  • the update of a map with, let’s say, 1 string and 50 functions would be a new map with 1 string and 50 functions. Even with efficient immutable map implementations that we have, I doubt that it makes sense to create full copies with 1+50 entries, of which only 1 string will be different.

Not true!

The functions need not be copied. Also the unchanged data (non-function) entries don't need to be copied with a proper implementation. An insert in a tree can be done by sharing the common sub-trees. For example, see Chris Okasaki's "Purely Functional Data Structures".
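
(For illustration, a hypothetical XQuery sketch of such path copying: insertion into a binary search tree encoded as maps. Only the nodes on the path from the root to the insertion point are rebuilt; every untouched subtree is reused by reference. local:insert is made up for this example.)

declare function local:insert($tree, $key) {
  if (empty($tree)) then map { 'key': $key, 'left': (), 'right': () }
  else if ($key lt $tree?key) then map:put($tree, 'left',  local:insert($tree?left,  $key))
  else if ($key gt $tree?key) then map:put($tree, 'right', local:insert($tree?right, $key))
  else $tree  (: key already present: nothing to rebuild :)
};

(: builds a four-node tree; each insertion shares the untouched subtrees of the previous version :)
fold-left((4, 2, 6, 3), (), local:insert#2)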


ChristianGruen commented 10 months ago

Not true!

I had hoped my last answer was clear enough.

dnovatchev commented 10 months ago

As a developer, if I am creating records/maps with functions that can access the map via their first, implicit argument, I will never set a goal that any such thing (still hesitating to call it "object") will have methods to "update itself".

Of course, creating a new object from an old object is an expensive operation in any language, and thus the need for such an operation must be well-justified.

There are various ways to do this efficiently - functions can be overridden, and the whole original object can be kept intact, with the new object referencing the original object and calling the old object's functions with a dispatch via this reference.

benibela commented 10 months ago

The functions need not be copied. Also the unchanged data (non-function) entries don't need to be copied with a proper implementation. An insert in a tree can be done by sharing the common sub-trees. For example, see Chris Okasaki's "Purely Functional Data Structures".

That still needs O(log n) to do what is O(1) in a classical array

Btw, I have implemented my own HAMT for this. One of the most complicated things I have implemented. In benchmarks the HAMT was faster than most mutable hashmaps. But I do not use it. Since it is so complicated, I am not sure it is bug-free.

dnovatchev commented 10 months ago

The functions need not be copied. Also the unchanged data (non-function) entries don't need to be copied with a proper implementation. An insert in a tree can be done by sharing the common sub-trees. For example, see Chris Okasaki's "Purely Functional Data Structures".

That still needs O(log n) to do what is O(1) in a classical array

Btw, I have implemented my own HAMT for this. One of the most complicated things I have implemented. In benchmarks the HAMT was faster than most mutable hashmaps. But I do not use it. Since it is so complicated, I am not sure it is bug-free.

O(log n) is absolutely acceptable - 20 operations instead of 1 000 000 operations - for n == 2^20 (roughly a million entries)

MarkNicholls commented 10 months ago

As a bit of context from a language that has embraced immutable records.

In F# (where 'records' are immutable by default) I think performance is rarely a problem versus a C# mutable version of the same thing.

There are specific scenarios where the lack of mutability IS an issue, but because of complexity/syntax rather than performance: e.g. updating a node in a tree in a mutable world is trivial, but in an immutable one you have to resort to a "zipper", which in terms of performance is comparable but in terms of coding is a significant extra complication.

What is less rare is the everyday syntax headache: if you want to update a child of a child of a child of something in a mutable world, you go (apologies for the foreign code, it should be self-explanatory):

type Town = { mutable name: string }
type Road = { name: string; town: Town }
type Address = { number: int; road: Road }

let myAddress = { number = 10; road = { name = "Downing St"; town = { name = "London" } }}

// ah I've moved to the same address in Edinburgh!...how convenient
myAddress.road.town.name <- "Edinburgh"

easy.

In an immutable world it's much more clunky, because you have to effectively recreate a whole new value from the bottom up.

// ew how inconvenient.
let myNewAddress = { myAddress with road = { myAddress.road with town = { name = "Edinburgh" }}}

There are esoteric mechanisms to get rid of this, but, like "zippers" for trees or all the other FP magic in Okasaki, the issue isn't performance but the complexity of the code and the required skills of the developer.

F# sidesteps the issue by allowing mutable records; that MAY be worth considering, but I imagine the consequences will leak.

i.e. to me this (@benibela)

"Since it is so complicated, I am not sure it is bug-free."

is more of an issue than performance (though I accept that the code that's being referenced was created for performance reasons).

michaelhkay commented 10 months ago

I've been working on how to move forward with deep update, see issue #832.

I think it needs custom syntax, something like

update $myaddress replace ??town with "Edinburgh"

<xsl:update select="$myaddress" replace="??town" with="'Edinburgh'"/>

Using a zipper "under the hood" to implement it. I think it can be made easy enough for users to use; the complex part is defining the semantics of how it actually works.

We've had an implementation in Saxon for some time as an XSLT extension instruction - see https://www.saxonica.com/documentation12/index.html#!extensions/instructions/deep-update - I don't think it's widely known or used, which may be partly because the documentation is fairly horrible.

(At the implementation level, "recreating a whole new value from the bottom up" wasn't difficult for a single update; the main challenge was merging the effects of multiple updates, which is extremely difficult to achieve at the user level, unless you take care to change one thing at a time.)
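
(For comparison, a purely functional XQuery sketch of the same kind of single deep update, rebuilding only the maps on the path to the changed entry. local:deep-put and the inlined $myaddress map are illustrative only, not proposed syntax.)

declare function local:deep-put($data as map(*), $path as xs:string+, $value) as map(*) {
  if (count($path) eq 1)
  then map:put($data, $path[1], $value)
  else map:put($data, $path[1], local:deep-put($data($path[1]), tail($path), $value))
};

let $myaddress := map {
  'number': 10,
  'road': map { 'name': 'Downing St', 'town': map { 'name': 'London' } }
}
return local:deep-put($myaddress, ('road', 'town', 'name'), 'Edinburgh')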

MarkNicholls commented 10 months ago

I've been working on how to move forward with deep update, see issue #832.

I think it needs custom syntax, something like

update $myaddress replace ??town with "Edinburgh"

<xsl:update select="$myaddress" replace="??town" with="'Edinburgh'"/>

So how would updating a child of a child look, i.e. the equivalent of this:

let myNewAddress = { myAddress with road = { myAddress.road with town = { name = "Edinburgh" }}}

Is it basically the same grammar but in XSLT?

i.e. can it do nested updates?

(Not trying to nag; I just think the syntax ideally needs to support nesting the update, even though, as I've pointed out, this nested syntax is no utopia.)

michaelhkay commented 10 months ago

I'm not sure what your data looks like, but I think it's just

<xsl:variable name="myNewAddress" as="map(*)">
   <xsl:update select="$myaddress" replace="?road?town?name" with="'Edinburgh'"/>
</xsl:variable>

MarkNicholls commented 10 months ago

I'm not sure what your data looks like, but I think it's just

<xsl:variable name="myNewAddress" as="map(*)">
   <xsl:update select="$myaddress" replace="?road?town?name" with="'Edinburgh'"/>
</xsl:variable>

Nice.... I don't pretend to understand it, except in a very vague way. I was expecting something more recursive, so I assume you effectively write a 'path' to the thing you want to change (which the zipper follows), so in an OO language this would be (i.e. as per my mutable example):

myAddress.road.town.name <- "Edinburgh"

How would you have "multiple updates"? Isn't it pretty much defined to be a path to an atomic value (unlike the F# record syntax, where you can update multiple things, which is more of a tree-like syntax, which I think is the reason it's so clunky)?

ChristianGruen commented 10 months ago

That still needs O(log n) to do what is O(1) in a classical array

@benibela I agree, the difference can matter; it always depends on the use case and how often you perform an operation. If log n amounts to 20 steps, it still means that a program can be 20 times slower than one that provides O(1).

And @MarkNicholls Thanks for repeatedly bringing up the concepts of F# in this and in older threads. I’d love to use it myself (haven’t had the chance yet), as I think we can learn a lot from it. In a way, it has more parallels to our X languages than other classical functional languages like Haskell.

For arrays and sequences, we use a Finger Tree (in short, a tree-based generalization of the zipper concept), which makes updates (inserts: ($seq, 1), deletions: fn:remove, etc.) and lookups very cheap, compared to trivial data structures. Out of interest: When you append a value to an array, does F# use zippers internally, or do you have to spend more time and effort by yourself to enable them?

Unlike F#, we have stuck with a purely immutable approach and avoided mutable data structures; I think that was the right choice so far, and allowing mutation would be a very fundamental change.

I like to focus on performance, as it repeatedly happens that developers, for good reasons, decide against XQuery 3.1/XSLT 3.0 because of performance concerns. Hardware is still improving, but even now, due to the inherent requirements of the language, a map with 1 million integer keys (and a constant value) takes about 80 MB of main memory with XPath, whereas it can be realized with about 20 MB with Java (if integers are stored as primitives), and it takes around 20 times longer to build (both, more or less, with SaxonJ and BaseX).
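
(Roughly the kind of map being measured; a minimal sketch, and the exact figures will of course vary by processor and settings.)

(: one million integer keys, each mapped to the same constant value :)
let $big := map:merge((1 to 1000000) ! map:entry(., true()))
return map:size($big)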

But again, the question is what we together regard as fast enough and as acceptable. For XML input, (I hope) our languages will always be the natural choice, and usually it doesn’t matter if a custom and low-level solution might be 10 times faster in another language.

ChristianGruen commented 10 months ago

Nice.... I don't pretend to understand it, except in a very vague way. I was expecting something more recursive, so I assume you effectively write a 'path' to the thing you want to change (which the zipper follows), so in an OO language this would be (i.e. as per my mutable example):

myAddress.road.town.name <- "Edinburgh"

For XML data, XQuery Update can be used. The syntax is verbose but intuitive:

replace node $myAddress/road/town/name/text() with 'Edinburgh',
insert  node <name>Johnstown</name> into $myAddress/road/town,
delete  node $myAddress/road/town,
...

michaelhkay commented 10 months ago

how would you have "multiple updates"

For a multiple update consider

<xsl:variable name="myNewAddress" as="map(*)">
   <xsl:update select="$myaddress" replace="?road?*?name" with="'Edinburgh'"/>
</xsl:variable>

I assume you effectively write a 'path' to the thing you want to change

We effectively evaluate ?road?*?name in "tracking mode" where we not only find the items identified by the expression, but keep track of their location in the tree, so that we can reconstruct the new tree by working our way back up. (I invented this mechanism without being aware of "zippers", so the terminology is far weirder than it needs to be.)

michaelhkay commented 10 months ago

As regards performance, we have so far done no serious work on map/array performance optimisation. Our performance work is nearly all driven by customer workloads, and until now that has always meant XML workloads rather than JSON.

MarkNicholls commented 10 months ago

@ChristianGruen

Ralf Hinze, the co-author/inventor of the finger tree, taught me most of what I know about functional programming.

For me there is a sweet spot for XML processing, else I wouldn't haunt you here.

If you have time, see https://fsprojects.github.io/FSharp.Data/library/XmlProvider.html. For me that is my alternative XML tooling; it's significantly quicker, but this isn't really an issue for me.

MarkNicholls commented 10 months ago

how would you have "multiple updates"

For a multiple update consider

<xsl:variable name="myNewAddress" as="map(*)">
   <xsl:update select="$myaddress" replace="?road?*?name" with="'Edinburgh'"/>
</xsl:variable>

I assume you effectively write a 'path' to the thing you want to change

We effectively evaluate ?road?*?name in "tracking mode" where we not only find the items identified by the expression, but keep track of their location in the tree, so that we can reconstruct the new tree by working our way back up. (I invented this mechanism without being aware of "zippers", so the terminology is far weirder than it needs to be.)

@michaelhkay

Actually the cogs have turned in my head about this.

This is quite a change of tack from XSLT?

Normally things like this would be implemented (by an end-user developer) using the recursive pattern etc.? (Which, actually, for me is a USP of XSLT.)

So does this new syntax work for 'xml' too?

And if so, then the thought is: why would I use one and not the other?

(And why wouldn't you use this mechanism in your Saxon code under the hood, so that you don't have to resort to zippers? I would imagine passing the replace path recursively down the recursive template; I think it would work quite elegantly out of the box.)

(As a side issue, effectively objects are implicitly lazy fixed points, and this opens the door to recursive (unbounded) data types, so there is also the issue of never-ending recursion in general, though I think that's a 'buyer beware' issue.)

michaelhkay commented 10 months ago

This is quite a change of tack from XSLT?

Yes. The traditional way to make a small change to an XML tree is with the recursive shallow copy design pattern. This can also be used with JSON trees - we've added a number of changes in 4.0 to make it easier, such as on-no-match="shallow-copy-all".

In XML, because of the existence of node identity and parentage, it's very hard to implement small changes to large trees (like adding one attribute) without physically copying all the data that hasn't changed. With trees derived from JSON, which don't have node identity and parentage, this should be possible. But the recursive-shallow-copy pattern doesn't lend itself (easily) to this kind of update operation. The introduction of an xsl:update instruction therefore has two objectives: to allow simple changes to be expressed in one instruction rather than using the more verbose apply-templates approach; and to make those changes a lot more efficient.

Providing the same syntax for XML trees is certainly on the cards. Whether it can be implemented so that small changes take constant time is an open question. We do have a "virtual tree" mechanism in Saxon that might be extended to some such cases, for example changes that only involve adding, removing, or replacing attributes. Essentially this is doing lazy evaluation of tree updates.

Another option worth mentioning for completeness, but probably too disruptive to contemplate, is to support a tree representation of XML without node identity and parent pointers. I have experimented with using such tree representations transiently during tree construction, with the parent pointers being added on first retrieval access to the tree.

MarkNicholls commented 10 months ago

@michaelhkay

the cogs are turning

(At the implementation level, "recreating a whole new value from the bottom up" wasn't difficult for a single update; the main challenge was merging the effects of multiple updates, which is extremely difficult to achieve at the user level, unless you take care to change one thing at a time.)

I think if you explicitly used a zipper, this would come out in the wash.

You would recursively traverse the tree, moving the 'cursor' on the zipper to the current location until you find matching nodes; you amend them and move the cursor back on return. When you return to the root, you should have a multi-updated tree.

It's virtually the canonical use case.

MarkNicholls commented 10 months ago

@ChristianGruen

Out of interest: When you append a value to an array, does F# use zippers internally, or do you have to spend more time and effort by yourself to enable them?

So, as mentioned, in F# you are encouraged to use persistent data structures (in this sense: https://en.wikipedia.org/wiki/Persistent_data_structure), rather than be completely 'pure'.

I'd say "Array" isn't a very common data type in F#, because arrays are quite imperative; or at least, "updating" an array isn't common. You can go

xs[23] <- "new value"

and assign new values into an existing array, and it will do it in exactly the same way as C# imperative code; this isn't a persistent operation.

If you want to append:

let newArray = Array.append [| 1 |] [| 2 |]

This is persistent.

I would say the most common 'sequential' data type that you 'update' would be list, which works in the same way as the textbook (e.g. like Haskell) list.

(In my code) in FP, you strangely rarely want to "assign" a value at a fixed position in a list... not sure why; it's like hitting a screw with a hammer.

Saying that, there are data types that are part of F# (and not .NET or some C# lib) that may be better comparisons, e.g.

F# sets and maps are implemented as immutable AVL trees

(I don't pretend to know what an AVL tree is)

The record syntax is part of the F# syntax, but as data types in F# are .NET (C#) classes, I imagine an object is cloned and imperatively changed: an impure bubble that evaporates, as no one can see inside.

More esoteric stuff, like trees and zippers, you would tend to roll yourself or, if you can, use 3rd-party libraries,

e.g. https://fsprojects.github.io/FSharpx.Collections/reference/fsharpx-collections-experimental.html

A lot of F# code will access C# libraries (be that Microsoft or 3rd parties), which almost always use ephemeral data types.

But I think comparing XPath etc. to F# at this level is probably not helpful; X* (in my head) works at a higher level of abstraction. If I'm transforming some XML or JSON in F#, OK, it may work 10x quicker, but there are times where I have to manually optimise with "maps" (dictionaries) much more than I would do in XSLT, and the underlying code libraries are C# ones, so updating an element in an XML document is imperative (usually), or I would clone the whole structure in a manner very similar to the XSLT shallow-copy pattern (thus my comment to MK about the need for a zipper at all).

ChristianGruen commented 10 months ago

@MarkNicholls Thanks for the interesting insights.

I would say the most common 'sequential' data type that you 'update' would be list, which works in the same way as the textbook (e.g. like Haskell) list.

I see. The (immutable) arrays in XPath are actually more like lists in Haskell/F#, as they have a dynamic size and usually don’t provide lookups in constant time (except maybe in trivial implementations). The most typical array updates are probably appends and prepends. I was not actively involved in the specification of version 3.1, so I’m not sure why functions like array:put were added to the spec, as there are no such functions for sequences (presumably exactly because the operations of fixed-size arrays served as inspiration).

But I think comparing XPath etc. to F# at this level is probably not helpful; X* (in my head) works at a higher level of abstraction.

It may be more helpful to compare XQuery (and certainly XSLT, which is not my area of expertise) to F#. If all the application/business logic of a (web) application is written in these languages (and based on EXPath and vendor-specific extension modules), they can serve as a full replacement for Java or Python code, with the tight coupling to structured data types as the USP. Consequently, we believe that data structures should be as light-weight and low-level as possible (but still immutable).

michaelhkay commented 10 months ago

I was not actively involved in the specification of version 3.1, so I’m not sure why functions like array:put were added to the spec, as there are no such functions for sequences (presumably exactly because the operations of fixed-size arrays served as inspiration).

I've just in the last couple of days seen people advising how to change one item in a sequence using remove() and insert-before(). We added array:put() in response to a similar use case, without recognising that it might also be needed for sequences.
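
(Presumably something along these lines:)

(: replace the item at position $pos in $seq with $new :)
let $seq := 1 to 5, $pos := 3, $new := 99
return insert-before(remove($seq, $pos), $pos, $new)
(: result: 1, 2, 99, 4, 5 :)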

Saxon's default implementation for arrays is simply a Java ArrayList. Most arrays (e.g. those created by parsing JSON) never get modified; if an operation like append() or remove() or put() occurs, we make a copy into a "persistent" data structure that offers faster update but slower retrieval access. The persistent data structure we use is a "ZenoChain" which follows the same principles as the ZenoString described in my Balisage 2022 paper: it has similar performance characteristics to a Finger Tree but is much more space-efficient, except under the unusual scenario where there are lots of random inserts and removals in the middle of the array.

Similarly Saxon's default implementation for maps is based on a Java HashMap, and we switch to a functional data structure (a hash trie) the first time an update takes place. There's an optimised implementation for maps where all the keys are xs:string instances, which you get when parsing JSON, and for many literal maps, e.g. values of options parameters. We don't currently have an optimized implementation for records, but it's quite feasible.

ChristianGruen commented 10 months ago

I've just in the last couple of days seen people advising how to change one item in a sequence using remove() and insert-before(). We added array:put() in response to a similar use case, without recognising that it might also be needed for sequences.

Good to know. I’ve also seen FLWOR expressions for this use case, à la…

for $i at $p in $VALUE
return if ($p = $POS) then $NEW else $i

If fn:items-at were renamed to fn:get (#872), fn:put would be an obvious choice… if it hadn't already been taken by XQuery Update (and both function names may be too short to be intuitive). Maybe fn:get-item(s) and fn:put-item(s).

Saxon's default implementation for arrays is simply a Java ArrayList. Most arrays (e.g. those created by parsing JSON) never get modified; if an operation like append() or remove() or put() occurs, we make a copy into a "persistent" data structure that offers faster update but slower retrieval access.

We completely discarded the ArrayList in favor of the Finger Tree, which proved to be surprisingly efficient even for the creation of new lists (great kudos go out to Leonard Wörteler, who wrote the implementation). However, to save memory, we still store sequences of unmodified primitive values (integers, booleans, etc.) in arrays.

dnovatchev commented 10 months ago

I would say the most common 'sequential' data type that you 'update' would be list, which works in the same way as the textbook (e.g. like Haskell) list.

Arrays in XPath are functionally exactly like lists in Haskell (and in Python but immutable). I don't know why they didn't call them lists.

This can help us have a look at Haskell's and Python's functions on lists and see what we have missed so far in XPath.

michaelhkay commented 10 months ago

I don't know why they didn't call them lists.

Because JSON doesn't.

dnovatchev commented 10 months ago

I don't know why they didn't call them lists.

Because JSON doesn't.

Thanks, good to know.

Anyway, for me the primary, most expressive and not butchered sequential XPath datatype is the array, and it seems to me that the language could be built upon arrays alone, without the sequence data type - the same way Haskell doesn't have our sequence datatype, and people have been living happily with this for decades.

We cannot remove the sequence datatype due to historical (legacy compatibility and support) reasons, but with a little bit of discipline one can start using only arrays, and if people do so, then we could say "Bye" to the sequence in a few years' time.

If this sounds like Sci-Fi to you, then maybe it's time for a new language - we have the precedent of TypeScript as an improvement of JavaScript.

michaelhkay commented 10 months ago

The sequence type (with its origins in the XPath 1.0 node-set) has a very good fit to XML semantics. With an expression like ./author, it's very convenient to treat a singleton the same as a list of length one. In XML, if the schema allows one author element child then it's very easy to extend it compatibly to allow two. With JSON, you can't do that, and it causes real problems all the time.

And the idea that a set of one thing is distinct from the thing itself makes sense to mathematicians, but not to ordinary mortals.

dnovatchev commented 10 months ago

And the idea that a set of one thing is distinct from the thing itself makes sense to mathematicians, but not to ordinary mortals.

Seems quite a divisive argument...

Despite anecdotal evidence that "smarter" people seem to live longer (Isaac Newton died in 1727 aged 84, the philosopher-mathematician Bertrand Russell lived to 97, while Nobel Prize-winning neurobiologist Rita Levi-Montalcini died in 2012 aged 103), this is not a statistically significant or convincing argument for dividing people into "mere mortals" and others (including "mathematicians") 😄

Imagine what would happen at school if they took the opinion of the majority of kids about Maths and cut its content severely.

And if they had asked me when I was a kid, I would definitely have voted against drawing and other forms of art being studied at school. Good that they didn't... 😄

That something seems difficult should not be a justification for not trying to master it.

michaelhkay commented 10 months ago

Usability engineering is all about understanding the world view and mental models of your stakeholders; acknowledging that your target users are not homogenous is inclusive rather than divisive.

dnovatchev commented 10 months ago

Usability engineering is all about understanding the world view and mental models of your stakeholders; acknowledging that your target users are not homogenous is inclusive rather than divisive.

Yes. And I guess that Microsoft did good usability engineering, based on which they provide the many extension methods of the Enumerable class (Select, Where, Aggregate, and the rest of LINQ).

Certainly they knew that all of these methods can be expressed in other ways (with much longer, harder-to-read, and significantly more error-prone expressions).

Good usability engineering is to recognize the functionality that is used and needed most often and to provide the shortest and clearest way to express it, like you are just saying it in plain English.

People, within this issue you are worried about " p e r f o r m a n c e " ... whereas you have to be orders of magnitude more worried about the user, about providing him with the most simple and easy ways to express the most frequent/needed functionality.

Why do we care? Because we have got accustomed to, and take for granted, the good usability-engineering work that Microsoft did, and thus it strikes us as inexplicable why these goodies are missing here.

ChristianGruen commented 10 months ago

you have to be orders of magnitude more worried about the user

We speak to users of the language nearly every day, both novices and experts.

Feel free to create a separate issue on usability, and to make concrete suggestions on what to improve.

MarkNicholls commented 10 months ago

declare record person( name as xs:string, full := fn { string-join((?title, ?name), ' ') } );

@ChristianGruen

So going back to this, and the original question (I've removed "title"):

I may have caught up: you are saying, imagine a query that returns 1000 of these, but the only difference between the records is the name?

But you effectively have 1000 copies of "full"?

"full" will probably never change (i.e. like a method in an OO language).

If there were 10 "methods" then this situation would be worse.

Is this the crux of the question?

ChristianGruen commented 10 months ago

Is this the crux of the question?

@MarkNicholls Exactly. Even with immutable data structures, we would save both memory and runtime if a record contained only data (or references to data) that is expected to change in modified copies.

michaelhkay commented 10 months ago

Well, at worst you'll have 1000 references to the same function (each being an entry in a map). But it doesn't require that much ingenuity to come up with a representation that reduces the space overhead further.

I suspect Christian was more concerned with the time overhead of looking up a string key like "full" dynamically in a hash table, rather than reducing it statically to an integer offset in a fixed record structure. But Javascript seems to manage! There are all sorts of techniques available and I think it would be a mistake to worry too much about potential implementation strategies when designing the language. The history of compiler technology is one of compiler writers rising to the challenge of finding efficient implementation strategies for powerful language features, often long after the language features were first introduced.

ChristianGruen commented 10 months ago

Well, at worst you'll have 1000 references to the same function (each being an entry in a map).

…or many more; it's easy to imagine use cases with millions of record instances, just as many XML documents have millions of nodes.

I suspect Christian was more concerned with the time overhead of looking up a string key like "full" dynamically in a hash table, rather than reducing it statically to an integer offset in a fixed record structure.

It’s both. Indeed, the initial comment of this thread is mostly about what Mark stated; it’s the minimization of memory and runtime for data that never changes.

I (tend to, but) don’t want to get evangelistic. To date, XML languages have not been particularly known for their performance, even though the community has produced some impressive software solutions. Perhaps we must simply try harder to change that in the future.

MarkNicholls commented 10 months ago

I think it would be a mistake to worry too much about potential implementation strategies when designing the language

This is why I think it may be relevant, at least for a mental trip around the block, because it may be that the design creates an issue that doesn't really need to exist.

For example..... (I'm not suggesting this as a solution.)

In Haskell you wouldn't really have this problem (well, you would if you embed your functions in the data structure): you embed the functions in a separate structure (the type class), and this is instantiated once and passed around using magic.

i.e. the solution is in the design of the language.

Even if you don't have/use type classes, Scala implicits or even just good old-fashioned explicit dictionary passing separate the functions from the data structure, because in reality the functions don't "change".

Again, the solution is in the design of the language.

In an OO language, the methods are immutable (of course you can make them explicitly part of the data type, like 'data'); OO languages distinguish between stuff that changes and stuff that doesn't, so the vtable holds the pointers to the methods and the object has a single reference to the vtable.

As above....

I'm almost completely ignorant of javascript.

?

So in terms of design, it MAY be desirable to state what parts of the record change and what parts can't, and that may give benefits in terms of performance.

BUT.... saying that, that doesn't preclude kicking the metaphorical can down the road: a record syntax that doesn't do this can easily be extended to do it at a later date IF it is genuinely proven to be desired. So in that sense I'd tend to suggest KISS; it's probably a premature optimisation without any empirical evidence, and there may be ways to fix any real issues when they arise in the implementation without tinkering with the spec. But be aware that this line of demarcation between mutable and immutable data does usually exist in language specs, and performance probably benefits from it.

MarkNicholls commented 10 months ago

XML languages have not been particularly known for their performance, even though the community has produced some impressive software solutions. Perhaps we must simply try harder to change that in the future.

In my experience that's more of a prejudice (that I shared) than a reality and largely built on the performance of "older" v1.0 implementations.

I have done informal benchmarking and discovered that Saxon (but not some other XSLT engines) was considerably faster than hand-crafted Python code (it actually also surprised me just how slow Python was). OK, it's not quite up there with Kotlin/C#/Java/F#, but I was still quite impressed.

I think the principal problem is that the poor performance of some implementations tends to skew the perception..... Not saying that it's not a worthy endeavour; just that, sadly, perceptions aren't a fair judgement of the reality.

MarkNicholls commented 10 months ago

declare record person( name as xs:string, title := (), const full := fn { string-join((?title, ?name), ' ') } );

would do it? But does it actually have any real benefit in terms of performance? I can't say.... I'm just a user.

ChristianGruen commented 10 months ago

In my experience that's more of a prejudice (that I shared) than a reality and largely built on the performance of "older" v1.0 implementations.

I would claim it's both prejudice and reality: The better our implementations get, the more seriously they are compared with faster languages. There are always people, though, who’ve never tried to write ambitious projects with X* at all, and those like to stick with what they’ve heard or experienced 20 years ago.

declare record person(
  name as xs:string,
  title := (),
  const full := fn { string-join((?title, ?name), ' ') }
);

I had similar thoughts: The definition of a default implementation in the record declaration could already be an indication that a value needn’t be materialized in the record. An additional final, const or static keyword could indicate that the value must not be touched by e.g. map:put or map:remove.

MarkNicholls commented 10 months ago

@ChristianGruen

I take the point about reality; as I say, I don't think it's without merit, just that politics/opinion sometimes distorts the notion of facts/reality (hmmm... maybe I'll stop there).

Ah... I DIDN'T even see the fact that there was a default implementation there; is that currently valid syntax?

OK, I get it. I don't think I'd want a default implementation to mean it was "immutable"; I may want to change a "method" without resorting to all that GOF OO gymnastics. So I'd prefer an explicit "const" (or whatever) IF the implementors think it has some potential positive benefit, because in the context of an immutable world it has little value other than a performance optimisation.

C# has default implementations on an interface (though I've never used it); it can (I think) statically be overridden, but at run time all methods are immutable, so it's a slightly different context.... Can't remember what Scala does... I think traits may work something like this, but again, in a statically compiled language it means something different (same with Haskell and default implementations on type classes).

I don't know what TypeScript/JavaScript does... may be interesting to know.

but this is all predicated on...

I can't answer either.

P.S.

Never been a fan of "final" or "static"; "final" just seems an odd adjective to use, and "static" means all sorts of weird stuff in all sorts of languages, so it's too overloaded.

P.P.S.

There is an implementation mechanism that could do this without changing the syntax: you could assume all fields are "const", but when you find out they aren't, you flip the object representation into a "non-const" world, i.e. you effectively "override" the value from the record in the "object", so each field can exist in two states, either pointing at the record implementation or at a local override.

I don't think this is especially new; I think (can't remember what it was) some languages support this "object"-style inheritance, though probably for different reasons, and maybe in a different style, but it may be worth looking up.

The question then is: do the extra mechanics negate any performance (speed rather than space) benefits?

If the cost is minimal, then you leave it as an implementation optimisation.

MarkNicholls commented 10 months ago

Thinking about this a bit more.

it doesn't require that much ingenuity to come up with a representation that reduces the space overhead further.

You'd hold a reference to the record, and when someone looks up something, the code looks it up in the local map; if it's not there, it refers to the record definition.

If "const" exists in the language, you'll know this in advance, and so you save one lookup; if it doesn't, then you potentially have two lookups rather than one.

It doesn't feel like it's worth it to me; you may as well optimise it in the implementation (if required) and leave it out of the spec, especially as it doesn't really serve any purpose for the developer except a potential minor time improvement, i.e. no one will ever use it.

michaelhkay commented 10 months ago

I'm almost completely ignorant of javascript.

I don't see how we can have this discussion without understanding how Javascript solves the problem, since Javascript is what our data model is closest to.

michaelhkay commented 10 months ago

potential minor time improvement, i.e. no one will ever use it.

Indeed, I tend to work on the assumption that when you add features to the spec that are designed to enable users to tweak performance, then (a) 90% of users can ignore it because they don't have a performance problem, (b) most of the remainder don't know the feature exists, and (c) of those who do know the feature exists, half will misuse it.

unordered() is an obvious example.

dnovatchev commented 10 months ago

The question then is: do the extra mechanics negate any performance (speed rather than space) benefits?

If the cost is minimal, then you leave it as an implementation optimisation.

Exactly!

To quote Donald Knuth: "Premature optimization is the root of all evil..."

It is especially harmful to bring optimization into the early design stage -- that is, into what we are supposed to be doing.

Also, let us not rediscover the wheel. Others have done this successfully - many times. Are we incapable of just copying their wisdom?

dnovatchev commented 10 months ago

It doesn't feel like it's worth it to me; you may as well optimise it in the implementation (if required) and leave it out of the spec, especially as it doesn't really serve any purpose for the developer except a potential minor time improvement, i.e. no one will ever use it.

Exactly!

It is especially harmful to bring optimization into the early design stage -- that is, into what we are supposed to be doing.

Also, let us not rediscover the wheel. Others have done this successfully - many times. Are we incapable of just copying their wisdom?

ChristianGruen commented 10 months ago

Ah... I DIDN'T even see the fact that there was a default implementation there; is that currently valid syntax?

@MarkNicholls Please note that my initial comment refers to what the current specification defines. Neither the record constructor nor the declaration of optional default values is part of the specification yet. What exists is a so-called record test, which can be specified as a parameter or a return type. For example, when the following code is called…

let $f := function() as record(name as xs:string, title? as xs:string) {
  map { 'name': 'Angelopoulos' }
}
return ...

…we’ll know that the map returned by invoking the function will match the supplied record test.

Existing GitHub issues and upcoming pull requests build on that:

Lots of different ideas and proposals in this thread. Thanks everyone for your thoughts; I didn't expect so much feedback. I've deliberately assigned the »Discussion« label to this issue to indicate that it probably won't result in a single concrete proposal, but will rather serve as an inspiration for upcoming issues.

dnovatchev commented 10 months ago

In my experience that's more of a prejudice (that I shared) than a reality and largely built on the performance of "older" v1.0 implementations.

I would claim it's both prejudice and reality: The better our implementations get, the more seriously they are compared with faster languages. There are always people, though, who’ve never tried to write ambitious projects with X* at all, and those like to stick with what they’ve heard or experienced 20 years ago.

declare record person(
  name as xs:string,
  title := (),
  const full := fn { string-join((?title, ?name), ' ') }
);

I had similar thoughts: The definition of a default implementation in the record declaration could already be an indication that a value needn’t be materialized in the record. An additional final, const or static keyword could indicate that the value must not be touched by e.g. map:put or map:remove.

This seems like an implementation detail that must stay hidden from the user's surface. Of course any sensible implementation will use a reference as the value and not the full code of the function. But should the end user be involved with this detail? Absolutely not! The user needs to be able to concentrate on the original problem being solved - not on a multitude of unrelated implementation details that clutter and obscure this.

Anyway, if someone wants to change a function-value with another function-value of the same type, they clearly know what they are doing and the cost of it. They clearly know that they are creating a completely new (another) instance and that the original instance is never updated, because in XPath everything is immutable.

Do not try to support a drug user after his addiction has become a fact and when it is too late - better try to prevent this from the very start - completely.

dnovatchev commented 10 months ago
  • Yet another step could be the addition of keywords like final, which could prevent users from removing functions from a record (making the optimizer’s life easier shouldn’t be the main motivation).

Adding const or final into an immutable language can only bring confusion...

Users don't need this.

I don't need this and do not see what it has to do with the language I am using to solve any particular problem.

MarkNicholls commented 10 months ago

Anyway, if someone wants to change a function-value with another function-value of the same type, they clearly know what they are doing and the cost of it.

The issue is more the cost of allowing them to do this when they don't need to, not the cost of them actually doing it and trying to stop them. Const tells the environment not to bother allowing them when they don't, and thus it can optimise on that basis.

The user needs to be able to concentrate on the original problem being solved - not on a multitude of unrelated implementation details that clutter and obscure this.

I agree in spirit, except we accept "key" as "an implementation detail" to implement algorithms for which using predicates is impractical.

Non-functional aspects can't always be easily brushed under the carpet.

(Though I suspect this one can.)