qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/

Function families #757

Closed michaelhkay closed 6 months ago

michaelhkay commented 11 months ago

We talked on the call today about the tension between defining multiple simple functions focussed on one task, and a small number of omnibus functions that have many different options.

I think we would all agree that multiple simple functions would be the better choice except for the problem that they all end up going into a single global namespace. So the question becomes, how can we better partition the name-space (using the term deliberately with a hyphen).

We're reluctant to use the namespace mechanism to partition our function library because namespaces are cumbersome and clutter the code with lots of boilerplate; declaring namespaces for binding function libraries also has side-effects for example on the semantics of element constructors.

One approach would be to build on the idea that @dnovatchev presented of using maps containing anonymous functions, so for example csv()?parse() would first call fn:csv() to load a family (or library) of functions, of which one is then selected for execution. This works, but I don't think it's a perfect solution; for example static analysis becomes a lot more difficult, and we don't get the benefits of default parameters and keyword arguments.
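As a sketch of that pattern (purely illustrative - fn:csv does not exist, and the member functions below are invented for this example), the "family loader" idea is expressible in XQuery 3.1 today:

```xquery
(: Hypothetical: a loader returning a map (a "family") of anonymous functions.
   Neither fn:csv nor these member functions exist in any specification. :)
declare function local:csv() as map(*) {
  map {
    "parse": function($text as xs:string) as array(*)* {
      (: toy parser: one array per non-empty line, split on commas :)
      for $line in tokenize($text, '\r?\n')[. ne '']
      return array { tokenize($line, ',') }
    },
    "serialize": function($rows as array(*)*) as xs:string {
      string-join($rows ! string-join(?*, ','), '&#10;')
    }
  }
};

local:csv()?parse("a,b&#10;c,d")
```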

Most languages use hierarchic names with "." as a separator. Although XML names allow "." as a regular character, none of our built-in function names currently use it as such. So it would be entirely possible to adopt a convention where names like csv.parse() etc are used to name functions in a function family referred to as "csv". This wouldn't by itself require any language changes.

But if we adopted this convention, we could build on it to provide usability tweaks that make a large function library easier to manage. For example, we could put the math functions into the fn namespace with names like fn:math.sin(x), and then provide a way of binding a namespace prefix to a subtree of the fn namespace, so math:sin becomes a synonym for fn:math.sin(). The immediate benefit is that the namespace prefix doesn't need to be declared unless people want to use it. We could also then consider defining an algorithm for searching the fn namespace for abbreviated names such as sin(x), perhaps with some form of "import functions" declaration that says which subtrees of the fn namespace are to be searched.
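To illustrate the shape of that proposal (all of the syntax below is invented for this sketch; none of it is defined in any specification):

```xquery
(: Hypothetical 4.0 syntax - nothing here exists today. :)

(: Full, dotted name inside the fn namespace: :)
fn:math.sin($x)

(: Proposed: bind a prefix to a subtree of the fn namespace,
   so math:sin becomes a synonym for fn:math.sin. :)
math:sin($x)

(: Proposed: an "import functions" declaration naming the subtrees
   to search, so that the abbreviated name resolves unambiguously: :)
sin($x)   (: resolved to fn:math.sin via the search list :)
```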

dnovatchev commented 11 months ago

Most languages use hierarchic names with "." as a separator. Although XML names allow "." as a regular character, none of our built-in function names currently use it as such. So it would be entirely possible to adopt a convention where names like csv.parse() etc are used to name functions in a function family referred to as "csv". This wouldn't by itself require any language changes.

But if we adopted this convention, we could build on it to provide usability tweaks that make a large function library easier to manage. For example, we could put the math functions into the fn namespace with names like fn:math.sin(x), and then provide a way of binding a namespace prefix to a subtree of the fn namespace, so math:sin becomes a synonym for fn:math.sin(). The immediate benefit is that the namespace prefix doesn't need to be declared unless people want to use it. We could also then consider defining an algorithm for searching the fn namespace for abbreviated names such as sin(x), perhaps with some form of "import functions" declaration that says which subtrees of the fn namespace are to be searched.

This would work well in the case of standard (universally available and defined in the official specifications) functions, but (see below) is not limited only to this case.

One approach would be to build on the idea that @dnovatchev presented of using maps containing anonymous functions, so for example csv()?parse() would first call fn:csv() to load a family (or library) of functions, of which one is then selected for execution. This works, but I don't think it's a perfect solution; for example static analysis becomes a lot more difficult, and we don't get the benefits of default parameters and keyword arguments.

We still haven't completely rejected the idea that dynamic functions could be defined and called using keyword arguments and default argument values.

Also, it is not correct to mention only the issues with a possible strategy/solution - the benefits of such a solution also need to be accounted for and considered. Write once, use everywhere; eliminate redundancy and multiple code versions for different host languages; eliminate possible errors in maintaining versions for different hosts; eliminate the manual effort of keeping different versions of the code in sync - just to name a few of the advantages of the XPath Function Libraries.

Notice that the dot-delimited name-spaces are not mutually exclusive with the idea of having XPath Function Libraries. We can perfectly well define an XPath function library (the map contained in a variable whose name is in) a namespace such as problem-area, and also have a more specific function library in the name-space sub-area.problem-area. This can be done immediately, at present, for any XPath 3.1 function library, and upon loading the library, the dot-delimited namespace can be the namespace of the variable that will contain the loaded library.
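Something along these lines is expressible in plain XQuery 3.1 today (the namespace URI and function bodies are invented for the illustration):

```xquery
(: A "function library" is just a map held in a namespaced variable. :)
declare namespace geometry = "urn:example:problem-area:geometry";

declare variable $geometry:lib := map {
  "area-of-circle": function($r as xs:double) as xs:double {
    math:pi() * $r * $r
  },
  "perimeter-of-circle": function($r as xs:double) as xs:double {
    2 * math:pi() * $r
  }
};

$geometry:lib?area-of-circle(2)
```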

Another way to do this would be to use the resource uri (full URL-path or file-path) as the function-library namespace. For example one can have XPath Function Libraries that are provided using URLs such as: collection, generators.collection, map.collection, array.collection, sequence.collection, etc.

Yet another possibility is to have the function names in a single, big library, compliant with the dot-separated name convention, and to ask the loader to load only those functions of the function library whose name has a specified suffix (or prefix, depending on how we specify sub-families: left --> right or left <-- right). For example, get only the functions from a function library whose name ends with ".array".
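A loader like that could be sketched today with nothing beyond the XQuery 3.1 map functions (the library contents below are invented; a real loader would read the map from a resource):

```xquery
(: Keep only the entries of a flat, dot-named library that belong to
   a given sub-family, identified here by a name suffix such as ".array". :)
declare function local:load-sub-family(
  $library as map(*),
  $suffix  as xs:string
) as map(*) {
  map:merge(
    for $name in map:keys($library)
    where ends-with($name, $suffix)
    return map:entry($name, $library($name))
  )
};

local:load-sub-family(
  map {
    "reverse.array":    array:reverse#1,
    "reverse.sequence": fn:reverse#1,
    "size.array":       array:size#1
  },
  ".array"
)
(: => a map with only the "reverse.array" and "size.array" entries :)
```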

So, let us work on this promising idea without artificially presenting the collection of standard functions on one side, and the many possible diverse, decentralized, unrelated and unsynchronized XPath Function Libraries on the other, as conflicting and mutually exclusive.

michaelhkay commented 11 months ago

Thanks for the comments. The ideas in the initial issue should be taken as a sketchy concept that needs a lot of further thinking to take it forward; some of the ideas you contribute are definitely among the not-fully-developed ideas I already have to add to the mix.

michaelhkay commented 11 months ago

My thinking is roughly this:

We define a global function namespace, say http://qt4cg.org/functions for function names; the local names of functions in this namespace are organised in a hierarchy of function families, and use "." as a path separator.

A namespace URI can map to any node in this hierarchy.

The URI http://www.w3.org/2005/xpath-functions maps to the "sys" function family. The full name of fn:count is thus Q{http://qt4cg.org/functions}sys.count.

The URI http://www.w3.org/2005/xpath-functions/array maps to the sys.array function family. The full name of array:size is thus Q{http://qt4cg.org/functions}sys.array.size; but the function can also be referred to as array:size or fn:array.size; when fn is the default function namespace, it can be referred to as array.size, which means the function is now available without declaring the array namespace.

It now becomes much easier to add new function families. For example we can define a family of functions whose names take the form sys.csv.* and these become available to users under names such as csv.parse() without the overhead of declaring yet another namespace. But if users want to bind a namespace to this function family, they can use xmlns:csv="http://qt4cg.org/functions#sys.csv".

Outwith the "sys" subtree, all names in the global function namespace are available to users.

The next stage is to define some mechanism that allows abbreviated function names, in the same way that for example Java allows java.util.Map to be referred to as Map provided that the package (=function family) has been imported.

All problems in computer science can be solved by adding another level of indirection.

dnovatchev commented 11 months ago

I do not recommend having huge libraries, but if such already do exist, we can represent them as a (deeply) nested map structure, where the leaves contain functions.

let $f := map {
  "sys": map {
    "fn": map {
      "count": fn:count#1
      (: all fn function name-value pairs here :)
    },
    "math": map {
      "sin": math:sin#1
      (: all math function name-value pairs here :)
    },
    "array": map {
      "append": array:append#2
      (: all array function name-value pairs here :)
    },
    "map": map {
      "put": map:put#3
      (: all map function name-value pairs here :)
    }
    (: etc. :)
  }
  (: etc. :)
}

Then, one can select the sin function (without the need to know and/or declare any namespaces) by navigating:

$f?sys?math?sin

Or, in case we have a deep-map selection operator, then simply:

$f??sin

Or maybe even with:

??sin
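Neither ?? nor a bare ??sin exists in XPath 3.1, but a deep lookup over nested maps can be emulated with a small recursive helper (an illustrative sketch):

```xquery
(: Recursively search nested maps for entries with the given key. :)
declare function local:deep-lookup(
  $input as item()*,
  $key   as xs:string
) as item()* {
  for $item in $input
  return
    if ($item instance of map(*)) then (
      if (map:contains($item, $key)) then $item($key) else (),
      map:for-each($item, function($k, $v) { local:deep-lookup($v, $key) })
    )
    else ()
};

local:deep-lookup($f, "sin")   (: ~ the hypothetical $f??sin :)
```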

michaelhkay commented 11 months ago

I know that we can organise dynamic function items into this kind of structure, but I'm looking for a better way to organise the names of static function definitions, which I think is a different problem.

rhdunn commented 11 months ago

Using . in the names introduces an inconsistency in the naming convention of the functions in the spec. The function names are all sentence case, so I would prefer to keep that convention.

In e.g. MarkLogic, the functions in a group are named [prefix]:[object-scope]-[function] -- this is similar to how the op: functions are named.

The names in the FO spec are named to be readable in an SVO (subject verb object) language, so you have e.g. hours-from-time instead of time-get-hours and parse-json. I'd prefer to keep this, otherwise we would have multiple naming conventions.

Having some logic to make scoped names work like the fn:math.sin example would need consideration for how it would work in the language for users to make use of and be backward compatible. -- I don't want another internal feature that is specific to processors and not available to users of the language.

If we do have that feature, I would prefer the standard to keep the snake case so we would have fn:math-sin. Although I'm not convinced about adding aliases of functions we have into the fn namespace.

michaelhkay commented 11 months ago

I don't see it so much as an "inconsistency" as a new feature/convention layered on top of what we already have. We currently have a two-level hierarchy for function names (namespace + localName) and the proposal turns this into a hierarchy with any number of levels. Naming conventions for the final part of the name are unchanged; we just add more flexibility in how the name is qualified.

It's true of course that if we had had this capability from the start, we would have named things differently. But this strikes me as a way of moving forward incrementally, and tackling the problem that our current mechanisms for partitioning the name space are too coarse-grained.

rhdunn commented 11 months ago

The inconsistency is around where we would have fn:parse-json from the 3.x specs, but introduce e.g. a fn:csv.parse function in the 4.0 specs -- using fn:parse-csv would be consistent with the established FO naming conventions.

ChristianGruen commented 11 months ago

If we designed an entirely new language, I’m sure we would do some things completely differently. With the given situation, I share some of Reece’s concerns:

The existing naming conventions (or what can be derived as conventions from the given function names) would conflict with the new conventions, with both approaches having advantages and drawbacks. Apart from that, we have lots of XPath 1.0 functions with (from today’s perspective) insufficient names, such as fn:contains. sys.contains or fn:sys.contains would rather make the situation worse, even if we have fallbacks to still be able to use fn:contains. We would actually need to go further and introduce reasonably named aliases, such as string.contains, and then we’d eventually end up introducing aliases for the majority of the existing function set - which hasn’t been motivated in this issue anyway, but I think would be the next logical step.

From the XQuery perspective, I don’t feel there is a need to improve things: It’s easy and intuitive to work with statically declared namespace bindings for the existing XQFO URIs (map, array, math), and it’s straightforward to declare namespaces for optional module sets. Most functions that we have added to XQFO 4.0 so far would stay in the standard name-space anyway as their scope is general enough. If we introduced functions like csv.parse, the naming scheme would differ from parse-json unless we introduce additional json.parse aliases.

Again, from the XQuery perspective, I would rather be happy to see module imports simplified (see #764).

Or is this issue mostly about XPath, and the challenge of declaring namespace prefixes for functions that are not in the fn namespace?

ChristianGruen commented 11 months ago

Somewhat related, Norm’s initiative to build a QT module repository: #274.

michaelhkay commented 11 months ago

The trigger for the issue was the desire to add a small family of functions for CSV handling. Neither of the two obvious approaches ((i) add several closely related functions to the fn namespace, (ii) define a new namespace) is particularly appealing. We've faced the same problem before (for example, the items-before/after/starting-with/… family) and it would be nice to have a better solution.

ChristianGruen commented 11 months ago

The trigger for the issue was the desire to add a small family of functions for CSV handling. Neither of the two obvious approaches ((i) add several closely related functions to the fn namespace, (ii) define a new namespace) is particularly appealing. We've faced the same problem before (for example, the items-before/after/starting-with/… family) and it would be nice to have a better solution.

Thanks. In that case, I would definitely prefer a solution that does not deviate from the existing JSON and HTML functions. Otherwise, it would just be another case of inventing something new and deliberately ignoring the current state. Importing CSV is similar enough to importing JSON and HTML, and I believe that the structure of 99% of CSV data is simple enough to be represented in, and accessed via, simple and tabular XML/XDM structures (which are perfectly supported by path and lookup expressions).

dnovatchev commented 11 months ago

I know that we can organise dynamic function items into this kind of structure, but I'm looking for a better way to organise the names of static function definitions, which I think is a different problem.

These two problems are isomorphic.

And we could provide a standard (as part of FO) map that organizes all static functions into meaningful semantic hierarchies.

In fact, in the TOC of the FO document we have already grouped the functions of the fn namespace into many groups.

Providing a standard map with this structure would be a valuable tool for restricting attention to exactly the wanted function group, and for easily remembering what functions are available to do some specific task.

This could be extremely useful in an IDE which prompts with the available next-levels of the function-hierarchy and displays just the function-names from the currently-reached function-group.

Thus, the grouping is already done for us and exists. The semantic function-hierarchy has been hiding in plain sight for years as part of the official spec document. What remains is to organize the function names to reflect this existing semantic hierarchy, so that users could easily identify a function (out of hundreds) by just a few qualifiers (or in many cases even a single one).

Arithmeticus commented 11 months ago

I'm personally on the fence on this issue, and would like to hear more.

If one wished to develop this idea further, it would be quite feasible to write a small XQuery or XSLT library that reassigns and therefore remodels all standard functions in a proposed alias layer.

For those advocating a new naming layer, might I suggest an approach that is not hierarchical but rather class-based? For example, fn:parse-csv() could be aliased as both fn*:parse.csv() and fn*:csv.parse(), so that users don't have to memorize a tree hierarchy.

dnovatchev commented 11 months ago

For those advocating a new naming layer, might I suggest an approach that is not hierarchical but rather class-based? For example, fn:parse-csv() could be aliased as both fn*:parse.csv() and fn*:csv.parse(), so that users don't have to memorize a tree hierarchy.

Hierarchical is helpful for fast searching, like a restaurant menu: one can find their preferred dish in logarithmic time.

On the other hand, using classes (not structured in a hierarchy) requires you to "know the magic word(s)", and in this respect is not much different from the inflexible flat namespace-URIs.

ndw commented 11 months ago

I'm genuinely sympathetic to the issue, but I'm worried that attempts to improve the situation will only make it worse.

First, we can't, practically, rename existing functions. So any solution we invent has to live alongside the current set of functions. We could say that new functions would only go in the new hierarchical mechanism (whatever that is), but that feels like a very large discontinuity to force on users: fn:concat() is a function, but parse-csv is only available through a map or some other mechanism? So does that mean we have to perpetuate the old system in parallel with the new system indefinitely?

(Even if we say that fn:concat is also available through the new mechanism, that doesn't really help. I have fn:concat all over my existing code base and it's in the editor macros and templates that I use. Replacing it with something new just for consistency seems like a very low priority task. It's one of those things that seems like a good idea, but rarely seems important enough to justify the very real risk that a typo or other error when doing the change will introduce a real bug.)

Perhaps more problematic is the fact that there isn't a single, logical grouping. This is one of those categorization problems where there isn't a right answer. Do we group the JSON functions together or do we group the parsing functions together? Where does parse-json go? No matter which answer you choose, it's at least as wrong as it is right.

And an argument that you could do both is even worse. If $fn?parse-json and $fn?json-parse are synonyms, how is that useful to me, the casual user? Which should I use? How can I tell that they're the same? It feels like an unnecessary cognitive burden to have to know that when I see $fn?parse-json in your code, that's the same as $fn?json-parse in my code.

I think the bar for doing this should be very high. We should have to persuade ourselves that we're confident that the solution is really an improvement and not just a new way for users to become confused or overwhelmed.

(I'm already suspicious that casual users will find dereferencing function items from a map so utterly confusing and un-obvious that they'll just give up. Personally, I'm fine with $fn?concat('a', 'b') but for a casual user, one who's never used HoF and possibly never used the ? operator to get values out of maps, there is some chance that that code fragment just looks as impenetrable as line noise. I have explicitly made this a parenthetical remark because I fully expect some readers to push back that I'm just wrong, that it's easy to teach casual users who very, very steadfastly refuse to self-identify as programmers that maps and higher-order functions are perfectly easy to understand. Respectfully, I disagree. But that's not really the point of this comment.)

dnovatchev commented 11 months ago

(I'm already suspicious that casual users will find dereferencing function items from a map so utterly confusing and un-obvious that they'll just give up. Personally, I'm fine with $fn?concat('a', 'b') but for a casual user, one who's never used HoF and possibly never used the ? operator to get values out of maps, there is some chance that that code fragment just looks as impenetrable as line noise. I have explicitly made this a parenthetical remark because I fully expect some readers to push back that I'm just wrong, that it's easy to teach casual users who very, very steadfastly refuse to self-identify as programmers that maps and higher-order functions are perfectly easy to understand. Respectfully, I disagree. But that's not really the point of this comment.)

I am not going to argue about this. The bad thing - having 155 (in XPath 3.1, and maybe 250+ in 4.0) different standard functions - is already done. We, the existing users, can somehow still live with this.

But what about potential new users? Chances are they could be repelled by this immense, unstructured and disorganized bulk of functions, and find it a good enough justification to never again have even a look at these languages.

In 30-40 years, when most of the current users will no longer be actively using these languages, can we assume that there will be new users at all, given, among other things, the disorder and lack of structure of the standard functions?

Thus, it seems to me that this is a significant, maybe critical, problem that we shouldn't ignore if we are trying to do what we are supposed to be doing.

Arithmeticus commented 11 months ago

Such a simplification can be done today, by creating your own XQuery or XSLT library that does nothing but re-aliases the standard functions (or locks them into one or more maps) however you like, and this can be distributed as one likes to new users. Great, go for it. I've done a bit of this myself, giving more meaningful (to me anyway) aliases for functions. But why should it be put into the specs themselves, at the risk of confusion?

And good luck agreeing on a taxonomy.

dnovatchev commented 11 months ago

Such a simplification can be done today, by creating your own XQuery or XSLT library that does nothing but re-aliases the standard functions (or locks them into one or more maps) however you like, and this can be distributed as one likes to new users. Great, go for it. I've done a bit of this myself, giving more meaningful (to me anyway) aliases for functions. But why should it be put into the specs themselves, at the risk of confusion?

@Arithmeticus , Yes, but not every user will do this - in fact few will.

Those who don't (quite probably the vast majority of users) will continue to be overwhelmed by the enormity and lack of structure of the huge set of standard functions. Thus, we have identified the problem and we can try to do something about it, or, alternatively, we could make an informed decision to ignore the problem completely. If we go for the latter (do nothing), then we will bear the responsibility for this decision.

Not all of us obviously have the same position - the mere fact that @michaelhkay submitted this issue shows that he cares a lot. I also happen to support the thinking that something can and needs to be done for solving, or at least alleviating this problem.

And good luck agreeing on a taxonomy.

Even an imperfect solution in this case is better than not having any solution at all and doing nothing.

Let us try to learn from the development practices of other products. For example, Google's Angular has a new release every 6 months and the users of a specific version are given support for 6 + 12 months (a total of 18 months).

With such an active release cadence, most people's code would become obsolete pretty soon.

This has been avoided completely by providing the ng update functionality of the Angular CLI, which takes code using an older version of Angular and automatically updates it to the latest/current version.

Obviously, this is a general approach that is feasible with many other products, completely unrelated to Angular.

So, just the facts:

  1. Other organizations have successful solutions for even bigger version-evolution and update problems.
  2. But in our case we prefer to do nothing about it.
Arithmeticus commented 11 months ago

I understand the need and am sympathetic. The question is whether it should be the job of the specs to do this, or whether it should be pushed to the community. If the former, is there a reason other than convenience, and does the convenience outweigh the potential clutter/confusion? (I well remember @dnovatchev's posts on the Slack community complaining about the sprawl of the specs.)

IMO, a significant problem lies outside the remit of this community group: there is comparatively little infrastructure developed for the X languages (e.g., what the Cheese Shop is to Python), so significant XSLT/XQuery function libraries (e.g., EXPath, mathling, TAN) are difficult to find, and many new users think that all they have access to are the standard functions, and perhaps the Saxon extension functions.

Put another way, if we had a thriving ecosystem where hundreds of public XSLT/XQuery libraries/packages were being actively developed, shared, and distributed, I suspect this group's inclination would be to delegate the task of re-aliasing/re-packaging the standard function library to members of that ecosystem.

michaelhkay commented 11 months ago

Let me just say that I agree with everyone here! (a) Yes, there's a real problem, (b) Solving it is likely to be disruptive, at least psychologically, (c) There is indeed a real risk that any attempt at a solution makes things worse rather than better, (d) if we're confident that a change really is an improvement and that transition costs are manageable, then we should be prepared to be a bit radical.

dnovatchev commented 11 months ago

All problems in computer science can be solved by adding another level of indirection.

Except the problem of having too-many levels of indirection 😄

fidothe commented 11 months ago

Following up from last week's QTCG call, I'm trying to summarise the bits of the discussion around fn:parse-csv and friends that had wider implications, and to add commentary where I think I ought:

I'll deal with the first two in comments on this issue.

Function proliferation

This has already helpfully been pulled into a separate issue (this one) by @michaelhkay, so I'll try and say very little here, except that I think it's worth thinking about function proliferation and the growth of thematically-linked groups of functions as different issues that are deeply entwined.

As far as proliferation goes, I think that I would observe that all our choices suck, but the status quo sucks least. The accretion of functions in the global namespace has happened, and our best point of comparison is probably PHP circa 2005. PHP's basic all-the-functions-in-the-global-namespace situation has prompted grumbling, but users have gotten on with using it, and it persists even after the addition of Objects and Classes starting in PHP 5.

I think that the solution for discovering and making use of a large set of standard functions is good documentation, which is true whether we have a hierarchical structure we can reflect in documentation, or a flat structure we can corral with documentation.

There is a real tension between writing documentation that just serves the needs of users and documentation that serves as a specification for implementers. I think that the ease of use of the spec as documentation for regular XPath end users currently suffers as a result, and resolving that is very hard.

fidothe commented 11 months ago

Following on the themes from the QTCG call on 2023-10-17:

Grouping related functions

This is the other half of what this issue talks about, and what @christiangruen is talking about in #748. There are two things to cover, from our discussion and the thread here: naming conventions and name hierarchies.

@rhdunn talks about the subject-verb-object approach to naming in https://github.com/qt4cg/qtspecs/issues/757#issuecomment-1770265891, contrasting it with other approaches that put categorisations and grouping first in the name, providing a sort of alphanumeric grouping by prefix. The SVO approach (the example was hours-from-time, whose cousin is seconds-from-time) does you no favours when grouping by alphanumeric sorting, but can communicate intent more clearly.

It seems like this SVO naming philosophy has not been universally applied, but has been used when it increases clarity, rather than obscuring through verbosity. (What would a strict SVO name for fn:insert-before be, and would it add clarity? fn:item-inserted-before-other-item-in-sequence?) The parse functions have used a more mixed approach, with fn:parse-json and fn:parse-html not really saying what sort of thing they'll parse the whatever-it-is into, but that not really being a problem because the possibilities were initially quite restricted. JSON parsing introduced a flipped version of SVO with fn:json-to-xml - a sort of time-to-hours. This makes it clear what it does, but suffers the same problem of related functions being unrelated in an alphanumerically sorted list.

@dnovatchev, in https://github.com/qt4cg/qtspecs/issues/757#issuecomment-1769723927, calls out the one existing tool we have for hierarchical grouping, namely higher-order-functions and maps.

XML Namespaces as a grouping tool suffer from being both (a) non-hierarchical and (b) a source of significant processing overhead.

Convention-based approaches, like using a . as a convention for hierarchy within function names, provide some benefits but no actual hierarchy, so use beyond a top-level grouping a la map:* and array:* doesn't help with allowing short, easy names in a specific context the way that traditional hierarchical namespaces do.

I think that we're in a situation where constraints of historical praxis and precedent, use, and context (as a language embedded in others) mean that there aren't any great options. (See also @ndw above: https://github.com/qt4cg/qtspecs/issues/757#issuecomment-1772357727.) I also think that the problem of discoverability can be attacked outside of the code. Better user-focussed documentation overviews would go a long way to making it more obvious what the possibilities are and what functions there are for specific tasks. The spec is constrained by needing to be useful to implementers, which often makes it harder to approach as a user.

Separate documentation for functions - easily searchable, allowing grouping of related functions, and split across multiple pages - could be built from the existing sources with some extra metadata for categorisation and a fairly small amount of extra text for larger grouping pages (like a page giving a brief overview of a group of functions such as the map:* set). Being able to generate 'Docsets', the format used by Dash and other tools, could, I think, provide a big boost to how easy it is for users to consume the function docs. If this were possible, and maintainable, in a reasonable manner, then it might go a long way to allowing us to provide user-focussed documentation without neglecting implementers.

michaelhkay commented 6 months ago

Another stimulating discussion that fizzled out without any clear proposal for a concrete way forward; so I think it's time to close it.

ndw commented 6 months ago

The CG agreed to close this issue without further action at meeting 070.