Open ChristianGruen opened 1 year ago
The functions for parsing input have been defined by different people, and the current state is quite inconsistent:

| Function | Parameters |
| --- | --- |
| `fn:parse-xml` | `$value as xs:string?` |
| `fn:doc` | `$href as xs:string?` |
| `fn:parse-json` | `$value as xs:string?, $options as map(*)` |
| `fn:json-doc` | `$href as xs:string?, $options as map(*)` |
| `fn:parse-html` | `$html as union(xs:string, xs:hexBinary, xs:base64Binary)?, $options as map(*)` |
| `fn:parse-csv` | `$csv as xs:string?, $options as map(*)` |

I believe there’s some need to unify the functions, and we could at least:

- introduce an `fn:XYZ-doc($href, $options)` function for each input format (with at least one `encoding` option), and
- restrict the type of the input parameter of `fn:parse-XYZ` to `xs:string?` and always name it `$value`.

And I wonder if we should tag all `fn:XYZ-doc` functions as ·nondeterministic· (if it’s not too late)?
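For illustration, a minimal usage sketch of the proposed shape; `fn:xml-doc` is hypothetical, and the `encoding` options shown for the `*-doc` functions are part of this proposal, not published spec text:

```xquery
(: Sketch only: fn:xml-doc does not exist, and the encoding options are
   this proposal, not published spec text. :)
let $xml  := fn:xml-doc("data.xml", map { "encoding": "UTF-8" })
let $json := fn:json-doc("data.json", map { "encoding": "ISO-8859-1" })
(: ...while every fn:parse-XYZ would uniformly accept $value as xs:string? :)
let $csv  := fn:parse-csv(fn:unparsed-text("data.csv"))
return ($xml, $json, $csv)
```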
Isn't there also `fn:parse-xml-fragment`? So, shall we have two groups of parsing functions: one for docs and one for fragments? Aren't docs also a kind of fragment themselves?

As for the name of the input parameter, it should be obvious that the name "input" is more precise than "value". In fact, "value" seems to be a most generic and useless name: everything can be regarded as the value of something.
> Isn't there also `fn:parse-xml-fragment`?
Yes, there are various functions that I didn’t list here, including `fn:json-to-xml` and the additional CSV functions. I’m not sure if we need a dedicated `fn:doc-fragments` function?
> As for the name of the input parameter, it should be obvious that the name "input" is more precise than "value". In fact, "value" seems to be a most generic and useless name: everything can be regarded as the value of something.
I agree, but this would conflict with the current conventions for naming the functions in the spec. I’ve forgotten where the semantics had been specified; simply put, atomic parameters are called `$value`.
Could rename `doc` to `xml-doc`, and add a new function `doc` that can load any kind of input and detect the kind automatically (e.g. from the HTTP `Content-Type` header).
> Could rename `doc` to `xml-doc`, and add a new function `doc` that can load any kind of input and detect the kind automatically (e.g. from the HTTP `Content-Type` header).
Probably too late, as we cannot change the behavior of existing functions. We could introduce an options parameter to `fn:doc`, but it will be difficult to do justice to everyone, as the exact behavior of the function depends a lot on the implementation (for example, the referenced input can be stored in a database or in the file system).

However, we could introduce an `fn:xml-doc` function with much stricter semantics. Possible options could be (sketched below):

- `encoding`
- `strip-whitespaces`
- `strip-namespaces`
- `parse-dtd`
- `parse-xinclude`
- `catalog` (would be a topic for another issue)
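A call might then look like this; a sketch only, since neither the function nor its option names are finalized:

```xquery
(: Sketch only: neither fn:xml-doc nor these option names are finalized. :)
fn:xml-doc("catalog.xml", map {
  "encoding":          "UTF-8",
  "strip-whitespaces": true(),
  "strip-namespaces":  false(),
  "parse-dtd":         true(),
  "parse-xinclude":    false()
})
```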
See also issue #490
The rationale for allowing `fn:parse-html` to take binary objects is so that it can use the HTML encoding detection/conversion logic, and be compatible with content sent as binary, in which case the user does not need to implement their own encoding detection/decoding logic.
> The rationale for allowing `fn:parse-html` to take binary objects is so that it can use the HTML encoding detection/conversion logic, and be compatible with content sent as binary, in which case the user does not need to implement their own encoding detection/decoding logic.
Hi Reece, I agree that’s a good idea. I believe it would be similarly relevant for XML input, so I think we should either restrict binary input to a new `fn:html-doc` function or also allow binary items for `fn:parse-xml` and (ideally) the other parse functions.
My reservation on restricting encoding to a `fn:*-doc` function is that binary/encoded text could come from other sources: network requests, zipped/compressed files, etc.

Allowing binary items on other parse functions would be useful for a similar reason.
As we’ve currently no standard function that allows us to read binary contents (which could then be processed with `fn:parse-XYZ`), I’ve just updated #557.
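For illustration, assuming the `fn:unparsed-binary` function proposed in #557, the composition could look like this sketch:

```xquery
(: Sketch, assuming the fn:unparsed-binary function proposed in #557. :)
fn:parse-html(fn:unparsed-binary("page.html")),  (: parser applies its own encoding detection :)
fn:parse-xml(fn:unparsed-binary("data.xml"))     (: if binary input were also allowed here :)
```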
Following on from the QTCG meeting of 2023-10-17, I've tried to summarise the discussion about `parse-csv` that ranged much wider. Part of that was discussion of parse functions in general:
There were a lot of questions regarding the scope and naming of parsing functions, and the approaches that had been, or could potentially be, taken.
The two scope approaches were, broadly, several single-purpose/single-output-format functions, or one multi-purpose function whose output format was controlled with an option passed in an options parameter map.
There were specific questions about CSV and why there were two functions proposed that had XDM output instead of one.
`fn:parse-csv`, as proposed, produced very basic output that could be used to build more complex processing on, while `fn:csv-to-xdm` and `fn:csv-to-xml` produced a more generalised, but richer, output that could be processed immediately.
With `fn:parse-json`, `fn:parse-html`, and `fn:parse-xml`, the `parse-*` function returns the immediately useful output.
This confusion suggests to me that when consumption support is added for a new data format, it should include a `parse-*` function that produces immediately useful output; and, if the precedent established by `parse-json` and `json-*` is followed, extra functions should prefix their name with the format, following the `format-verb-output` naming structure used by `json-to-xml` where possible. (I am biased in favour of more functions with limited scope over fewer functions that do more...)
There was discussion about what input parse functions should be able to accept. `json-doc` acts almost, but not exactly, like `parse-json(unparsed-text('uri.json'))`, which was offered as both a potential convention to follow and a confusion/source of proliferation.
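As a sketch of that near-equivalence (the divergence is the one noted in a later comment: `json-doc` copes with characters that are invalid in XML, which `fn:unparsed-text` rejects):

```xquery
(: json-doc and its longhand near-equivalent; they differ, e.g., when the
   JSON contains characters that are not valid in XML. :)
fn:json-doc("uri.json"),
fn:parse-json(fn:unparsed-text("uri.json"))
```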
Since that discussion, further discussion about handling binary data, from the proposal for `fn:unparsed-binary` in #557, has happened there and in this issue's comments.
For myself, I have been thinking about this and wondering if `unparsed-text` (and `unparsed-binary` or whatever fills that need) could be used as input to the `parse-*` (and `json-to-*`, `csv-to-*`, and friends) functions. They ostensibly return `xs:string`, but in my wondering they are somewhat lazy, which permits streaming data as it comes in if the parse functions support that. Currently only `json-doc` is in a published standard; perhaps we could avoid using that as a precedent if there's a way to compose `unparsed-text`/`unparsed-binary` and the parse functions in a way which doesn't require the special-case shim within `json-doc`.
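A sketch of that composition style, under the stated assumption that an implementation may evaluate the reader's result lazily and stream it into the parser:

```xquery
(: Sketch of the composition idea: no new convenience functions needed,
   and an implementation is free to stream text from the reader into the
   parser. csv-to-xml is among the proposed 4.0 functions. :)
fn:parse-csv(fn:unparsed-text("large.csv")),
fn:csv-to-xml(fn:unparsed-text("large.csv"))
```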
Thanks, Matt, for the summary.
> For myself, I have been thinking about this and wondering if `unparsed-text` (and `unparsed-binary` or whatever fills that need) could be used as input to the `parse-*` (and `json-to-*`, `csv-to-*`, and friends) functions. They ostensibly return `xs:string`, but in my wondering they are somewhat lazy, which permits streaming data as it comes in if the parse functions support that. Currently only `json-doc` is in a published standard; perhaps we could avoid using that as a precedent if there's a way to compose `unparsed-text`/`unparsed-binary` and the parse functions in a way which doesn't require the special-case shim within `json-doc`.
Interestingly, I had similar thoughts in the past: It seemed simple enough to me to combine `fn:unparsed-text`/`file:read-text` and `file:read-binary` with the subsequent parse function to convert heterogeneous input to XML. I also agree that it should be up to the implementation to stream input between functions whenever possible. It was only because of repeated user requests that we added the convenience functions `json:doc`, `csv:doc`, `html:doc`, etc. in BaseX, and I imagine there could have been similar reasons for introducing `fn:json-doc` (I was not involved).
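For reference, the BaseX shortcuts next to roughly equivalent longhand compositions; the equivalences are approximate, as option handling differs in detail:

```xquery
(: BaseX convenience functions and their (approximate) longhand forms. :)
(
  json:doc("data.json"),  (: ~ json:parse(file:read-text("data.json")) :)
  csv:doc("data.csv"),    (: ~ csv:parse(file:read-text("data.csv")) :)
  html:doc("page.html")   (: ~ html:parse(file:read-text("page.html")) :)
)
```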
There was another reason for introducing `json-doc()`: it was a way to bypass the inconvenient fact that `unparsed-text()` rejects files containing non-XML characters.
> There was another reason for introducing `json-doc()`: it was a way to bypass the inconvenient fact that `unparsed-text()` rejects files containing non-XML characters.
In my unchained-by-reality wondering, I imagine an `unparsed-text()` that returns a function. That function takes an argument which specifies what to do with non-XML characters: `parse-json` asks for JSON-style escaping, `parse-csv` asks for something else...

There's a 2-argument form that returns the text, with the second argument specifying what to do with non-XML characters, allowing someone to skip the indirection...
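A purely hypothetical sketch of that idea; none of these signatures or policy names exist, and in the published spec the second argument of `fn:unparsed-text` is an encoding name, not a character policy:

```xquery
(: Purely hypothetical: unparsed-text is imagined to return a function
   whose argument names a policy for characters invalid in XML. :)
(
  (: the indirect form: the parser asks for its preferred escaping :)
  let $reader := fn:unparsed-text("uri.json")
  return fn:parse-json($reader("json-escape")),

  (: the imagined 2-argument shortcut that skips the indirection :)
  fn:parse-csv(fn:unparsed-text("uri.csv", "replace-invalid-chars"))
)
```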