Open ChristianGruen opened 1 year ago
The functions for parsing input have been defined by different people, and the current state is quite inconsistent:

| Function | Parameters |
| --- | --- |
| `fn:parse-xml` | `$value as xs:string?` |
| `fn:doc` | `$href as xs:string?` |
| `fn:parse-json` | `$value as xs:string?, $options as map(*)` |
| `fn:json-doc` | `$href as xs:string?, $options as map(*)` |
| `fn:parse-html` | `$html as union(xs:string, xs:hexBinary, xs:base64Binary)?, $options as map(*)` |
| `fn:parse-csv` | `$csv as xs:string?, $options as map(*)` |

I believe there’s some need to unify the functions, and we could at least:

- introduce an `fn:XYZ-doc($href, $options)` function for each input format (with at least one `encoding` option), and
- restrict the type of the input parameter of `fn:parse-XYZ` to `xs:string?` and always name it `$value`.

And I wonder if we should tag all `fn:XYZ-doc` functions as ·nondeterministic· (if it’s not too late)?
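For illustration, a minimal usage sketch of the proposed shape; `fn:xml-doc` is hypothetical, and the `encoding` options shown for the `*-doc` functions are part of this proposal, not published spec text:

```xquery
(: Sketch only: fn:xml-doc does not exist, and the encoding options are
   this proposal, not published spec text. :)
let $xml  := fn:xml-doc("data.xml", map { "encoding": "UTF-8" })
let $json := fn:json-doc("data.json", map { "encoding": "ISO-8859-1" })
(: ...while every fn:parse-XYZ would uniformly accept $value as xs:string? :)
let $csv  := fn:parse-csv(fn:unparsed-text("data.csv"))
return ($xml, $json, $csv)
```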
Isn't there also `fn:parse-xml-fragment`? So, shall we have two groups of parsing functions: one for docs and one for fragments? Aren't docs also a kind of fragment themselves?

As for the name of the input parameter, it should be obvious that the name "input" is more precise than "value". In fact, "value" seems to be a most generic and useless name: everything can be regarded as the value of something.
> Isn't there also `fn:parse-xml-fragment`?
Yes, there are various functions that I didn’t list here, including `fn:json-to-xml` and the additional CSV functions. I’m not sure if we need a dedicated `fn:doc-fragments` function?
> As for the name of the input parameter, it should be obvious that the name "input" is more precise than "value". In fact, "value" seems to be a most generic and useless name: everything can be regarded as the value of something.
I agree, but this would conflict with the current conventions for naming the functions in the spec. I’ve forgotten where the semantics had been specified; simply put, atomic parameters are called `$value`.
Could rename `doc` to `xml-doc`, and add a new function `doc` that can load any kind of input and detect the kind automatically (e.g. from the HTTP `Content-Type` header).
> Could rename `doc` to `xml-doc`, and add a new function `doc` that can load any kind of input and detect the kind automatically (e.g. from the HTTP `Content-Type` header).
Probably too late, as we cannot change the behavior of existing functions. We could introduce an options parameter to `fn:doc`, but it will be difficult to do justice to everyone, as the exact behavior of the function depends a lot on the implementation (for example, the referenced input can be stored in a database or in the file system).

However, we could introduce an `fn:xml-doc` function with much stricter semantics. Possible options could be (sketched below):

- `encoding`
- `strip-whitespaces`
- `strip-namespaces`
- `parse-dtd`
- `parse-xinclude`
- `catalog` (would be a topic for another issue)
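A call might then look like this; a sketch only, since neither the function nor its option names are finalized:

```xquery
(: Sketch only: neither fn:xml-doc nor these option names are finalized. :)
fn:xml-doc("catalog.xml", map {
  "encoding":          "UTF-8",
  "strip-whitespaces": true(),
  "strip-namespaces":  false(),
  "parse-dtd":         true(),
  "parse-xinclude":    false()
})
```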
See also issue #490
The rationale for allowing `fn:parse-html` to take binary objects is so that it can use the HTML encoding detection/conversion logic, and be compatible with content sent as binary, in which case the user does not need to implement their own encoding detection/decoding logic.
> The rationale for allowing `fn:parse-html` to take binary objects is so that it can use the HTML encoding detection/conversion logic, and be compatible with content sent as binary, in which case the user does not need to implement their own encoding detection/decoding logic.
Hi Reece, I agree that’s a good idea. I believe it would be similarly relevant for XML input, so I think we should either restrict binary input to a new `fn:html-doc` function or also allow binary items for `fn:parse-xml` and (ideally) the other parse functions.
My reservation on restricting encoding to a `fn:*-doc` function is that binary/encoded text could come from other sources: network requests, zipped/compressed files, etc.

Allowing binary items on other parse functions would be useful for a similar reason.
As we’ve currently no standard function that allows us to read binary contents (which could then be processed with `fn:parse-XYZ`), I’ve just updated #557.
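For illustration, assuming the `fn:unparsed-binary` function proposed in #557, the composition could look like this sketch:

```xquery
(: Sketch, assuming the fn:unparsed-binary function proposed in #557. :)
fn:parse-html(fn:unparsed-binary("page.html")),  (: parser applies its own encoding detection :)
fn:parse-xml(fn:unparsed-binary("data.xml"))     (: if binary input were also allowed here :)
```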
Following on from the QTCG meeting of 2023-10-17, I've tried to summarise the discussion about `parse-csv` that ranged much wider. Part of that was discussion of parse functions in general:
There were a lot of questions regarding the scope and naming of parsing functions, and the approaches that had been, or could potentially be, taken.
The two scope approaches were, broadly, several single-purpose/single-output-format functions, or one multi-purpose function whose output format was controlled with an option passed in an options parameter map.
There were specific questions about CSV and why there were two functions proposed that had XDM output instead of one.
`fn:parse-csv`, as proposed, produced very basic output that could be used to build more complex processing on, while `fn:csv-to-xdm` and `fn:csv-to-xml` produced a more generalised, but richer, output that could be processed immediately.
With `fn:parse-json`, `fn:parse-html`, and `fn:parse-xml`, the `parse-*` function returns the immediately useful output.
This confusion suggests to me that when consumption support is added for a new data format, it should include a `parse-*` function that produces immediately useful output; and, if the precedent established by `parse-json` and `json-*` is followed, extra functions should prefix their name with the format, following the `format-verb-output` naming structure used by `json-to-xml` where possible. (I am biased in favour of more functions with limited scope over fewer functions that do more...)
There was discussion about what input parse functions should be able to accept. `json-doc` acts almost, but not exactly, like `parse-json(unparsed-text('uri.json'))`, which was offered as both a potential convention to follow and a confusion/source of proliferation.
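As a sketch of that near-equivalence (the divergence is the one noted in a later comment: `json-doc` copes with characters that are invalid in XML, which `fn:unparsed-text` rejects):

```xquery
(: json-doc and its longhand near-equivalent; they differ, e.g., when the
   JSON contains characters that are not valid in XML. :)
fn:json-doc("uri.json"),
fn:parse-json(fn:unparsed-text("uri.json"))
```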
Since that discussion, further discussion about handling binary data, from the proposal for `fn:unparsed-binary` in #557, has happened there and in this issue's comments.
For myself, I have been thinking about this and wondering if `unparsed-text` (and `unparsed-binary` or whatever fills that need) could be used as input to the `parse-*` (and `json-to-*`, `csv-to-*`, and friends) functions. They ostensibly return `xs:string`, but in my wondering they are somewhat lazy, which permits streaming data as it comes in if the parse functions support that. Currently only `json-doc` is in a published standard; perhaps we could avoid using that as a precedent if there's a way to compose `unparsed-text`/`unparsed-binary` and the parse functions in a way which doesn't require the special-case shim within `json-doc`.
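A sketch of that composition style, under the stated assumption that an implementation may evaluate the reader's result lazily and stream it into the parser:

```xquery
(: Sketch of the composition idea: no new convenience functions needed,
   and an implementation is free to stream text from the reader into the
   parser. csv-to-xml is among the proposed 4.0 functions. :)
fn:parse-csv(fn:unparsed-text("large.csv")),
fn:csv-to-xml(fn:unparsed-text("large.csv"))
```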
Thanks, Matt, for the summary.
> For myself, I have been thinking about this and wondering if `unparsed-text` (and `unparsed-binary` or whatever fills that need) could be used as input to the `parse-*` (and `json-to-*`, `csv-to-*`, and friends) functions. They ostensibly return `xs:string`, but in my wondering they are somewhat lazy, which permits streaming data as it comes in if the parse functions support that. Currently only `json-doc` is in a published standard; perhaps we could avoid using that as a precedent if there's a way to compose `unparsed-text`/`unparsed-binary` and the parse functions in a way which doesn't require the special-case shim within `json-doc`.
Interestingly, I had similar thoughts in the past: It seemed simple enough to me to combine `fn:unparsed-text`/`file:read-text` and `file:read-binary` with the subsequent parse function to convert heterogeneous input to XML. I also agree that it should be up to the implementation to stream input between functions whenever possible. It was only because of repeated user requests that we added the convenience functions `json:doc`, `csv:doc`, `html:doc`, etc. in BaseX, and I imagine there could have been similar reasons for introducing `fn:json-doc` (I was not involved).
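For reference, the BaseX shortcuts next to roughly equivalent longhand compositions; the equivalences are approximate, as option handling differs in detail:

```xquery
(: BaseX convenience functions and their (approximate) longhand forms. :)
(
  json:doc("data.json"),  (: ~ json:parse(file:read-text("data.json")) :)
  csv:doc("data.csv"),    (: ~ csv:parse(file:read-text("data.csv")) :)
  html:doc("page.html")   (: ~ html:parse(file:read-text("page.html")) :)
)
```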
There was another reason for introducing `json-doc()`: it was a way to bypass the inconvenient fact that `unparsed-text()` rejects files containing non-XML characters.
> There was another reason for introducing `json-doc()`: it was a way to bypass the inconvenient fact that `unparsed-text()` rejects files containing non-XML characters.
In my unchained-by-reality wondering, I imagine an `unparsed-text()` that returns a function. That function takes an argument which specifies what to do with non-XML characters: `parse-json` asks for JSON-style escaping, `parse-csv` asks for something else...

There's a 2-argument form that returns the text, with the second argument specifying what to do with non-XML characters, allowing someone to skip the indirection...
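A purely hypothetical sketch of that idea; none of these signatures or policy names exist, and in the published spec the second argument of `fn:unparsed-text` is an encoding name, not a character policy:

```xquery
(: Purely hypothetical: unparsed-text is imagined to return a function
   whose argument names a policy for characters invalid in XML. :)
(
  (: the indirect form: the parser asks for its preferred escaping :)
  let $reader := fn:unparsed-text("uri.json")
  return fn:parse-json($reader("json-escape")),

  (: the imagined 2-argument shortcut that skips the indirection :)
  fn:parse-csv(fn:unparsed-text("uri.csv", "replace-invalid-chars"))
)
```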