ivoa-std / DALI

DALI: Data AccessLayer Interface
Creative Commons Attribution Share Alike 4.0 International
1 stars 5 forks source link

new xtype="json" #33

Open mbtaylor opened 12 months ago

mbtaylor commented 12 months ago

I have heard a few people recently talking about putting JSON into VOTable fields. I can't immediately think of a reason why client code would need to know that the string data it's getting from a table is in JSON format, but it might do, and xtype would seem like the most appropriate place to include this information.

Zarquan commented 12 months ago

Would this be able to handle UTF8 strings embedded in the JSON ? https://www.json.org/json-en.html

"A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes. "

mbtaylor commented 12 months ago

Unicode is a bit of a mess in VOTable anyway, but it's no worse for JSON than for other text content (all of the JSON syntax characters are representable in 7-bit ASCII, i.e. look the same in UTF8 and ASCII). So you could put a JSON string in either a char or unicodeChar column.

msdemlei commented 12 months ago

On Tue, Nov 14, 2023 at 09:03:18AM -0800, Zarquan wrote:

Would this be able to handle UTF8 strings embedded in the JSON ? https://www.json.org/json-en.html

"A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes. "

I'm not sure whether I'd prefer that to finally allowing utf-8 in VOTable char, but sure, why not?

pdowler commented 11 months ago

Generally, the presence of an xtype means "you can parse this primitive value into an ???" where ??? is the xtype value, e.g. point. Here, that comes across as "you can parse this string into a json (object)" which isn't really saying anything about the kind of content one has to grok or not.

Consider xtype="moc" which in the details says "using ascii moc serialisation from document ...". That could have been and maybe someday will be "using json moc format from document ...".

json is a serialisation format, not an extended datatype in it's own right, so I'm against this. It doesn't say enough about the type of the value. I am not against serialising with json if that's the most appropriate way to deliver the structure efficiently.

msdemlei commented 11 months ago

On Mon, Nov 27, 2023 at 10:08:06AM -0800, Patrick Dowler wrote:

point. Here, that comes across as "you can parse this string into a json (object)" which isn't really saying anything about the kind of content one has to grok or not.

Well, it is saying "you can turn this string into something more meaningful by passing it to a json parser". I could argue it's not much different from "turn this array into a circle" without saying whether that circle really is an error circle or an aperture or a region of interest.

Turning the string into a parsed structure, in turn, could let people directly say col[3]["measurements"][20] in a client -- of course, presuming they know what to expect in that column.

I'd hence argue that a json xtype has a fairly clear operational meaning, which is what I like about it. I readily admit that I, too, dislike much else about it, in particular the temptation for data providers to skip metadata declaration and proper relational design.

json is a serialisation format, not an extended datatype in it's own right, so I'm against this. It doesn't say enough about the type of the value. I am not against serialising with json if that's

I, too, would like it better if there were some machine-readable and verifiable way to declare the structure of such a json column. I haven't looked as JSON-schema, but assuming it's not dramatically less powerful than XML schema -- would you be less opposed to this if we required people to attach a JSON schema to a json column?

Full disclosure: The reason I'm after this is that DaCHS users asked for ways to pull json/jsonb-valued columns into TAP responses, and I'm already producing xtype="json" for those, under the reasoning that it's better than betraying nothing about the inner structure of these strings.

I can't say I like it, but there are use cases for that that I cannot readily reject.

molinaro-m commented 11 months ago

-- would you be less opposed to this if we required people to attach a JSON schema to a json column? Full disclosure:

Is there a way to do that? I'd really like if we find a way to allow for a schema attached. I might have an additional constraint, that is JSON schema attached to column AND row. I know, it gets too complicated, but if it can be kept in mind while evolving this...

mbtaylor commented 11 months ago

@pdowler wrote:

Generally, the presence of an xtype means "you can parse this primitive value into an ???" where ??? is the xtype value, e.g. point. Here, that comes across as "you can parse this string into a json (object)" which isn't really saying anything about the kind of content one has to grok or not.

Consider xtype="moc" which in the details says "using ascii moc serialisation from document ...". That could have been and maybe someday will be "using json moc format from document ...".

json is a serialisation format, not an extended datatype in it's own right, so I'm against this. It doesn't say enough about the type of the value. I am not against serialising with json if that's the most appropriate way to deliver the structure efficiently.

I think you could apply all those arguments to URI and UUID, which have been introduced as xtypes in this draft. A URI doesn't say what it's the URI of. Just as you might one day want to represent a MOC in JSON, you might want to represent a MOC using its URI. So do we chuck out xtype="uri" as well?

I would say that @msdemlei 's "you can turn this [string] into something more meaningful by [using its xtype]" is the most useful criterion here.

pdowler commented 11 months ago

My point is that parsing json into an programmatically accessible object is an intermediate representation, much like a DOM with xml, and then the intermediate representation gets pulled apart to create some structured object. That object has a type that would be part of a data model and the dm implementation would likely have a class corresponding to that type. eg xtype='uri" -> java.net.URI in my code. Sure, xtype="uri" allows for many different things (ivorn and url would be more restricted), but it's common in a DM to have a field of type "uri".

Yes, you can parse a string into a JSONObject or a DOM... I just don't see that as "more meaningful" - it is still opaque unless you know what structure it is trying to convey.

msdemlei commented 11 months ago

On Thu, Nov 30, 2023 at 09:28:11AM -0800, Patrick Dowler wrote:

Yes, you can parse a string into a JSONObject or a DOM... I just don't see that as "more meaningful" - it is still opaque unless you

Well, being able to write col[3]["fluxes"][-1] in your code rather than json.loads(col[3])["fluxes"][-1] very arguably is at least a convenience. Whether that's already something in the way of "meaning" is... a question of what the meaning of meaning is, and I'd like to leave that question open.

More pragmatically, I have these people with jsonb columns in their database, and they publish these things through TAP (or, let me hedge here: at least they want to). So, what should I do?

If I just write char[], I feel I'm concealing a very* relevent fact about the column -- that it's parseable -- from the client.

Should I say "this is json" using a UCD? That feels even more wrong than an xtype.

Should I tell them to just not push out the jsonb stuff and unroll it through a view? That's an option, but I think in this case that's rather unattractive because none of what is in the JSON should be available for SQL constraints and operators, and of course I'm a coward and would like to avoid an ugly discussion when you have a four-level nested structure in this column.

Should I do a in the FIELD? Uh, I can already see the pitchforks in front of my office...

So... what would you do when someone came to you with a table with jsonb columns?

molinaro-m commented 11 months ago

I copy @msdemlei last question. How do I make a little more interoperable JSONB objects stores in a DB field?

pdowler commented 11 months ago

OK, that's fair. char and * are even less useful. Let me ask a leading question...

Right now we have xtype="moc" and the serialization is defined as "ascii moc from REC-MOC-2.0". Let's say a future version of the MOC standard defines "json moc" serialization and people want to put that into a VOTable. How would we handle that? xtype="jsonmoc"? xtype="jmoc"? (shudder) ... xtype="json:moc"? There are two pieces of info to convey and I kind of like the latter one.

I kind of think of xtype as a URN anyway and if we thought about having something like xtype="json:moc" in future, we could then also say that xtype="json" is just a less specific base... not super expressive but more expressive than nothing. We could still encourage people to try to say more with xtype="json:foo". We would also be able to handle different serializations of types without having to make up arbitrary compound words. I wouldn't want this to turn into UCD with rules about primary and qualifiers and such... still just strings but with a style that trivially conveys multiple bits of info. In that context, to me "json:moc" looks better than "moc:json" because that's how I expect people to say/read it.

thoughts?

msdemlei commented 11 months ago

On Fri, Dec 01, 2023 at 09:05:44AM -0800, Patrick Dowler wrote:

Right now we have xtype="moc" and the serialization is defined as "ascii moc from REC-MOC-2.0". Let's say a future version of the MOC standard defines "json moc" serialization and people want to put that into a VOTable. How would we handle that? xtype="jsonmoc"?

I'd first say let's not invent more ways to say the same thing.

encourage people to try to say more with xtype="json:foo". We would also be able to handle different serializations of types without

Oh, please, let's just not have (more) things with more than one serialisation in VOTable at all; if there's one way to say it, that's enough. I think there is never (+/- epsilon) a benefit to supporting multiple serialisations that outweighs the pains in interoperability and implementability (tabledata vs. binary2 may be in that epsilon).

As I said above, I do see a benefit in being able to somehow inform a client about the inner structure of a json-serialised field (perhaps JSON-schema, perhaps MIVOT, perhaps still something else) -- but I'd argue very strongly against trying to shoehorn this into an xtype string. These, I'd say, should be simple, atomic and hence unparseable.

pdowler commented 11 months ago

I'm not saying it's a good idea to have mutliple serializations (ascii and json, for example), just that xtype="json" by itself is not saying what kind of structure is there, just the syntax of the serialization. If we want to also use xtype to convey syntax then it opens up that can or worms... I'm against conveying syntax because to me it's very low (admittedly non-zero) value.

If someone wants to convey a some json, they surely have a specific structure in mind. They should use an experimental xtype that denotes the structure, e.g. xtype="proto:foo" and document (tap_schema.columns.description) what the syntax is. That's what I would do in that case.

msdemlei commented 11 months ago

On Tue, Dec 05, 2023 at 09:11:30AM -0800, Patrick Dowler wrote:

(ascii and json, for example), just that xtype="json" by itself is not saying what kind of structure is there, just the syntax of the

I suppose our dissent basically is whether knowing something is JSON is already useful enough by itself. I'm not sure what a good way to settle that dissent is... can I argue that there is a media type for json, too, and that doesn't have a defined way to convey the structure, either?

If someone wants to convey a some json, they surely have a specific structure in mind. They should use an experimental xtype that denotes the structure, e.g. xtype="proto:foo" and document (tap_schema.columns.description) what the syntax is. That's what I would do in that case.

Oh, that may be another point of dissent: Frankly, I believe that the overwhelming majority of what people will stuff into jsonb columns is going to be fairly adhoc (which is why I'm not happy about the concept either -- but since it's there, I think we need to do something about it). This means that having two different json columns with the same structure/schema will be the huge exception, and hence we'll basically have an xtype string per json column.

If that assessment is about right, the plan to define the structure in the xtype string in some way is a non-starter. No client will ever bother to learn such one-shot xtypes.

If we really want a structure definition for json columns, it'll have to be machine-readable and sit outside of the xtype.

Do we need such a mechanism to make a json xtype useful? Well, that's back to dissent 1.

pdowler commented 11 months ago

You could be right. I feel like I only know half of a use case, but maybe it really is ad-hoc and that is the whole use case.

Zarquan commented 11 months ago

Hypothetical use case - a machine learning pipeline needs to add attributes to their data that contain the parameters used to train the algorithm. Their application is written in Python and their data scientists are familiar with reading and writing JSON fragments so saving the parameters as JSON strings makes sense for their project.

One of the steps in their pipeline adds the training and validation parameters for their ML algorithm to the data. They then want to upload this data to a TAP service, cross match it with the remote dataset and then pass the results on to the training and validation steps of their pipeline.

Markus is right, the descriptions for the columns will be specific to this pipeline and the project development team are unlikely to want to add complicated code to their client to add metadata that describes something that their team already knows about.

As far as they are concerned, the only things that need to understand the JSON content of the columns are the pipeline stage that adds them and the pipeline stage that reads them. As far as the rest of the world is concerned, they are just strings.

Pat's suggestion of xtype="proto:foo" provides part of the solution, it distinguishes between IVOA core xtypes (no prefix) and application specific xtypes (prefix:), but it doesn't provide a mechanism for describing the type.

Embedding a JSON schema for the column in the VOTable header would be limited to this VOTable instance; it would not survive the TAP upload, JOIN query and download.

Equally, embedding a JSON schema for the column in TAP schema would be linked to an originating TAP service, so it couldn't be used to describe columns added by other software e.g. the ML pipeline.

From Markus's comment:

If that assessment is about right, the plan to define the structure in the xtype string in some way is a non-starter. No client will ever bother to learn such one-shot xtypes.

If we really want a structure definition for json columns, it'll have to be machine-readable and sit outside of the xtype.

Do we need such a mechanism to make a json xtype useful? Well, that's back to dissent 1.

In this example I don't see anything that would justify a machine readable description of the columns.

The ML project itself just needs a unique string to identify the columns, which could either be column name or a simple xtype.

To the outside world, the data type defines them as strings, which is enough to transport the VOTable and read the fields.

No one outside the ML project is going to write code that reads the xtype metadata for arbitrary JSON columns and dynamically generate something that unpacks the JSON and interprets it. To do what ? The only software that can realistically use the content of the JSON columns will be written by members of the ML project, who already know what is in them.

Equally, extending the IVOA data models to include things like 'machine learning training parameters' is way out scope.

However, I would argue there is a case for providing some level of descriptive metadata for users outside the original project. Someone who received a VOTable of results that were generated by the ML project, doesn't necessarily know where the data originated from (result of a query on another service perhaps), and wants to understand what is in the extra fields.

In which case xtype="http://www.project.org/types/foo" might be a useful pattern to promote.

Our standard could say :

Identifiers for standard DALI xtypes do not have prefixes. Identifier for user defined xtypes should be a valid URI. Preferably, a resolvable URL that points to a resource that describes the type. The simple form, xtype="project:foo", is sufficient for a quick prototype, but deployed production systems should use a resolvable URL e.g. xtype="http://www.project.org/types/foo". Where "http://www.project.org/types/foo" points to a resource that describes the data content and serialization format.

In our ML pipline example, they could use URLs that point to pages on their own website to describe the training and validation parameters:

Everything else can treat the xtypes as opaque strings, no different to project:training-param and project:validation-param.

Using resolvable URLs to human readable descriptions provides a low cost way for the ML project to explain to external users what their special columns contain.

Using URLs of pages on their own project website is an easy way to ensure they are globally unique, and the recipient doesn't need any special software to resolve them, just a normal web browser.

msdemlei commented 11 months ago

On Wed, Dec 13, 2023 at 11:25:47AM -0800, Zarquan wrote:

Do we need such a mechanism to make a json xtype useful? Well, that's back to dissent 1.

In this example I don't see anything that would justify a machine readable description of the columns.

No, certainly not; in your example, a group communicates with itself, using custom software. For them, the column name is plenty good, and they clearly don't need any xtype at all -- which is why this example perhaps isn't ideal to figure out what a useful feature for rather different case of communication between people (and their machines) that don't have any contract between them except for our standards.

However, I would argue there is a case for providing some level of descriptive metadata for users outside the original project.

Perhaps -- but please let's not use xtype for that. xtype is machine-readable: depending on what's there, a VOTable parser produces polygon instances, or timestamps, or perhaps lists and dictionaries (with a json xtype) from some less structured literal.

When instead xtype suddenly contains links to human-readable documentation, as in...

In which case xtype="http://www.project.org/types/foo" might be a useful pattern to promote.

...that will make the parsers' lives a lot more difficult.

Associating longer, human-readable documentation to a column, meanwhile, is a use case I've been after for a long, long time (in the form for "table footnotes") -- I think your use case could fit in there, too.

Although:

Using URLs of pages on their own project website is an easy way to ensure they are globally unique, and the recipient doesn't need any special software to resolve them, just a normal web browser.

...from a curation standpoint (and in particular when I hear "ML", which still has undertones of move fast and break things) I'd perhaps tend to discourage linking such longer pieces of documentation, because when people revisit such a VOTable a year later, chances are the URL will dereference to a 404.

DaCHS has been using GROUP-s with FIELDref-s for table notes for a long time now; see https://dc.zah.uni-heidelberg.de/arihip/q/cone/scs.xml?RA=0&DEC=0&SR=1 for an example, look for "note-" in the result.

I'd be totally in on standardising that (plus perhaps an ad-hoc utype to identify the GROUP-s as table footnotes) for this kind of use case.

But that's a different bug against (probably) a different standard, I think.

Zarquan commented 11 months ago

different case of communication between people (and their machines) that don't have any contract between them except for our standards.

This example is about communication between people that don't have any direct contract between them except for what is in this one VOTable.

xtype is machine-readable: depending on what's there, a VOTable parser produces polygon instances, or timestamps, or ....

... or application specific JSON encoded data like the ML parameters in this example.

This doesn't exclude simply putting xtype="json". This is for where we want to provide a mechanism for telling the recipient a bit more about the application specific content than just the serialization.

When instead xtype suddenly contains links to human-readable documentation ... that will make the parsers' lives a lot more difficult.

Objectively, why is this more complex than Pat's suggestion of prefix:foo?

The client software does not need to parse the URL, it just treats it as an opaque string.

if (xtype == "point")
    // Create an IVOA Point object
else if (xtype == "prefix:foo")
    // Create an application specific Foo object
else if (xtype == "http://www.project.org/types/bar")
    // Create an application specific Bar object
else
    // Just treat it as a sring.

The URL is simply there to provide placeholder for the data scientist who added the extra columns to communicate some additional information to someone in the future.

A client that doesn't know what a 'Foo' or 'Bar' is would just treat the column as a String.

if (xtype == "point")
    // Create an IVOA Point object
else
    // Just treat it as a sring.

.. I'd perhaps tend to discourage linking such longer pieces of documentation, because when people revisit such a VOTable a year later, chances are the URL will dereference to a 404.

Yep, I agree. It isn't ideal. Let's say it only has a 10% chance of resolving, but that is 10% more than prefix:foo.

DaCHS has been using GROUP-s with FIELDref-s for table notes for a long time now;

This is really useful stuff, but how much of it would survive a TAP upload, JOIN query and download via someone else's service ?

I'd be totally in on standardising that (plus perhaps an ad-hoc utype > to identify the GROUP-s as table footnotes) for this kind of use case.

As your 'move fast and break things' comment suggests, some ML data scientists may indeed be less rigorous about engineering best practices.

They may be in a hurry to get things done, so our advice needs to be as easy to use as possible:

It guarantees uniqueness and gives them a way to link to extra information if they want to.

It just provides the mechanism. How well they use it is up to them.

msdemlei commented 11 months ago

On Thu, Dec 14, 2023 at 06:56:05AM -0800, Zarquan wrote:

When instead xtype suddenly contains links to human-readable documentation ... that will make the parsers' lives a lot more difficult.

Objectively, why is this more complex than Pat's suggestion of prefix:foo?

I give you it's not worse, but I didn't like Pat's prefix:foo any better on the same grounds: you cannot put that into general-purpose VOTable parsers. You...

The client software does not need to parse the URL, it just treats it as an opaque string.

if (xtype == "point")
    // Create an IVOA Point object
else if (xtype == "prefix:foo")
    // Create an application specific Foo object
else if (xtype == "http://www.project.org/types/bar")
    // Create an application specific Bar object
else
    // Just treat it as a sring.

...the xtype processing is in the VOTable parser, and as a library/program user you don't notice any of that. You just get a (more) useful object, that's it; with xtype=json you'd just get a parsed dict or list and wouldn't have to call your environment's equivalent of json.loads yourself.

Which means that code would need to sit in, say, STIL or astropy.votable, or gavo.votable. If we try to convince the maintainers of these code bases to write some "else if", I think we should be able to come up with a proper xtype string -- and the chances that they'll take up some random group's convention are slim.

Let's not even imply a group could ask for having such a "custom" xtype. I certainly don't want to have to discuss whether such code shouldn't go into, say, astropy after all.

The URL is simply there to provide placeholder for the data scientist who added the extra columns to communicate some additional information to someone in the future.

Let's not mix human-readable documentation with machine-readable metadata if we can help it.

DaCHS has been using GROUP-s with FIELDref-s for table notes for a long time now;

This is really useful stuff, but how much of it would survive a TAP upload, JOIN query and download via someone else's service ?

Let's first standardise it. Once that standard is there, it's not terribly hard to pull a bit of extra column metadata through a TAP upload at at least the level of an xtype (which means: if you write 4foobar, it is likely that the engine will discard any footnotes foo or bar may have had, but that's a reasonable thing, and you'd almost certainly would not pull through xtypes of either foo or bar in that situation either).

Zarquan commented 11 months ago

Which means that code would need to sit in, say, STIL or astropy.votable, or gavo.votable ....

It would be simple to create an extendable parser that would accept handlers for additional xtypes.

The idea is that the list of xtypes is not fixed, and projects can add their own custom types later.

Let's not mix human-readable documentation with machine-readable metadata if we can help it.

Isn't your notes field human readable ?

Getting back on subject.

The initial suggestion was xtype="json" would return a parsed JSON object, avoiding the need to call json.loads yourself.

Yep, OK with that.

Then we started asking, what if the JSON object represented an application specific complex Object? Could we represents that as xtype="json:foo", which mixes the serialization and object class.

Possibly not so good, because anything that could handle an application specific Foo object would probably already know that it should be serialised in JSON. So in reality, it doesn't gain us that much more than xtype="foo".

Would it be better to recommend a URI structure of xtype="prefix:foo", where anything that matched the pattern [a-z]*:.* would be considered an application specific xtype.

It works, but it misses an opportunity to say something about what the application specific Foo object is.

If we use a URL rather than just a URI, it gives the user that adds the column the opportunity to point to more metadata.

This doesn't exclude adding the notes in the FIELD metadata as you suggest.

It just says if we are going to have application specific URIs, we should recommend that they use a URL that points to something.

Zarquan commented 11 months ago

If we are suggesting that xtype="json" is useful because it avoids the need to call json.loads.

We should also have xtype="yaml" that parses the field as YAML and returns a similar object without having to load the equivalent YAML parsing libraries.

As part of the group looking at updating our standards to move away from XML, I suggest that whenever we add something to handle JSON, we should also add the same functionality to handle YAML.

This doesn't add much more complexity, but it does make us think about making our standards serialization agnostic, rather than just moving from one fixed serialization to another.

msdemlei commented 11 months ago

On Fri, Dec 15, 2023 at 04:10:21AM -0800, Zarquan wrote:

Isn't your notes field human readable ?

Sure -- as is DESCRIPTION, say. I'm not arguing against human-readable metadata in VOTable at all, just against using parts of metadata so far intended for machine consumption to somehow sneak in human-readable parts that a machine simply can't handle.

Possibly not so good, because anything that could handle an application specific Foo object would probably already know that it should be serialised in JSON. So in reality, it doesn't gain us that much more than xtype="foo".

I'd look less at URIs here and more at media types that have a similar thing, where, for instance, the media type application/x-desise+json that we use for our custom vocabulary format still indicate what's coming back is JSON even to clients that don't know Desise.

I'd not struggle against such a thing very hard, but I'd say: as long as we don't have anyone who actually wants to write code that would profit from this kind of thing, let's postpone it. True, saying "discard anything in front of a plus in xtype" would slightly improve backwards compatibility for such changes, but not so decisively that I'd find speculative features justified.

If we use a URL rather than just a URI, it gives the user that adds the column the opportunity to point to more metadata.

-- but then the client has no way to work out it's json in the first place, and that's a loss, no?

msdemlei commented 11 months ago

On Fri, Dec 15, 2023 at 04:26:18AM -0800, Zarquan wrote:

If we are suggesting that xtype="json" is useful because it avoids the need to call json.loads.

We should also have xtype="yaml" that parses the field as YAML and returns a similar object without having to load the equivalent YAML parsing libraries.

Well, we could, but

(a) nobody's asking us for that, (b) there's no database column type named "yaml" (though admittedly you could serialise the content of jsonb columns as yaml -- but why would you do that?) (c) the benefits of another "complex data in a column" type seem rather limited to me -- don't forget, parsers (more or less) have to implement this, and I think they won't appreciate a stance of "we don't want to lock in our data producers, so see how you consumers cope". (d) yaml support, while widespread, still is a lot more limited that json support (e.g., no module in the Python standard library), and (e) yaml uses whitespace for grammar. In theory, VOTable preserves whitespace (by virtue of TD being xs:string even in TABLEDATA). In practice, given XML prettyprinters and all that, I'd rather not bet on it.

So... I find json borderline already (as I said above: it's an open invitation to skip sound data design and metadata declaration). Yaml, however, at this point even lacks the pragmatic points that made me decide json is somewhat on the "accept" side of the border.

Zarquan commented 11 months ago

To keep this issue on track, I've created a separate issue for the YAML discussion, see #35.

(a) nobody's asking us for that,

I'm asking for it, because it makes us think outside the box.

The rest of the discussion in #35.

Zarquan commented 11 months ago

it's an open invitation to skip sound data design and metadata declaration

Using a URL in the xtype allows the user to link to a schema, providing a method for including data design and metadata declaration ?