ivoa-std / VOTable

VOTable Format Definition

Enable generic FIELD/PARAM metadata #29

Open mbtaylor opened 2 years ago

mbtaylor commented 2 years ago

There is a discussion at https://github.com/ivoa-std/DataLink/issues/51 about how to label PARAM elements as required or optional, as well as indicating their cardinality. Various more or less unsatisfactory possibilities (ab)using existing VOTable features have been suggested (add new value "optional" to the existing "type"/"location"/"trigger" vocabulary of the rarely-used @type attribute; use some syntax within the @utype attribute; use LINK with some RDF semantics; do it using GROUPs). Maybe some of these solutions could do the job, but they are really not intuitive and not using the VOTable elements in the way that they were intended.

This sort of thing (wanting to transmit column metadata that does not fit well into the existing VOTable schema) has cropped up several times in the past; another example is wanting to record the HEALPix Order of a column containing a HEALPix index (http://mail.ivoa.net/pipermail/apps/2016-August/001131.html). It may also occur in other communities wanting to re-use VOTable; e.g. the Cluster Science Archive at ESA is now using TAP and hence VOTable; their columns have metadata labelled "CAIO ATTRIBUTE" (see https://www.cosmos.esa.int/web/csa-guide/tap-tables-and-views) which doesn't fit anywhere in the VOTable metadata model.

GROUPs can probably be made to do these jobs, but the result is verbose, error-prone, and hard to read.

I suggested a long time ago (2009? but I can't find a record of it) the possibility to allow generic per-column metadata in VOTable, but it didn't seem to be popular then; a decade on I'm going to have another go.

So: I suggest the option to include arbitrary key-value pairs within the FIELD element (and hence PARAM which derives from it). We could define a new element for this, say like:

   <FIELD name="healpix_id" datatype="int">
     <META key="healpix_order" value="8"/>
   </FIELD>

or we could re-use the existing INFO (or even PARAM) element:

  <FIELD name="healpix_id" datatype="int">
    <INFO name="healpix_order" value="8"/>
  </FIELD>

VOTable itself would not impose any constraints on the key content beyond declaring it an xs:token or similar; but the general mechanism would be available for other standards or private conventions to use where required. I believe this is a fairly low-impact modification to the VOTable schema that could provide an easy way forward for https://github.com/ivoa-std/DataLink/issues/51 (add to relevant PARAMs a META child with key="required" or "cardinality" or even something PDL-related) and to other similar items we're likely to come up against in the future. Note this approach has some things in common with the Extensible Vocabulary idea used (I claim successfully) in SAMP.
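To make the proposal concrete, here is a minimal sketch of how a client might collect such key-value pairs. It assumes the hypothetical META child element with `key` and `value` attributes suggested above, which is not part of any released VOTable schema. The function name `field_metadata` and the second `flux` column are illustrative only, and the fragment omits the XML namespace that a real VOTable document carries.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: META is the element proposed in this issue,
# not something the current VOTable schema defines.
FRAGMENT = """
<TABLE>
  <FIELD name="healpix_id" datatype="int">
    <META key="healpix_order" value="8"/>
  </FIELD>
  <FIELD name="flux" datatype="float"/>
</TABLE>
"""


def field_metadata(table_xml):
    """Map each FIELD/PARAM name to a dict of its META key/value pairs."""
    root = ET.fromstring(table_xml)
    result = {}
    for elem in root.iter():
        if elem.tag in ("FIELD", "PARAM"):
            # Keys are treated as opaque tokens; no vocabulary is enforced.
            result[elem.get("name")] = {
                meta.get("key"): meta.get("value")
                for meta in elem.findall("META")
            }
    return result


metadata = field_metadata(FRAGMENT)
print(metadata["healpix_id"].get("healpix_order"))
```

A client that does not know a given key can simply ignore it, which is the property that makes the mechanism reusable by other standards or private conventions.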

msdemlei commented 2 years ago

On Wed, Oct 20, 2021 at 03:53:36AM -0700, Mark Taylor wrote:

>    <FIELD name="healpix_id" datatype="int">
>      <META key="healpix_order" value="8"/>
>    </FIELD>
>
> or we could re-use the existing INFO (or even PARAM) element:
>
>   <FIELD name="healpix_id" datatype="int">
>     <INFO name="healpix_order" value="8"/>
>   </FIELD>

I'd certainly not struggle against this if it were to be put into VOTable (with a slight preference for META), but given that using LINK with action="rdf" essentially does the same thing without extra elements or changes to the VOTable standard (and somewhat meshes into the whole Linked Data thing): what exactly makes you dislike it? Do you worry that existing clients will mis-interpret it? Is it that you think LINK is inappropriate for conveying metadata in general?

Also, if we touch VOTable... Well, I don't want to confuse the situation, but shouldn't we then rather go for RDFa altogether?

mbtaylor commented 2 years ago

LINK is rather under-documented. It has 8 attributes, all optional, and it's not obvious to me why you'd write "<LINK action='rdf' content-role='#mandatory'/>" rather than for instance "<LINK content-role='rdf' title='mandatory'/>" or something else (the latter looks more in line with the content-role values documented in VOTable Sec 3.7). What the LINK documentation does say is that it's supposed to "provide pointers to external resources through a URI", which I don't think is what's happening here (it has no href, or indeed gref).

I'm sure you could come up with a way to do it, but if so I'd argue for documenting that, in the VOTable document, as a standard way to represent miscellaneous key/value metadata. If that can be done in a way that's easy to re-use for the next time we come up against this sort of thing, I'll be happy enough.

I admit to not understanding RDF(a) well enough to know what benefits we'd get from tying into that (somewhat complicated) standard.

msdemlei commented 2 years ago

On Wed, Oct 20, 2021 at 09:09:17AM -0700, Mark Taylor wrote:

> LINK is rather under-documented. It has 8 attributes, all optional, and it's not obvious to me why you'd write "<LINK action='rdf' content-role='#mandatory'/>" rather than for instance "<LINK content-role='rdf' title='mandatory'/>" or something else (the latter looks more in line with the content-role values documented in VOTable Sec 3.7). What the LINK documentation does

I proposed using the action attribute to discourage clients from trying to do anything clickable with the links. And content-role almost sounds like "property", so that is what I'd want to use for keeping the triples.

> say is that it's supposed to "provide pointers to external resources through a URI", which I don't think is what's happening here (it has no href, or indeed gref).

Don't ask me what gref is, but href you'd use in my scheme when the object actually is some non-immediate resource; perhaps:

<PARAM datatype="char" arraysize="*" name="object_id">
  <LINK action="rdf"
    content-role="#admitted-values-csv"
    href="http://example.org/our-project-s-object-ids.csv"/>
  <LINK action="rdf"
    content-role="#multiplicity"
    href="http://www.ivoa.net/rdf/pdl#repeatable-1-n"/>
</PARAM>

(where one references a file, the other an RDF term; for RDF that's fine, whether clients would like that I can't tell).
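As a rough illustration of how a client could read this scheme, the sketch below pulls (subject, property, object) triples out of the PARAM shown above. Everything about the interpretation is an assumption for illustration: using the PARAM name as the subject, taking content-role as the property, and falling back to the LINK `value` attribute when there is no href are choices made here, not anything the VOTable standard prescribes. The function name `rdf_links` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The PARAM from the comment above, without an XML namespace.
PARAM_XML = """
<PARAM datatype="char" arraysize="*" name="object_id">
  <LINK action="rdf"
    content-role="#admitted-values-csv"
    href="http://example.org/our-project-s-object-ids.csv"/>
  <LINK action="rdf"
    content-role="#multiplicity"
    href="http://www.ivoa.net/rdf/pdl#repeatable-1-n"/>
</PARAM>
"""


def rdf_links(param_xml):
    """Return (subject, property, object) tuples for LINKs with action="rdf"."""
    param = ET.fromstring(param_xml)
    subject = param.get("name")  # assumption: the PARAM name acts as subject
    triples = []
    for link in param.findall("LINK"):
        if link.get("action") != "rdf":
            continue  # ordinary LINKs are left for normal link handling
        # Assumption: href holds a resource object, value an immediate one.
        obj = link.get("href") or link.get("value")
        triples.append((subject, link.get("content-role"), obj))
    return triples


triples = rdf_links(PARAM_XML)
for triple in triples:
    print(triple)
```

Filtering on action="rdf" mirrors the stated intent of keeping clients from treating these LINKs as something clickable.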

But I grant you that all of that is arguable at best, and certainly not terribly compelling.

Still, I'd say rather than let LINK rot on, simply claiming it for a clear use case is not totally unreasonable. But really, META would work for me as well.

> I'm sure you could come up with a way to do it, but if so I'd argue for documenting that, in the VOTable document, as a standard way to represent miscellaneous key/value metadata. If that can be done in a way that's easy to re-use for the next time we come up against this sort of thing, I'll be happy enough.

Sure: If we go this way, this should be mentioned in the next release of VOTable. But we can start prototyping immediately, and we can start writing a note explaining what we're doing (for possible later inclusion into datalink) independently of either VOTable's or Datalink's release schedules.

> I admit to not understanding RDF(a) well enough to know what benefits we'd get from tying into that (somewhat complicated) standard.

Roughly: using RDFa makes our lives a bit harder. But we can potentially re-use external RDFa tools. For DALI examples (where we're using XHTML+RDFa), that's kind of cool, as people can use RDF validators to see what semantics they're conveying.

For "normal" VOTables, the benefit is probably a good deal smaller, because the holy grail (getting triples out of the data content) would require so many extra conventions that I don't think any standard RDFa tool would even remotely do the right thing.

On the other hand, from a marketing perspective being able to claim "our container format is ready for Linked Data by virtue of RDFa" might help when trying to infatuate the funding agencies.

mbtaylor commented 2 years ago

> On the other hand, from a marketing perspective being able to claim "our container format is ready for Linked Data by virtue of RDFa" might help when trying to infatuate the funding agencies.

I'm not saying your cynicism is necessarily out of whack with reality here, but I'm a bit shocked to hear politics-over-engineering-quality arguments coming from you. In any case my guess is we'll be disappointed if we base our decisions on the expectation of funding bodies packed with RDFa enthusiasts, so let's not use this as a consideration.

msdemlei commented 2 years ago

On Thu, Oct 21, 2021 at 01:41:18AM -0700, Mark Taylor wrote:

> politics-over-engineering-quality arguments coming from you. In any case my guess is we'll be disappointed if we base our decisions on the expectation of funding bodies packed with RDFa enthusiasts, so let's not use this as a consideration.

Well... there is a chance that "Linked Data" might re-emerge as the next or next-but-one hype, but as I said, VOTable isn't really a terribly good match for RDFa anyway.

That digression aside, a DaCHS operator asked about having fields pre-set when using the XSL for formatting datalink (https://github.com/msdemlei/datalink-xslt). Since we can't use value for that in datalink (value-ed PARAMs are constant in datalink), I thought this might be a nice exercise, and I just tried out how this would come out with LINK.

The result is commit a4e5776b to the datalink XSLT; there's no public service using that yet, but a test service produces Datalink inputParams like this:

<PARAM datatype="int" name="Scenario" ucd="meta" value="">
  <DESCRIPTION>Number of the wanted scenario - 1: climavrEUV, ...
  </DESCRIPTION>
  <VALUES>
    <MIN value="1"/>
    <MAX value="33"/>
  </VALUES>
  <LINK action="rdf" content-role="#pre-set" value="24"/>
</PARAM>

-- I still think that's a reasonable start, and I'd be happy co-authoring a note exploring how far this can be taken before it turns really ugly.
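For illustration only, here is a sketch of how a form-building client might consume such a pre-set. It is not the logic of the datalink XSLT commit mentioned above. The function name `preset_value` is hypothetical, the range check against VALUES/MIN and VALUES/MAX is an added assumption about sensible client behaviour, and the DESCRIPTION text is abbreviated.

```python
import xml.etree.ElementTree as ET

# Input parameter modelled on the test-service output quoted above.
PARAM_XML = """
<PARAM datatype="int" name="Scenario" ucd="meta" value="">
  <DESCRIPTION>Number of the wanted scenario</DESCRIPTION>
  <VALUES>
    <MIN value="1"/>
    <MAX value="33"/>
  </VALUES>
  <LINK action="rdf" content-role="#pre-set" value="24"/>
</PARAM>
"""


def preset_value(param_xml):
    """Return the integer pre-set for a PARAM, or None if absent or out of range."""
    param = ET.fromstring(param_xml)
    preset = None
    for link in param.findall("LINK"):
        if link.get("action") == "rdf" and link.get("content-role") == "#pre-set":
            preset = int(link.get("value"))
            break
    if preset is None:
        return None
    # Assumption: a pre-set outside the declared range is ignored, not trusted.
    low = param.find("VALUES/MIN")
    high = param.find("VALUES/MAX")
    if low is not None and preset < int(low.get("value")):
        return None
    if high is not None and preset > int(high.get("value")):
        return None
    return preset


print(preset_value(PARAM_XML))
```

Keeping the pre-set in a LINK leaves the PARAM's own value empty, so the parameter stays user-settable under the datalink rule that PARAMs with a value are constant.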