ivoa / dm-usecases

This repo gathers all the material to be used in the DM workshop 2020
The Unlicense

Minimal provenance in VOTable output #37

Open gilleslandais opened 3 years ago

gilleslandais commented 3 years ago

For VizieR, it would be really appreciated (not to say required) to have a common way to provide minimal origin information.

The Mango VizieR prototype uses the "associatedData" dock to link a remote URL which contains a "complete" VOProvenance. I would like to add another, concise provenance output in the VOTable (for "naive" clients).

The minimal provenance for a VOTable is: author + year of publication, plus the DOI or bibcode of the reference article. In DatasetDM, I didn't see a clear distinction between creator and author. Markus, do you have an example of this serialization in your output?

I would also like more, if it is possible in a concise serialization: a short annotation specifying the origin of a measure, e.g. the filter configuration, giving the curator and a URL. Any ideas?
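
As a rough illustration of the minimal origin information listed above (author, year of publication, DOI or bibcode), here is a Python sketch. The `MinimalOrigin` class, its field names, and the two regular expressions are assumptions made for this example; they are not part of DatasetDM or of any VizieR output. The patterns only check the general shape of a DOI or of a 19-character bibcode, not whether the reference exists.

```python
import re
from dataclasses import dataclass

# Loose shape checks only (assumptions for this sketch).
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")        # "10.<registrant>/<suffix>"
BIBCODE_RE = re.compile(r"^\d{4}.{14}[A-Z.:]$")  # 19 characters, starting with the year


def classify_reference(ref: str) -> str:
    """Return 'doi', 'bibcode', or 'unknown' for a reference string."""
    if DOI_RE.match(ref):
        return "doi"
    if BIBCODE_RE.match(ref):
        return "bibcode"
    return "unknown"


@dataclass(frozen=True)
class MinimalOrigin:
    """Hypothetical container for the minimal origin information."""
    author: str
    year: int
    reference: str  # DOI or bibcode of the reference article

    def __post_init__(self):
        # Reject anything that looks like neither a DOI nor a bibcode.
        if classify_reference(self.reference) == "unknown":
            raise ValueError(f"not a DOI or bibcode: {self.reference!r}")

    def short_label(self) -> str:
        """Concise 'author (year)' label a naive client could display."""
        return f"{self.author} ({self.year})"
```

Both identifiers used with this sketch should be treated as placeholders; for instance `MinimalOrigin("Doe", 2021, "10.1234/example")` uses a made-up DOI.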

msdemlei commented 3 years ago

On Fri, Apr 16, 2021 at 10:22:24AM -0700, gilleslandais wrote:

For VizieR, it would be really appreciated (not to say required) to have a common way to provide minimal origin information.

The Mango VizieR prototype uses the "associatedData" dock to link a remote URL which contains a "complete" VOProvenance. I would like to add another, concise provenance output in the VOTable (for "naive" clients).

The minimal provenance for a VOTable is: author + year of publication, plus the DOI or bibcode of the reference article. In DatasetDM, I didn't see a clear distinction between creator and author. Markus, do you have an example of this serialization in your output?

No. But now that you mention it, what might actually be smart is to sync VOResource's "implicit" (there's no VO-DML (yet?)) data model with DatasetDM and friends.

I'm not 100% sure I'd like to see a lot of VOResource in instance documents (as usual: What should clients do with it?), and we'd have to think about whether there ought to be a single "resource" DM or whether some of the types could become DMs of their own. But whatever the result of these considerations: if DM deals with Registry content, we should make sure there are no unnecessary inconsistencies.

I would also like more, if it is possible in a concise serialization: a short annotation specifying the origin of a measure, e.g. the filter configuration, giving the curator and a URL. Any ideas?

Well, I've been advocating in-VOTable provenance forever, and this is a nice example. While I'm too lazy to properly re-read the ProvDM docs (or its VO-DML), this would basically look like this in my annotation:

<TEMPLATES>
  <INSTANCE dmtype="prov:Agent" id="fred">
    <ATTRIBUTE dmrole="name" value="Fred Hoyle"/>
    <ATTRIBUTE dmrole="affiliation" value="University of Cambridge"/>
  </INSTANCE>

  <INSTANCE dmtype="prov:Activity" id="reduction">
    <!-- Embedding parameters probably isn't going to be so simple in
    current ProvDM, but in an example this might be ok -->
    <ATTRIBUTE dmrole="parameters">
      <COLLECTION>
        <INSTANCE dmtype="prov:Parameter">
          <ATTRIBUTE dmrole="name" value="filter profile"/>
          <ATTRIBUTE dmrole="value" value="http://whatever"/>
        </INSTANCE>
        <INSTANCE dmtype="prov:Parameter">
          <ATTRIBUTE dmrole="name" value="magic fudge parameter"/>
          <ATTRIBUTE dmrole="value" value="27"/>
        </INSTANCE>
      </COLLECTION>
    </ATTRIBUTE>
    <ATTRIBUTE dmrole="WasAssociatedWith" ref="fred"/>
  </INSTANCE>

  <INSTANCE dmtype="prov:Entity" id="reduced_mag">
    <!-- this isn't pretty, but the ProvDM authors probably didn't
    expect immediately resolvable ids; this kind of thing would need
    some thought in the mapping doc (and perhaps a fix in the DM, as
    I'd much rather use a proper ref attribute here) -->
    <ATTRIBUTE dmrole="id">#mag_v</ATTRIBUTE>
    <ATTRIBUTE dmrole="WasGeneratedBy" ref="reduction"/>
  </INSTANCE>
</TEMPLATES>

<FIELD ID="mag_v" .../>
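
To give an idea of what a client would have to do with such an annotation, here is a minimal Python sketch that walks from a FIELD to the agent behind it (Entity, then WasGeneratedBy, then WasAssociatedWith). It works on a trimmed copy of the annotation above. The helper names (`role_of`, `attributes`, `agent_for_field`) are invented for this example, and the element and role names are simply those of the example, not a checked rendering of ProvDM.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example annotation (parameters omitted).
ANNOTATION = """
<TEMPLATES>
  <INSTANCE dmtype="prov:Agent" id="fred">
    <ATTRIBUTE dmrole="name" value="Fred Hoyle"/>
  </INSTANCE>
  <INSTANCE dmtype="prov:Activity" id="reduction">
    <ATTRIBUTE dmrole="WasAssociatedWith" ref="fred"/>
  </INSTANCE>
  <INSTANCE dmtype="prov:Entity" id="reduced_mag">
    <ATTRIBUTE dmrole="id">#mag_v</ATTRIBUTE>
    <ATTRIBUTE dmrole="WasGeneratedBy" ref="reduction"/>
  </INSTANCE>
</TEMPLATES>
"""


def role_of(attribute):
    """Role key of an ATTRIBUTE; 'name' is tolerated as a fallback spelling."""
    return attribute.get("dmrole") or attribute.get("name")


def attributes(instance):
    """Map role -> ATTRIBUTE element for the direct children of an INSTANCE."""
    return {role_of(a): a for a in instance.findall("ATTRIBUTE")}


def agent_for_field(templates, field_id):
    """Return the name of the agent behind a FIELD, or None if a link is missing."""
    by_id = {i.get("id"): i for i in templates.iter("INSTANCE") if i.get("id")}
    for inst in templates.iter("INSTANCE"):
        if inst.get("dmtype") != "prov:Entity":
            continue
        attrs = attributes(inst)
        id_attr = attrs.get("id")
        # The example refers to the FIELD through a literal "#<ID>" value.
        if id_attr is None or (id_attr.text or "").strip() != "#" + field_id:
            continue
        generated_by = attrs.get("WasGeneratedBy")
        activity = by_id.get(generated_by.get("ref")) if generated_by is not None else None
        if activity is None:
            return None
        associated = attributes(activity).get("WasAssociatedWith")
        agent = by_id.get(associated.get("ref")) if associated is not None else None
        if agent is None:
            return None
        name = attributes(agent).get("name")
        return name.get("value") if name is not None else None
    return None


templates = ET.fromstring(ANNOTATION)
print(agent_for_field(templates, "mag_v"))  # Fred Hoyle
```

The lookup needs nothing beyond the standard library, though a real client would also have to handle several agents per activity and the embedded parameters.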

It's an interesting exercise to add a cutout FIELD and use it as an Entity that's used by #reduction... I'll do that on request, because I'll have to brush up on ProvDM again to confidently write such a thing.

Also note how this, to me, is an argument against the division between GLOBALS (IIRC) and TEMPLATES in the original VO-DML annotation proposal: nothing at all is saved if #fred needed to jump into GLOBALS here (or jump back to TEMPLATES if its "name" attribute became a ref to some FIELD).

mcdittmar commented 3 years ago

gilleslandais wrote: In DatasetDM, I didn't see a clear distinction between creator and author. Markus, do you have an example of this serialization in your output?

msdemlei wrote: No. But now that you mention it, what might actually be smart is to sync VOResource's "implicit" (there's no VO-DML (yet?)) data model with DatasetDM and friends. I'm not 100% sure I'd like to see a lot of VOResource in instance documents (as usual: What should clients do with it?), and we'd have to think about whether there ought to be a single "resource" DM or whether some of the types could become DMs of their own. But whatever the result of these considerations: if DM deals with Registry content, we should make sure there are no unnecessary inconsistencies.

The DatasetDM does map its content to the Resource Metadata elements, so it is, in a sense, actualizing the 'implicit' model, and it is VO-DML compliant.

As for creator/author: what distinction are you looking for?

I think 'author' implies the Dataset is a paper of some sort, rather than a Photometric Filter or LightCurve.

I've also been curious to see how Provenance will get conveyed in the context of Datasets. In the older models (Spectrum, Characterization, ObsCore) there is, if I recall, a Provenance placeholder node in the Observation/Dataset metadata area. I expect there is a distinction between identifying the Agents/Entities involved in THIS dataset (part of DatasetMetadata) vs. identifying the HISTORY of the Dataset (the Activity which created it, progenitors, etc.).

I would like to add another, concise provenance output in the VOTable (for "naive" clients).

Do you mean outside of the annotation syntax, similar to a COOSYS/TIMESYS element directly in the VOTable?

gilleslandais commented 3 years ago

OK for the DatasetDM. In the VizieR context, CDS is the publisher (Curator) and the author is the Creator (including the bibliographic reference in the same DataID), so it could be added in AssocDataDock. But you are right: including (refCode, author, year) is something that I prefer!

For measures it may be more complicated. In VizieR, the photometry filter characteristics are not part of the original data (they could be, but often they are added by CDS, which assigns a filter or a similar filter to magnitude columns). If VizieR provides the photometry characteristics, it is important to specify their origin. ProvDM is suited to that, but the parsing may be a little discouraging for clients.

So maybe a simple way could be just a comment? In that case, is it better to put the (origin) comment on Mango:Parameter.comment or on Mango:stcextend.Photometry?

mcdittmar commented 3 years ago

On Mon, Apr 19, 2021 at 1:43 PM gilleslandais wrote:

For measures it may be more complicated.

In VizieR, the photometry filter characteristics are not part of the original data (they could be, but often they are added by CDS, which assigns a filter or a similar filter to magnitude columns). If VizieR provides the photometry characteristics, it is important to specify their origin. ProvDM is suited to that, but the parsing may be a little discouraging for clients.

So maybe a simple way could be just a comment? In that case, is it better to put the (origin) comment on Mango:Parameter.comment or on Mango:stcextend.Photometry?

So, you can put the filter at the 'normal' space according to the model, but now you want to tag/record that this is something added by CDS, and not part of the original dataset... That's a good thread to include for exploring the Provenance usage within datamodel instances.

Technically, I think one answer would be that this is a NEW Dataset, created by CDS through an Activity which assigned the Filter to the original Dataset. So, your Provenance would point to the original Dataset which does not include the Filter, is created by XYZ, etc. That seems rather unappetizing in practice though.

lmichel commented 3 years ago

If the purpose of the embedded Prov is just to say whether a filter has been added by the CDS, we could consider doing things in a simpler way.

As PhotFilter still has to be wrapped into MANGO, we can add a field telling the filter origin. This is somehow similar to the reduction status Mango had at the beginning (raw, calibrated..). This value could be carried either by an enum or a vocabulary.
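
A small Python sketch of what such an origin field could look like on the client side, if it were carried by an enum. The `FilterOrigin` values and the `parse_filter_origin` helper are hypothetical; the actual list would have to come from the model or from a vocabulary.

```python
from enum import Enum


class FilterOrigin(Enum):
    """Hypothetical origin values; not taken from MANGO or any vocabulary."""
    ORIGINAL = "original"                  # filter given in the original data
    CURATOR_ASSIGNED = "curator-assigned"  # filter assigned later, e.g. by CDS
    UNKNOWN = "unknown"


def parse_filter_origin(raw: str) -> FilterOrigin:
    """Map a serialized value to the enum, falling back to UNKNOWN."""
    try:
        return FilterOrigin(raw.strip().lower())
    except ValueError:
        return FilterOrigin.UNKNOWN


print(parse_filter_origin("Curator-Assigned"))  # FilterOrigin.CURATOR_ASSIGNED
```

Falling back to `UNKNOWN` keeps a client working when a producer uses a term it does not know, which matters more for a vocabulary than for a closed enum.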

msdemlei commented 3 years ago

On Mon, Apr 19, 2021 at 10:43:02AM -0700, gilleslandais wrote:

But you are right: including (refCode, author, year) is something that I prefer!

Ah well... that's the temptation of quick and simple solutions... I'd always be in favour of those except we already have a more general and comprehensive thing for that. And there's few things worse than having two mechanisms that do the same thing -- it's the guaranteed end of interoperability. If you're unlucky, exactly half the producers will implement one but not the other, and exactly half the consumers will implement one but not the other. Then, the likelihood that an annotation can be used is one in four.

Let's not do that.

It also makes us look bad.

For measures it may be more complicated. In VizieR, the photometry filter characteristics are not part of the original data (they could be, but often they are added by CDS, which assigns a filter or a similar filter to magnitude columns). If VizieR provides the photometry characteristics, it is important to specify their origin. ProvDM is suited to that, but the parsing may be a little discouraging for clients.

It's even more discouraging if clients can't predict if they have the complex or the simple thing.

Of course, in the RFC of ProvDM I've also argued that we're introducing too much complexity in one go, so you have my sympathy when you say that full IVOA ProvDM is perhaps a little, if you will, discouraging.

But rather than building an incompatible alternative, I'd much prefer if we defined a ProvCore (say) that basically is a VO-DML mapping of W3C prov and that is a true subset of ProvDM (so full ProvDM consumers understand it). Or we just define a "pattern" how this kind of thing is to be written that people can apply without having to read all of ProvDM.
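
To make the idea of a small core concrete, here is a Python sketch of the three W3C PROV core types (Entity, Activity, Agent) with two of their relations, wasGeneratedBy and wasAssociatedWith. Nothing here is an existing "ProvCore": the class layout and the `responsible_agents` helper are assumptions for illustration, and the instance values are borrowed from the annotation example earlier in the thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Agent:
    name: str
    affiliation: Optional[str] = None


@dataclass
class Activity:
    name: str
    # W3C PROV relation wasAssociatedWith (Activity -> Agent)
    was_associated_with: List[Agent] = field(default_factory=list)


@dataclass
class Entity:
    id: str
    # W3C PROV relation wasGeneratedBy (Entity -> Activity)
    was_generated_by: Optional[Activity] = None


def responsible_agents(entity: Entity) -> List[str]:
    """Names of the agents associated with the activity that generated the entity."""
    if entity.was_generated_by is None:
        return []
    return [agent.name for agent in entity.was_generated_by.was_associated_with]


fred = Agent(name="Fred Hoyle", affiliation="University of Cambridge")
reduction = Activity(name="reduction", was_associated_with=[fred])
reduced_mag = Entity(id="#mag_v", was_generated_by=reduction)

print(responsible_agents(reduced_mag))  # ['Fred Hoyle']
```

A consumer of the full model could read such instances unchanged as long as the core stays a strict subset, which is the property argued for above.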

So maybe a simple way could be just a comment? In that case, is it better to put the (origin) comment on Mango:Parameter.comment or on Mango:stcextend.Photometry?

That's of course another thought (I'm calling it Mark's law because Mark Taylor taught it to me): Only make machine-readable what machines want to read.

What you describe sounds like something that perhaps is just fine somewhere where humans that care in a particular case can reliably find it.

mcdittmar commented 3 years ago

On Wed, Apr 28, 2021 at 5:44 AM msdemlei wrote:

But rather than building an incompatible alternative, I'd much prefer if we defined a ProvCore (say) that basically is a VO-DML mapping of W3C prov and that is a true subset of ProvDM (so full ProvDM consumers understand it). Or we just define a "pattern" how this kind of thing is to be written that people can apply without having to read all of ProvDM.

+1 on generating a pattern to map provenance information to the provenance model.