ivoa / dm-usecases

This repo gathers all the material to be used in the DM workshop 2020
The Unlicense

MANGO Annotation Scope #18

Open lmichel opened 3 years ago

lmichel commented 3 years ago

This issue is a fork of #12 that diverged from the initial dependent axes topic

Last message (https://github.com/ivoa/dm-usecases/issues/12#issuecomment-802901294):

On Fri, Mar 19, 2021 at 07:23:56AM -0700, Laurent MICHEL wrote:

The scope of the annotations must go beyond simple column annotations, which must remain supported though. I detailed it here, section 2.

I'm starting to be unsure whether we are actually disagreeing on much here -- and I've not found anything in that section 2 that I'd need to contradict.

So, perhaps a clarification: is my time series use case "single column annotation", and if so, why? What actual usage would go beyond what's possible there?

My point is that, since we have a self-consistent model made of a hierarchy of elements identified with dmtype, dmrole and other things, the annotation must be something matching that structure.

Well, the thing with dmrole and dmtype to me is the annotation, but I think what you're saying here is that the annotation should be directly derived from the model.

That I wholeheartedly agree with, and that's why I'm so concerned about the current MCT proposal -- if it were some abstract musing, I'd be totally ok with it. But when the model defines the annotation structure, whatever we do in the model has concrete operational consequences. Which, mind you, is fine -- we'll have to deal with them somewhere and the DM is the right place for that.

Once you have it, you can use accessors based on those identifiers. That is what I call a public API: it does not refer to any native data element but only to model elements

...and I still cannot figure out why you want this -- after all, the point of the whole exercise IMNSHO is to add information to VOTables (and later perhaps other container formats) that is not previously in there.

What would the use case for your free-floating annotation be, if this is what you are proposing?

In the examples I showed in these use cases, I transform the annotation into Python dictionaries that are easily serializable to JSON (a good point for data exchange).

In pseudo code, this would look like this:

 import sys

 # AnnotationReader is the hypothetical annotation API under discussion
 annotation_reader = AnnotationReader(my_votable)
 if not annotation_reader.support("mango"):
     sys.exit(1)

 mango_instance = annotation_reader.get_first_row()
 print(mango_instance.get_measures())
 # ["pos", "magField"]
 print("Magnetic field is: " + str(mango_instance.get_measure("magField")))
 # Magnetic field is: 1.23e-6T +/- 2.e-7

This wouldn't require Python classes implementing the model (fundamental point)

I claim that the annotation must be designed in a way that allows this in addition to basic usages.
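A minimal sketch of what such model-driven accessors could look like once the annotation has been turned into plain Python dictionaries. The dictionary layout, the `mango:Source` type name and the helpers `get_measures` and `get_measure` are assumptions made for illustration; they only mirror the pseudo code above and are not taken from any MANGO specification. No model classes are involved.

```python
import json

# Hypothetical dictionary form of one annotated row; keys and values are
# illustrative only.
mango_instance = {
    "dmtype": "mango:Source",
    "measures": {
        "pos": {"value": [10.68, 41.27], "unit": "deg"},
        "magField": {"value": 1.23e-6, "error": 2.0e-7, "unit": "T"},
    },
}


def get_measures(instance):
    """Return the roles of all measures carried by the instance."""
    return list(instance["measures"])


def get_measure(instance, role):
    """Return the measure stored under the given role."""
    return instance["measures"][role]


print(get_measures(mango_instance))
# The same dictionaries serialize to JSON without any extra mapping step.
print(json.dumps(get_measure(mango_instance, "magField")))
```

The accessors only use model-level identifiers (roles), never column names or FIELD IDs, which is the property argued for here.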

-- but why would you want to do this JSON serialisation? Wouldn't it be much better overall to just put that value into a VOTable and transmit that rather than fiddle around with custom JSON dictionaries? In particular when there are quite tangible benefits if you make it explicit in the model what exactly it is that you're annotating?

By the way, if by "wouldn't require Python classes" you mean "You don't have to map model classes into python classes" then yes, I agree, that is a very desirable part of anything we come up with. Let's avoid code generators and similar horrors as much as we can. Nobody likes those.

Let's suppose that all Vizier tables come with such annotations; the same API code could then get many things:

  • Basic quantities (no significant gain I admit)
  • Complex quantities (e.g. complex errors)
  • Column grouping
  • Status values
  • Associated data or services

I agree to all these use cases (except, as I said, even for basic quantities the gain is enormous because we can finally express frames, photometric systems, and the like in non-hackish ways).

But: which of these use cases would you miss with the non-entangled, explicit-reference models?

lmichel commented 3 years ago

So, perhaps a clarification: is my time series use case "single column annotation", and if so, why? What actual usage would go beyond what's possible there?

Any usage that mixes columns together (e.g. error matrix, column grouping)

lmichel commented 3 years ago

we do in the model has concrete operational consequences. Which, mind you, is fine -- we'll have to deal with them somewhere and the DM is the right place for that.

If you change the model in a way that breaks backward compatibility, you will get concrete operational consequences however you associate the model with the data.

lmichel commented 3 years ago

...and I still cannot figure out why you want this -- after all, the point of the whole exercise IMNSHO is to add information to VOTables (and later perhaps other container formats) that is not previously in there.

At least 2 reasons for targeting this:

  1. I would be happy if I could develop my client just by reading the model spec without fighting with VOTable elements (supposing that someone provided me with a low level library doing the dirty job)
  2. The comparison between 2 datasets is straightforward if the quantities are 100% certified model compliant.

lmichel commented 3 years ago

-- but why would you want to do this JSON serialisation?

I DO NOT want this JSON serialisation. It is both an example for our discussion and a convenient way to exercise and validate my proposal. JSON is however a convenient way to exchange data whatever their complexity. Let's imagine I spot a very interesting source in my VOTable and I want to share it with another client (e.g. by SAMP). No doubt the best way to do it would be to send a JSON MANGO (or whatever model) instance of that source.

lmichel commented 3 years ago

Let's avoid code generators and similar horrors as much as we can. Nobody likes those.

At least a clear point of agreement

lmichel commented 3 years ago

But: which of these use cases would you miss with the non-entangled, explicit-reference models?

For TS or spectra I refer you to the responses by @mcdittmar

For the catalog case, let's talk MANGO. I cannot figure out what the MANGO entanglement level is, so just have a look at it.

Mango is a simple model with 2 docks (containers).

The content of those docks is totally free (non-entangled components?)

They are designed to carry any metadata we need to perfectly describe any measure, so that Mango instances are self-consistent. If by some magic you need to handle some of them outside the VOTable scope (SAMP, datalink...), I'd expect them to be complete.

This is not false, but this is an annotation issue. If I have a unit in some model leaf, my annotation scheme must be able to say that this unit comes from that FIELD. After this, whether or not to resolve such references is the client's business.
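To make the "this unit comes from that FIELD" idea concrete, here is a small sketch using only the Python standard library. The `ATTRIBUTE` element with `dmrole` and `ref` is a hypothetical annotation syntax invented for this example, not the syntax of any proposal in this thread; only the VOTable `FIELD` element with its `ID` and `unit` attributes is real.

```python
import xml.etree.ElementTree as ET

# Toy document: ATTRIBUTE/dmrole/ref form a hypothetical annotation syntax,
# used only to illustrate a model leaf pointing at a FIELD.
DOC = """
<TABLE>
  <FIELD ID="col_mag" name="mag" datatype="float" unit="mag"/>
  <ATTRIBUTE dmrole="unit" ref="col_mag"/>
</TABLE>
"""


def resolve_unit(table, attribute):
    """Follow the attribute's ref to a FIELD and return that FIELD's unit."""
    for field in table.iter("FIELD"):
        if field.get("ID") == attribute.get("ref"):
            return field.get("unit")
    return None  # unresolved reference: left to the client to handle


table = ET.fromstring(DOC)
attribute = table.find("ATTRIBUTE")
print(resolve_unit(table, attribute))
```

A client that does not care about the unit can simply skip the call; the reference stays in the annotation either way.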

msdemlei commented 3 years ago

On Fri, Mar 19, 2021 at 11:01:29AM -0700, Laurent MICHEL wrote:

we do in the model has concrete operational consequences. Which, mind you, is fine -- we'll have to deal with them somewhere and the DM is the right place for that.

If you change the model in a way that breaks backward compatibility, you will get concrete operational consequences however you associate the model with the data.

Yes, but the question is: Will changing one model in this way take the entire rest of the annotation with it or will the remaining annotation keep working? This is what the entanglement problem is about.

Additionally, in the explicit-annotation scheme, it's simple to keep the old annotation around (it's just one extra INSTANCE, and as you see interesting columns can very simply have a dozen annotations), so there's no problem writing VOTables that just work for old and new clients as long as you still care to keep old clients operational.

msdemlei commented 3 years ago

On Fri, Mar 19, 2021 at 10:52:13AM -0700, Laurent MICHEL wrote:

So, perhaps a clarification: is my time series use case "single column annotation", and if so, why? What actual usage would go beyond what's possible there?

Any usage that mixes columns together (e.g. error matrix, column grouping)

Could you recommend a specific one that I should tackle to show that this kind of thing is of course possible with explicit referencing?

lmichel commented 3 years ago

Could you recommend a specific one that I should tackle to show that this kind of thing is of course possible with explicit referencing?

lmichel commented 3 years ago

Yes, but the question is: Will changing one model in this way take the entire rest of the annotation with it or will the remaining annotation keep working? This is what the entanglement problem is about.

IMO, the annotation must be faithful to the model, but does not require the model to be totally mapped. Only data present in the dataset have to be mapped. The rest can (must) be ignored. The mapping block represents a subset of the model. If the model changes preserve backward compatibility, the 'old' annotations remain consistent and the interoperability between datasets mapped with different DM versions is preserved.

If you are saying that clients must be updated to take advantage of new model features, you are right, whatever the annotation scheme is; this is simply because new model class => new role => new processing.

msdemlei commented 3 years ago

On Mon, Mar 22, 2021 at 08:52:59AM -0700, Laurent MICHEL wrote:

Yes, but the question is: Will changing one model in this way take the entire rest of the annotation with it or will the remaining annotation keep working? This is what the entanglement problem is about.

IMO, the annotation must be faithful to the model, but does not require the model to be totally mapped. Only data present in the dataset

If this means that we need to be very careful with what attributes we make mandatory in our models, I totally agree.

have to be mapped. The rest can (must) be ignored. The mapping block represents a subset of the model. If the model changes preserve backward compatibility, the 'old' annotations remain consistent and the interoperability between datasets mapped with different DM versions is preserved.

Yes -- that's a minor version. These aren't a (large) problem, and indeed I'm claiming that our system needs to be built in a way that clients don't even notice minor versions unless they really want to (which, I think, so far is true for all proposals).

If you are saying that clients must be updated to take advantage of new model features, you are right, whatever the annotation scheme is; this is simply because new model class => new role => new processing.

No, that is not my point. My point is what happens in a major version change. When DM includes Coord and Coord includes Meas and you now need to change Meas incompatibly ("major version"), going to Meas2 with entangled DMs will require new Coord2 and DM2 models, even if nothing changes in them, simply to update the types of the references -- which are breaking changes.

With the simple, stand-alone models, you just add a Meas2 annotation, and Coord and DM remain as they are. In an ideal world, once all clients are updated, we phase out the legacy Meas annotation. The reality is of course going to be uglier, but still feasible, in contrast to having to re-do all DM standards when we need to re-do Meas.
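As a rough illustration of the coexistence argument, the sketch below keeps a legacy and a new annotation on the same column and lets each client pick the one it understands. The dictionary layout, the `meas`/`meas2` prefixes and the attribute names are assumptions made up for this example; none of them is taken from an actual model.

```python
# Hypothetical annotations attached to one column: the legacy Meas
# annotation is kept next to the new Meas2 one.
annotations = [
    {"dmtype": "meas:Measurement", "value": "col_flux", "error": "col_flux_err"},
    {"dmtype": "meas2:Measurement", "value": "col_flux", "stat_error": "col_flux_err"},
]


def pick_annotation(annotations, supported_prefixes):
    """Return the first annotation whose model prefix the client supports.

    supported_prefixes is ordered by client preference.
    """
    for prefix in supported_prefixes:
        for annotation in annotations:
            if annotation["dmtype"].split(":", 1)[0] == prefix:
                return annotation
    return None


old_client = pick_annotation(annotations, ["meas"])
new_client = pick_annotation(annotations, ["meas2", "meas"])
print(old_client["dmtype"], new_client["dmtype"])
```

An old client keeps working on the legacy annotation while an updated one prefers the new model, without any other annotation being touched.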

msdemlei commented 3 years ago

On Mon, Mar 22, 2021 at 08:32:46AM -0700, Laurent MICHEL wrote:

Could you recommend a specific one that I should tackle to show that this kind of thing is of course possible with explicit referencing?

  • Column grouping here. This is based on a real Vizier table

I'm afraid I don't really understand this use case: what are clients expected to do with this grouping information? Without that, it's hard to make any meaningful annotation.

Looking at your annotation, I'm wondering in particular: which client should consume the ucd, description and unit annotations from the INSTANCE-s rather than from the FIELD-s where they already are, and a lot more easily accessible?

  • Error matrix: here. This is based on a mock VOTable that I wrote to test my code. The real use case is Gaia, and testing this feature on it is still planned

I've added an annotation to this table and made a PR. I still believe we don't have a credible use case for annotating covariances yet, which is why I'm using "meas2": Once clients start doing interesting things with DM annotations, we can, I believe, start thinking about doing tricks like these. Having said that, I've written code that uses this annotation to do something halfway interesting using my astropy annotation implementation:

https://github.com/msdemlei/astropy#working-with-covariance

lmichel commented 3 years ago

I'm afraid I don't really understand this use case

Looking at your annotation,

Again do not mix model and annotation

I've added an annotation to this table and ...

lmichel commented 3 years ago

No, that is not my point. My p

Continued in #24

msdemlei commented 3 years ago

On Fri, Mar 19, 2021 at 11:08:24AM -0700, Laurent MICHEL wrote:

At least 2 reasons for targeting this:

  1. I would be happy if I could develop my client just by reading the model spec without fighting with VOTable elements (supposing that someone provided me with a low level library doing the dirty job)

Hm -- complicating things a great deal to perhaps simplify standards development a bit doesn't sound like a good deal to me.

Wouldn't you agree that out in the field, people should be taking the annotation from the VOTables? If what you're saying instead is "aw, VOTable is inconvenient, let's invent something else that people should be consuming", I'd become fairly nervous.

Mind you, there's nothing wrong with thinking of alternative representations of this stuff, and indeed, for DaCHS I'm already telling people to add the annotations in a quick and compact way -- http://docs.g-vo.org/DaCHS/ref.html#annotation-using-sil --, but that shouldn't drive our design. Let's not complicate matters even more by imagining we ought to magically fix, say, CSVs (where, of course, that's still possible by inventing a clever scheme in the spirit of FITS+, but that ought to be an afterthought).

  2. The comparison between 2 datasets is straightforward if the quantities are 100% certified model compliant.

But wouldn't such a comparison happen in a client after it's parsed and deserialised the instances into whatever representation it chooses? Where would such an abstract "normalise-and-compare" operation play a role?

msdemlei commented 3 years ago

On Wed, Mar 24, 2021 at 08:03:45AM -0700, Laurent MICHEL wrote:

I'm afraid I don't really understand this use case

  • This is a Vizier usecase, more to say.

Yes, but what is the use case, i.e., what sort of functionality should be enabled? Without that, it'd be exceedingly hard to say anything.

  • I repeated several times that the model must be self-consistent and independent of any particular dataset.

(see other issue #18)

  • I would say that the issues page is not the right place to question one of the use cases that have been proposed and validated about 2 months ago.

Perhaps, but I'd say a use case must be explicit on use, which I submit entails saying "A client wants to..." or something similar. Is there something like that for this grouping thing?

lmichel commented 3 years ago

See the Wiki post by Gilles. Some catalogs may have columns that give extra information about a particular quantity (e.g. quality flag, statistical sample size...). A client could hide such associated information at first and then show it on demand (e.g. with a tooltip)

Another use case, a bit aside, is shown here as an alternative way to define (in)dependent axes. The independent axis is represented by a parameter and all the dependent axes are its associated parameters.

lmichel commented 3 years ago

Hm -- complicating things a great deal to perhaps simplify standards development a bit doesn't sound like a good deal to me.

I wouldn't say that using annotations faithful to the model complicates things. It is rather the opposite.

Wouldn't you agree that out in the field, people should be taking the annotation from the VOTables?

Yes I do, I even plead for these annotations, read in the VOTable, to bear the structure of the model.

But wouldn't such a comparison happen in a client after it's parsed and deserialised the instances into whatever representation it chooses? Where would such an abstract "normalise-and-compare" operation play a role?

You are pointing at the root of our disagreement:

  1. You propose to let clients deal with the model if they want to, and to provide them with the minimum needed to do so.
  2. I (with @mcdittmar) propose to provide clients with model instances that can be parsed as such.

I am not saying that your approach is inappropriate, but I claim that it makes the job tougher for clients for little benefit, whereas your way of consuming data does work with my annotation scheme. This is not a good deal.

The way out of this discussion is likely somewhere in this topic

msdemlei commented 3 years ago

On Fri, Mar 26, 2021 at 01:08:44AM -0700, Laurent MICHEL wrote:

See the Wiki post by Gilles. Some catalogs may have columns that give extra information about a particular quantity (e.g. quality flag, statistical sample size...). A client could hide such associated information at first and then show it on demand (e.g. with a tooltip)

Gilles' original example has a clear use case -- that's measurement annotation with "plotting error bars" and in a bright future perhaps "automatic error propagation".

A Measurement model that does not needlessly multiply the number of classes by mingling in various sorts of physics covers this use case perfectly (and note that no work needs to be done if someone has a new sort of thing with errors in that case, and clients still don't have to put up with vague "columns related in some way" annotation).

The examples further down "limit flags or notes", "flags on magnitudes" can be trivially solved by a class (say) relatedData that I'd probably put into a source DM (but perhaps we could find a better place for that once we better understand where this kind of thing actually happens).

The annotation would then trivially be:

<INSTANCE dmtype="src:relatedData">
  <COLLECTION>
    <ITEM ref="themag"/>
    <ITEM ref="flag_on_the_mag"/>
  </COLLECTION>
</INSTANCE>

Should I put this into the dm-usecases repo? It almost seems a bit too trivial to me...
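For what it's worth, a client could consume such an annotation roughly as follows. This is only a sketch with the Python standard library: the `fields_by_id` mapping and the column names in it are made up for the example, and in a real client they would come from the FIELD elements of the table.

```python
import xml.etree.ElementTree as ET

ANNOTATION = """
<INSTANCE dmtype="src:relatedData">
  <COLLECTION>
    <ITEM ref="themag"/>
    <ITEM ref="flag_on_the_mag"/>
  </COLLECTION>
</INSTANCE>
"""

# Hypothetical mapping from FIELD IDs to column names.
fields_by_id = {"themag": "Vmag", "flag_on_the_mag": "f_Vmag"}


def related_columns(instance, fields_by_id):
    """Return the names of the columns referenced by a relatedData instance."""
    refs = [item.get("ref") for item in instance.iter("ITEM")]
    return [fields_by_id[ref] for ref in refs]


instance = ET.fromstring(ANNOTATION)
print(related_columns(instance, fields_by_id))
```

With the group resolved to column names, hiding the flag column by default and showing it in a tooltip is purely a display decision on the client side.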

In http://viz-beta.u-strasbg.fr/viz-bin/Mango?-out.max=10&-source=I/322A/out&-out.all=1, it seems you're doing something very much different from grouping different columns. The associatedDataDock annotation looks more like an "associated link" thing, and I'd respectfully ask that you check again if this is something for DM annotation or if this wouldn't be much better addressed using Datalink -- it sucks for everyone if there's multiple ways to do (about) the same thing.

Another use case, a bit aside, is shown here as an alternative way to define (in)dependent axes. The independent axis is represented by a parameter and all the dependent axes are its associated parameters.

Hm... to me, that's a bit of an argument against doing this. If this "related columns" thing lets people do what we thought ndcube should be doing, then I'd say one of the two should go.

lmichel commented 3 years ago

it seems you're doing something very much different from grouping different columns. The associatedDataDock annotation looks more like an "associated link" thing,

associatedDataDock has nothing to do with the associated parameters.