ivoa / dm-usecases

This repo gathers all the material to be used in the DM workshop 2020.

Impact on the model change #24

Open lmichel opened 3 years ago

lmichel commented 3 years ago

This important issue comes in continuation of MANGO Annotation Scope.

It continues the discussion whose content is recalled here:

[...] have to be mapped. The rest can (must) be ignored. The mapping block represents a subset of the model. If the model changes keep backward compatibility, the 'old' annotations remain consistent and the interoperability between datasets mapped with different DM versions is preserved.

Yes -- that's a minor version. These aren't a (large) problem, and indeed I'm claiming that our system needs to be built in a way that clients don't even notice minor versions unless they really want to (which, I think, so far is true for all proposals).

If you are saying that clients must be updated to take advantage of new model features, you are right, whatever the annotation scheme is; this is just because: new model class => new role => new processing.

No, that is not my point. My point is what happens in a major version change. When DM includes Coord and Coord includes Meas, and you now need to change Meas incompatibly (a "major version"), going to Meas2 with entangled DMs will require new Coord2 and DM2 models, even if nothing changes in them, simply to update the types of the references -- which are breaking changes.

With the simple, stand-alone models, you just add a Meas2 annotation, and Coord and DM remain as they are. In an ideal world, once all clients are updated, we phase out the legacy Meas annotation. The reality is of course going to be uglier, but still feasible, in contrast to having to re-do all DM standards when we need to re-do Meas.

lmichel commented 3 years ago

I am very concerned by the question of model changes. But I cannot figure out how a class describing a measure could be upgraded in a way that breaks backward compatibility.

Do you have an example?

msdemlei commented 3 years ago

On Wed, Mar 24, 2021 at 08:16:55AM -0700, Laurent MICHEL wrote:

I am very concerned by the question of model changes. But I cannot figure out how a class describing a measure could be upgraded in a way that breaks backward compatibility.

Do you have an example?

Right now, Meas depends on Coords, and given that that hasn't seen a lot of real-world usage, I'm rather sure we will want to fix things because they're just too clumsy. For instance, I suspect everyone will want time0 to just be a JD, and nobody will get it right if the offset and the value are in different frames (but that's really just an example, don't get hung up on this particular example too much).

If we change time0 to a float, that's a breaking change, and that would then take Meas with it because it references Coords.

I grant you that in that particular instance we could fudge it by just adding a time0float attribute and telling people not to use time0, and we'd get away with it, but that's one more wart I'll have to apologise for when trying to get people to take up our standards. Planning on accumulating hacks like that is something I'd really try to avoid as long as there's an alternative that, for all I can tell, will work at least as well for the plausible use cases.

More generally: The assumption that we've gotten something right that has by and large not been consumed in practice seems dangerous to me. I'm saying that as someone who's been trying to fix things that turned out wrong in Registry (which, mind you, has been a surprisingly robust design). I'm mentioning rights in VOResource and, in particular, the whole caproles mess (https://ivoa.net/documents/caproles/).

lmichel commented 3 years ago

I cannot imagine an int -> real cast breaking anything. All clients are able to deal with this.
I remember an old discussion about data typing in models, and sometimes I regret the absence of a number type in VODML.

I tried to figure out different situations where model changes would be lethal.

  1. Inappropriate downcasting (e.g. string -> numeric)
  2. Splitting classes (e.g., stupidly using one object for RA and another for Dec instead of one for [RA,DEC])
  3. Merging classes (Coords attributes moved to the Measure class)
  4. Removing things (no more link between CoordSys and CoordFrame)
  5. Renaming things (long -> longitude)

In real life, none of these is a real threat. The analogy with caproles must be handled carefully, because our models are talking about physical quantities observed in the sky, while caproles is talking about computer protocols, which are completely abstract and flexible things.

Considering the worst case, all of these would break interoperability, which is a more serious issue than encompassing-model versioning:

I would even say that having an encompassing model is safer, since the upgrade process guarantees that all components can still work together, both structurally and semantically.

msdemlei commented 3 years ago

On Fri, Mar 26, 2021 at 04:14:12AM -0700, Laurent MICHEL wrote:

I cannot imagine an int -> real cast breaking anything. All clients are able to deal with this.

The time0-thing would be a change from a complex time instance to just a real, so that would hurt.

  1. Inappropriate downcasting (e.g. string -> numeric)
  2. Splitting classes (e.g., stupidly using one object for RA and another for Dec instead of one for [RA,DEC])
  3. Merging classes (Coords attributes moved to the Measure class)
  4. Removing things (no more link between CoordSys and CoordFrame)
  5. Renaming things (long -> longitude)

In real life, none of these is a real threat. The analogy with caproles must be handled carefully, because our models are talking about physical quantities observed in the sky, while caproles is talking about computer protocols, which are completely abstract and flexible things.

Well, our models are just as flexible, and indeed going from STC1 to STC2 to MCT we've already seen two breaking changes (and, actually, many more in between). So, frankly, I don't buy the reasoning.

Considering the worst case, all of these would break interoperability, which is a more serious issue than encompassing-model versioning:

  • Data providers have to revise their annotation procedures
  • Client code has to deal with the new version and, even worse, manage the cohabitation of both versions.

That's the flag day that I claim non-entangled models will prevent: Data providers have years or decades to migrate -- and most of their annotations won't even have to change, as they are unaffected by other models' version changes.

Clients can easily pick the most appropriate annotation they can deal with; this is why the use case https://github.com/msdemlei/astropy#choosing-the-most-expressive-annotation is something I'd really like to see canonised...
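
To make that concrete, here is a minimal sketch of such a client in Python (the annotation dictionary shape and the model prefixes are illustrative assumptions, not part of any proposal):

# Hypothetical sketch: annotations already parsed into blocks keyed by
# model prefix. The client walks its preference list (most expressive
# first) and uses the first block it understands, ignoring the rest.

PREFERENCE = ["meas2", "meas"]          # newest model first

def pick_annotation(annotations):
    for prefix in PREFERENCE:
        if prefix in annotations:
            return annotations[prefix]  # the most expressive one we know
    return None                         # nothing understood: degrade gracefully

# A file annotated with both versions then serves old clients (which only
# know "meas") and new ones (which prefer "meas2") at the same time.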

Bonnarel commented 3 years ago

My 2 cents on this. What kind of situation can we imagine for a model change? Where does it have an impact when we are considering data transport using VOTable?

Why is one (or several) of the models changing? I can imagine two reasons:

lmichel commented 3 years ago

going from STC1 to STC2

Good example. I guess you agree that the major concern about moving from STC1 to MCT is not the update of the host models (which do not exist anyway). This is my point. Let's imagine that I have, e.g., a model named lm_dm_1 that embeds stc1.13, Char_1 and dataset_1.5. Now stc1.13 has been moved to stc2: in consequence I have to update my model to lm_dm_2 (stc2, Char_1, dataset_1.5). My points are:

  1. upgrading lm_dm is no big deal
  2. using lm_dm_2 might be safer than using individual (stc2, Char_1, dataset_1.5) instances, since this guarantees that all three models are compatible with each other (avoiding, e.g., vocabulary mismatches).
  3. last but not least: lm_dm (1 or 2) keeps giving a complete description of the modeled objects (e.g. Cubes) outside the scope of any particular data container. I know you deny this requirement, but I insist that this is the key to interoperability.

msdemlei commented 3 years ago

On Wed, Mar 31, 2021 at 02:05:18AM -0700, Bonnarel wrote:

My 2 cents on this. What kind of situation can we imagine for a model change? Where does it have an impact when we are considering data transport using VOTable?

Why is one (or several) of the models changing? I can imagine two reasons:

Well, the most common reason is: we simply did it wrong. As someone who did it wrong several times already, I think I'm entitled to say that. To mention my worst goof: Using ParamHTTP interfaces in TAPRegExt, which blew up badly more than half a decade later ("caproles"). It happened because I hadn't understood the full problem, didn't see the long-term consequences, and generally shunned the work of defining an extra type for TAP-like interfaces.

You could say other people are less lazy, think deeper, and see farther, and they probably do. But who knows, perhaps one day I'll make a DM, and then it'd be reassuring to know that if I get it as wrong as the interfaces in TAPRegExt, the VO can shed my mistake without taking the whole DM system with it.

  - The key thing is the independence of the annotation and of the data.

Could you explain this a bit more? Do you mean lexical independence (e.g., annotation sits in an element of its own rather than, say, in FIELD's utype attributes)? Or semantic independence (in which case you'd have to explain how that should work)? Or yet something else?

To me, I'd say the annotation depends strongly, and ideally "injectively" on the data's structure (i.e., different structures will have different annotations) and not at all on the data (which rules out follies like having pieces of photometry metadata in column values).

Conversely, data and data structure do not depend at all on the annotation (which is less obvious than it may sound, but it in particular means that you can attach as many different annotations to data structures as you like).

msdemlei commented 3 years ago

On Wed, Mar 31, 2021 at 07:05:13AM -0700, Laurent MICHEL wrote:

going from STC1 to STC2

Good example. I guess you agree that the major concern about moving from STC1 to MCT is not the update of the host models (which do not exist anyway). This is my point.

Not sure what "host model" means, but certainly no STC1 client will understand any STC2, even if we had an annotation scheme covering both (and, frankly, I think a really good STC model would deviate even further from STC1 than current Coords, but that's another matter).

Let's imagine that I have, e.g., a model named lm_dm_1 that embeds stc1.13, Char_1 and dataset_1.5. Now stc1.13 has been moved to stc2: in consequence I have to update my model to lm_dm_2 (stc2, Char_1, dataset_1.5). My points are:

  1. upgrading lm_dm is no big deal

Not for you as the model author, of course. But for all of your clients. Which break, and cannot read/find Char_1 and dataset_1.5 metadata any more, although they'd be perfectly capable of understanding what's in there if it weren't for the changed container.

True, perhaps updating them would again be no big deal. Except that the client code still needs to be maintained and released and, most difficult of all, distributed. All of which are big deals in practice in a huge and diverse system like the VO.

This kind of, if you will, wanton destruction of existing and perfectly maintainable functionality is what I think most of this discussion is (or ought to be) about.

  2. using lm_dm_2 might be safer than using individual (stc2, Char_1, dataset_1.5) instances, since this guarantees that all three models are compatible with each other (avoiding, e.g., vocabulary mismatches).

Umm -- could you be a bit more specific here? What annotation in a non-entangled Char_1 could possibly break when you change stc?

And since you're mentioning vocabularies, given we're in RFC for that I'd be particularly interested in your concerns about their interaction with DMs and their mutual compatibility.

  3. last but not least: lm_dm (1 or 2) keeps giving a complete description of the modeled objects (e.g. Cubes) outside the scope of any particular data container. I know you deny this requirement, but I insist that this is the key to interoperability.

Perhaps this is a point we'll have to discuss interactively, because I can't help feeling it's obvious that it's the other way round: if you kill all DM annotations for all existing clients when you change a single DM incompatibly, that's the end of interoperability as soon as evolution sets in.

Sure, you can hope that models won't evolve, or you can hope that by that time all you have to replace are a few web apps on AstroBook (TM) and hence you don't need interoperability in the first place -- but both of these I'd consider rather undesirable.

mcdittmar commented 3 years ago

On this topic, I'm finding myself agreeing more with Markus (ack!). To a certain level anyway, as I agree with Laurent's assertion 3 in this comment.

Models are going to change, I think that's pretty much a given.

VODML compliant models import specific versions of their dependencies

This is true, whether a major version or minor version change.

What is the impact?

I don't think decoupling the models makes this go away:

Where it DOES have a big impact is on the annotation. This is probably a good case to mock-up (since we don't have multiple versions of any models) and annotate.
With the current model/annotation relations, I believe the client would need to:

If decoupled

If we are considering the case I'm currently working on, Master(Source) with associated Detections(Source) and associated LightCurve(SparseCube), this could add up to serious real estate.

Up to now, I've considered all this "the cost of doing business", and am comfortable with that position. But, after seeing the ModelInstance in Mango, maybe this needs more serious consideration. I had an idea this morning, inspired by this, which may be a good compromise. It could allow looser coupling of models, but still have definitive links between them for verification/interoperability (i.e., no ivoa:anyType stuff). Once I've thought that through a bit, I'll post it in a new thread.

lmichel commented 3 years ago

[@msdemlei] And since you're mentioning vocabularies, given we're in RFC for that I'd be particularly interested in your concerns about their interaction with DMs and their mutual compatibility.

Message understood... but for the immediate time, the discussion is rather focused on the DM concept itself.

Perhaps this is a point we'll have to discuss interactively...

Sure. I'm unable to connect the existence of Cube with the disaster that you announce.

lmichel commented 3 years ago

I'm in line with @mcdittmar's summary.

I would just remind that we are talking about modeling physical entities. The comparison with what happened with protocols (e.g. caproles) must be considered with a lot of care.

My expectation is that the introduction of new players (e.g. radio) won't break existing stuff but introduce new patterns.

I'm pretty sure that changes to model components that break backward compatibility (no example to give) won't be endorsed by data providers or client developers either.

Let's imagine that it happens anyway.

This bad situation would take place with @msdemlei's scheme, @mcdittmar's, or mine. As I said some posts ago, generating e.g. a new CUBE VODML/XML won't be the major difficulty in sorting this case out.

mcdittmar commented 3 years ago

I'm in line with @mcdittmar's summary.

Yay! I always like hearing that!

My expectation is that the introduction of new players (e.g. radio) won't break existing stuff but introduce new patterns.

  • e.g. Radio dish FoVs seem a bit more complex than simple cones.

I think the most likely breaks will come from us having concrete objects defined in a model which we later find need to be abstract in order to support branching by different domains. That is the main reason I have elements like the abstract Uncertainty type in the Measurements model: to help guard against major version changes. I grudgingly removed the abstract Point from Coords on the last iteration, and with this Mango work, we're finding an interest in restoring the space-centric LonLatPoint/SphericalPoint. This would be a major version update in Coords.

  • Data providers will have to annotate with both versions until all clients support them (will likely never occur)
  • Clients will have to support both versions as soon as one data provider uses the new version (will likely occur).

I was thinking about this, and it seems more likely (no evidence) that the clients would prefer to output as V1 OR V2 at the user's request, rather than annotating to both in the same output.

lmichel commented 3 years ago

I agree that the condition for the risk (as pointed out by @msdemlei) of breaking models with new features to be very low is that models have abstract classes, i.e. things that can be extended without altering existing stuff.

MANGO showed (too much, apparently) the ability to extend MCT without breaking anything.

msdemlei commented 3 years ago

On Wed, Apr 07, 2021 at 05:59:07AM -0700, Mark Cresitello-Dittmar wrote:

I was thinking about this, and it seems more likely (no evidence) that the clients would prefer to output as V1 OR V2 at the user's request, rather than annotating to both in the same output.

Hm... First, it'll generally be the servers that output annotated data, with clients consuming them.

That has the consequence that annotated data may sit around for decades without being touched. And while it is true that on that time frame, it's clear that certain annotations won't be understood any more by new clients, with small, independent data models there's a good chance that most annotations will still work (Modern client: "Ah, this is a position in ICRS; but this meas:NaiveMeasurement thing that's in there I've forgotten about ages ago") whereas with the big God model it's virtually certain the new client will not recognise anything ("what's this ivoa-timeseries1:Root thing again?").
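
(As a hedged sketch of that graceful degradation, in Python with made-up handler names and annotation structure: each small model gets its own handler, and unknown prefixes are simply skipped.)

def handle_position(block):
    # the one model this client still knows
    print("position in", block.get("frame", "unknown frame"))

HANDLERS = {"coords": handle_position}   # models this client understands

def process(annotation_blocks):
    for prefix, block in annotation_blocks:
        handler = HANDLERS.get(prefix)
        if handler is not None:
            handler(block)               # understood: use it
        # unknown model (e.g. a long-forgotten meas:NaiveMeasurement):
        # ignore it and keep going; the rest of the file stays usable

process([("coords", {"frame": "ICRS"}), ("meas", {"error": 0.1})])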

Conversely, clients have a way of hanging around for decades in specialised workflows. Again, with small, independent models, they will keep understanding the majority of the annotations even if confronted with data produced long after they were written, whereas they'll be entirely broken on the first major change with the God model.

In the VO, you just can't "roll out version 2" -- you'll always have a wild mixture of modern and legacy services and modern and legacy clients, even 20 years from now. That's why it's so useful to limit the damage radius of breaking changes.

Bonnarel commented 3 years ago

Well, the most common reason is: we simply did it wrong. [...] But who knows, perhaps one day I'll make a DM, and then it'd be reassuring to know that if I get it as wrong as the interfaces in TAPRegExt, the VO can shed my mistake without taking the whole DM system with it.

Well, I think "do it wrong" is close to "does not allow an optimal interpretation". Of course this can always happen to anybody. This doesn't imply we have to let the client manage alone with the relationships between our bits and pieces.

The key thing is the independence of the annotation and of the data.

Could you explain this a bit more? Do you mean lexical independence (e.g., annotation sits in an element of its own rather than, say, in FIELD's utype attributes)? Or semantic independence (in which case you'd have to explain how that should work)? Or yet something else? [...]

I clearly meant lexical independence. We can clearly imagine two strategies: either you really map your data structure onto your model (and this requires a new schema each time you change the model -- the thing we did with distinct XML schemas for each DM 15 years ago), or you add an (evolving) mapping layer on top of more stable (VO)Tables.

Bonnarel commented 3 years ago

[...] with small, independent models, they will keep understanding the majority of the annotations even if confronted with data produced long after they were written, whereas they'll be entirely broken on the first major change with the God model. In the VO, you just can't "roll out version 2" -- you'll always have a wild mixture of modern and legacy services and modern and legacy clients, even 20 years from now. That's why it's so useful to limit the damage radius of breaking changes.

But we are not dealing with God models when speaking of TimeSeries or sparse Cubes or a Source model with Parameters. We have real situations which are meaningful across projects and across wavelengths (or even across messengers), and we want to interoperate them. Rapidly, the way the things organize becomes complex. Very often we find tables with several different positions, times, magnitudes. Is there only one independent time, on which the others, like fluxes or whatever, depend? Or several (see the ZTF and Beta Lyrae in VizieR examples)? Are all the parameters independent (event list)? All but one (e.g. flux in a regularly sampled cube)? Can we use the relationships between these parameters or axes to transform data from one data type to another one? Providers may want to help users and clients do such things to compare or combine data. I imagine that with separate Cube-with-one-independent-axis-only and Coordinates annotations it will rapidly be a mess for the client to find its way.

msdemlei commented 3 years ago

On Thu, Apr 08, 2021 at 09:41:16AM -0700, Bonnarel wrote:

But we are not dealing with God models when speaking of TimeSeries or sparse Cubes or a Source model with Parameters. We have real [...]

Well, as far as I can work out, the idea is that there is one root node and everything else is then relative to it; this "there's one big class describing the whole of a document" is what I call a God model.

My skepticism toward them is not only aesthetic: having them means that if you don't understand this root node, you can't use any annotation, and that a client that knows how to find, say, a value/error in a time series will have to be taught anew how to do it in an object catalogue (plus trouble with versioning, and much more; there's plenty of good reasons why the God object is considered an antipattern).

Not to mention, of course, that few programmers will appreciate that you're trying to impose your data structures on them.

[...] situations which are meaningful across projects and across wavelengths (or even across messengers), and we want to interoperate them. Rapidly, the way the things organize becomes complex. Very often we find tables with several different positions, times, magnitudes. Is there only one independent time, on which the others, like fluxes or whatever, depend? Or several (see the ZTF and Beta Lyrae in VizieR examples)? Are all the parameters independent (event list)? All but one (e.g. flux in a regularly sampled cube)? Can we use the relationships between these parameters or axes to transform data from one data type to another one? Providers may want to help users and clients do such things to compare or combine data. I imagine that with separate Cube-with-one-independent-axis-only and Coordinates annotations it will rapidly be a mess for the client to find its way.

I like the concrete examples and questions, because with them you can test whether stuff works. And I contend all of the questions are rather straightforwardly answerable by the simple scheme I'm proposing over in https://github.com/msdemlei/astropy.

If you disagree, what sort of workflow do you think won't be covered by it?

Bonnarel commented 3 years ago

My skepticism toward them is not only aesthetic: Having them means that if you don't understand this root node, you can't use any annotation, and that a client that knows how to find, say, a value/error in a time series will have to be taught anew how to do it in an object catalogue. [...]

Well, there is an old consensus in IVOA that we are dealing with "datasets" or "dataproducts" and that dataproduct_type makes sense. A top level model is just some common description of the formal internal relationships between various parts of these data products, consistent with the definition of the dataproduct type. Data providers should succeed in agreeing on what is required and what is optional there. The constraint on application programmers will not come artificially from data modelers, but from data providers' interoperability requirements.

I like the concrete examples and questions, because with them you can test whether stuff works. And I contend all of the questions are rather straightforwardly answerable by the simple scheme I'm proposing over in https://github.com/msdemlei/astropy. If you disagree, what sort of workflow do you think won't be covered by it?

Well, as far as I understood, this works because the raw data are rather simple. But what would happen with a catalogue like this: Shenavrin et al., Astronomicheskii Zhurnal, 2011, Vol. 88, No. 1, pp. 34–85, available in VizieR?

Here obviously there is one single independent time, and the other parameters, including the other times, depend on it. In addition there are several instances of TimeSeries in the same catalogue (because there are several sources). Why should we discover all the times and then discover which one is the independent one in another annotation?

In the following catalog http://vizier.u-strasbg.fr/viz-bin/VizieR?-source=J/ApJ/790/L21&-to=3 . All parameters have the same importance. It's an event list. Why should we not know that from the top?

msdemlei commented 3 years ago

On Wed, Apr 21, 2021 at 12:43:32AM -0700, Bonnarel wrote:

Well, there is an old consensus in IVOA that we are dealing with "datasets" or "dataproducts" and that dataproduct_type makes sense.

...which still makes it desirable that a "Position", say, works the same way regardless of dataproduct type. The little inconsistencies between how, say, SIAP and SSAP deal with positions have been driving our developers (including me) crazy for a long time. Let's not do more of that.

Well, as far as I understood, this works because the raw data are rather simple. But what would happen with a catalogue like this: Shenavrin et al., Astronomicheskii Zhurnal, 2011, Vol. 88, No. 1, pp. 34–85, available in VizieR?

That's https://vizier.u-strasbg.fr/viz-bin/VizieR-3?-source=J%2fAZh%2f88%2f34&-out.max=50&-out.form=HTML%20Table&-out.add=_r&-out.add=_RAJ,_DEJ&-sort=_r&-oc.form=sexa , right?

Here obviously there is one single independent time, and the other parameters, including the other times, depend on it. In addition there are several instances of TimeSeries in the same catalogue (because there are several sources). Why should we discover all the times and then discover which one is the independent one in another annotation?

This is a catalogue, not a time series, unless I'm badly mistaken. What's in there is something like the photometry point we (I think) once had in SDM2 and that might make it to a PhotPoint class in PhotDM2. In which case this would look like this (to avoid shocking tag soup, I'm using SIL annotation (http://docs.g-vo.org/DaCHS/ref.html#annotation-using-sil), but it'll be the same in XML):

(phot2:PhotPoint) {
  value: @Jmag
  epoch: @JDJ
}
(phot2:PhotCal) {
  value: @Jmag
  bandName: J
  spectralLocation: 1.248e-6
}
(meas:Measurement) {
  value: @Jmag
  naiveError: @e_Jmag
  flag: @u_Jmag
}

(where I'm not yet convinced that flag in measurement is a good idea, but see parallel discussion with Gilles).

And so on for the other bands.

Or do I mis-understand your intention here?

In the following catalog http://vizier.u-strasbg.fr/viz-bin/VizieR?-source=J/ApJ/790/L21&-to=3 . All parameters have the same importance. It's an event list. Why should we not know that from the top?

So, that would be

(ds:Dataset) {
  productType: event
}

(ndcube:Cube) {
  independentAxes: [@Arrival]
  dependentAxes: [@Theta, @E, @RAJ2000, @DEJ2000]
}

(stc2:Position) {
  ... (annotation for RAJ2000, DEJ2000)
}

What would you be missing?

Bonnarel commented 3 years ago

Hi Markus,

On 22/04/2021 at 10:30, msdemlei wrote:

On Wed, Apr 21, 2021 at 12:43:32AM -0700, Bonnarel wrote:

Well, there is an old consensus in IVOA that we are dealing with "datasets" or "dataproducts" and that dataproduct_type makes sense.

...which still makes it desirable that a "Position", say, works the same way regardless of dataproduct type. The little inconsistencies between how, say, SIAP and SSAP deal with positions have been driving our developers (including me) crazy for a long time. Let's not do more of that.

OK. But are we not working on something which will make coordinates more stable?

Well, as far as I understood, this works because the raw data are rather simple. But what would happen with a catalogue like this: Shenavrin et al., Astronomicheskii Zhurnal, 2011, Vol. 88, No. 1, pp. 34–85, available in VizieR?

That's https://vizier.u-strasbg.fr/viz-bin/VizieR-3?-source=J%2fAZh%2f88%2f34&-out.max=50&-out.form=HTML%20Table&-out.add=_r&-out.add=_RAJ,_DEJ&-sort=_r&-oc.form=sexa , right?

Yes, but only table 3!

Here obviously there is one single independent time, and the other parameters, including the other times, depend on it. In addition there are several instances of TimeSeries in the same catalogue (because there are several sources). Why should we discover all the times and then discover which one is the independent one in another annotation?

This is a catalogue, not a time series, unless I'm badly mistaken.

Table 3 is indeed a collection of TimeSeries. See screenshot below.

What's in there is something like the photometry point we (I think) once had in SDM2 and that might make it to a PhotPoint class in PhotDM2. In which case this would look like this (to avoid shocking tag soup, I'm using SIL annotation (http://docs.g-vo.org/DaCHS/ref.html#annotation-using-sil), but it'll be the same in XML):

(phot2:PhotPoint) {
  value: @Jmag
  epoch: @JDJ
}
(phot2:PhotCal) {
  value: @Jmag
  bandName: J
  spectralLocation: 1.248e-6
}
(meas:Measurement) {
  value: @Jmag
  naiveError: @e_Jmag
  flag: @u_Jmag
}

I think we were not dealing with the same table. Table 3 has several times and plenty of rows for the same source. But the times are not independent of each other (because at a given time you observe in one band and then you rotate the filter wheel rapidly, and this gives you little time shifts with respect to the "main" time).

(where I'm not yet convinced that flag in measurement is a good idea, but see parallel discussion with Gilles).

And so on for the other bands.

Or do I mis-understand your intention here?

In the following catalog http://vizier.u-strasbg.fr/viz-bin/VizieR?-source=J/ApJ/790/L21&-to=3 . All parameters have the same importance. It's an event list. Why should we not know that from the top ?

So, that would be

(ds:Dataset) {
  productType: event
}

(ndcube:Cube) {
  independentAxes: [@Arrival]
  dependentAxes: [@Theta, @E, @RAJ2000, @DEJ2000]
}

(stc2:Position) {
  ... (annotation for RAJ2000, DEJ2000)
}

I don't think there are dependent axes. It is a list of "VOEvent"-like things which occur at some position, some time, some energy, etc.

All these variables are independent with respect to each other.

What would you be missing?


msdemlei commented 3 years ago

On Thu, Apr 22, 2021 at 08:39:58AM -0700, Bonnarel wrote:

On 22/04/2021 at 10:30, msdemlei wrote:

That's https://vizier.u-strasbg.fr/viz-bin/VizieR-3?-source=J%2fAZh%2f88%2f34&-out.max=50&-out.form=HTML%20Table&-out.add=_r&-out.add=_RAJ,_DEJ&-sort=_r&-oc.form=sexa , right?

Yes, but only table 3!

Ah, ok.

The first part is that here, there would be several time series, so you'd say

(ndcube:Cube) {
  independentAxes: [@JDJ]
  dependentAxes: [@Jmag]
}

(ndcube:Cube) {
  independentAxes: [@JDH]
  dependentAxes: [@Hmag]
}

(ndcube:Cube) {
  independentAxes: [@JDK]
  dependentAxes: [@Kmag]
}

(this part is as for a normalised Gaia time series).

The "would be" above is because again this is highly de-normalised in that the table is (logically) a union of different (sets of) time series. You immediately notice that something is wrong with this table because sorting by Jmag, say, is physically meaningless. So, to make this a (set of) time series, you first need to do a relational selection.

I'm still claiming that if our annotation tries to include relational algebra, we still won't be able to fix the urgent use cases five years from now (don't laugh -- people have felt over-confident in the DM business fifteen and ten years ago, and we still have nothing).

Let's first get the annotation of actual, normalised tables right and then figure out if there's a point for doing structural modifications (normalisation) later.

Excursion: That later spec, I submit, should again refrain from re-inventing relational algebra and just embed a suitable subset of ADQL, perhaps like that:

(table:virtualtable) {
  id: timeseries-betLyr
  definition: "select * from table3 where name='beta Lyr'"
}

(table:virtualtable) {
  id: timeseries-iotaBoo
  definition: "select * from table3 where name='iota Boo'"
}

-- and these things would then need to receive annotations as if they were normal VOTable TABLEs.

But, really, really, let's postpone this. This is complicated, and we'll need years to get it right. And it'll make the standard so complex that we'll have a hard time getting people on board. Let's not forget the STC-1 lesson.

And until then, we can still annotate table3 with photometry, time metadata, and everything. It's just that users will have to manually pick out time series for the individual objects.
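
(For what it's worth, that manual step is short in, e.g., astropy; a sketch, where the file name is a stand-in and the name/JDJ columns are those of the table 3 example discussed here:)

from astropy.table import Table

# Read the de-normalised catalogue and pull out one object's time series
# by hand -- the relational selection the annotation deliberately omits.
table3 = Table.read("table3.vot", format="votable")   # stand-in file name
beta_lyr = table3[table3["name"] == "beta Lyr"]
beta_lyr.sort("JDJ")   # now sorting by time is physically meaningful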

In the following catalog http://vizier.u-strasbg.fr/viz-bin/VizieR?-source=J/ApJ/790/L21&-to=3 . All parameters have the same importance. It's an event list. Why should we not know that from the top?

So, that would be

(ds:Dataset) {
  productType: event
}

(ndcube:Cube) {
  independentAxes: [@Arrival]
  dependentAxes: [@Theta, @E, @RAJ2000, @DEJ2000]
}

(stc2:Position) {
  ... (annotation for RAJ2000, DEJ2000)
}

I don't think there are dependent axes. It is a list of "VOEvent"-like things which occur at some position, some time, some energy, etc.

All these variables are independent with respect to each other.

Well, of course -- each dependent axis just depends on the independentAxes, not on each other. That's the definition. So, I'm not sure what your objection aims at -- perhaps if you stated what the annotation you are missing (?) would be enabling I could be a bit more specific?

In the way of an example, the use case for the present annotation is: "A plotting component wants to present the user with a sensible choice of what to plot against what" -- in this case, you'd always have Arrival on the abscissa, and any of Theta, E, or the coordinates on the ordinate. A particularly smart plotting programme would notice that RA and Dec make up a position and would offer to bring up a plot type that draws positions over time when the user wants to plot any of RA or Dec.

lmichel commented 3 years ago

Just some thoughts on the impact of model changes. The @msdemlei client README makes the assumption that the VOTable is delivered with distinct annotations for 2 different versions of the same model (Coords). I do not think that this case is the most likely, because the annotation process is a very tough job (ask @gilleslandais) and I doubt that data curators will duplicate their efforts to support multiple variants of the same model.

The most likely situation is to have a client trying to put together (e.g. xmatch) data sets annotated with different versions of the same model.

This is a critical point that cannot be worked around just by using un-entangled models.

Bonnarel commented 3 years ago

Hi Markus,

On 23/04/2021 at 08:50, msdemlei wrote:

The first part is that here, there would be several time series, so you'd say

(ndcube:Cube) {
  independentAxes: [@JDJ]
  dependentAxes: [@Jmag]
}

[...]

(this part is as for a normalised Gaia time series).

If you read the paper, I don't think we can consider this as several monocolor time series. Of course we have as many TimeSeries as we have astronomical sources.

It is a complex TimeSeries with one independent Time (the one you want) and several complex "Parameters" made of Time and magnitude.

All the times on a row are very close to each other just because they are part of the same "observation", where you rotate the filter wheel.

If you look at this as a TimeSeries of a set of 5 such Parameters, there is a degeneracy in the time chosen as independent: it is mapped once as the independent time and once as its colour-dependent time.

The "would be" above is because again this is highly de-normalised in that the table is (logically) a union of different (sets of) time series. You immediately notice that something is wrong with this table because sorting by Jmag, say, is physically meaningless. So, to make this a (set of) time series, you first need to do a relational selection.

We cannot say they are wrong; I think this is one of the prerequisites of the workshop.

If you read the paper, what they have done makes sense.

Take the tables as they are and try to annotate them with a sufficiently smart model.

I'm still claiming that if our annotation tries to include relational algebra, we still won't be able to fix the urgent use cases five years from now (don't laugh -- people have felt over-confident in the DM business fifteen and ten years ago, and we still have nothing).

I was not talking about relational algebra (apart from selecting the source) on this one, but about trying to have a TS model able to manage such a use case.

Besides the independent TimeSeries we need structured dependent Properties. I think it's possible to do that using Mango Parameters.

Let's first get the annotation of actual, normalised tables right and then figure out if there's a point for doing structural modifications (normalisation) later.

Excursion: That later spec, I submit, should again refrain from re-inventing relational algebra and just embed a suitable subset of ADQL [...] -- and these things would then need to receive annotations as if they were normal VOTable TABLEs.

Well, the proposal in ModelInstanceIntoVOT is to do that using "GROUP_BY".

This GROUP_BY helps to define Instances (which may be instances of complex classes) below.

But, really, really, let's postpone this. This is complicated, and we'll need years to get it right. And it'll make the standard so complex that we'll have a hard time getting people on board. Let's not forget the STC-1 lesson.

And until then, we can still annotate table3 with photometry, time metadata, and everything. It's just that users will have to manually pick out time series for the individual objects.

I think Laurent's code allows extracting the TimeSeries of one source just by using this "GROUP_BY" feature.

If we don't try to map these things, will we not force authors and providers to shape their tables as WE want them to?

In the following catalog http://vizier.u-strasbg.fr/viz-bin/VizieR?-source=J/ApJ/790/L21&-to=3 . All parameters have the same importance. It's an event list. Why should we not know that from the top? [...]

Well, of course -- each dependent axis just depends on the independentAxes, not on each other. That's the definition. [...] A particularly smart plotting programme would notice that RA and Dec make up a position and would offer to bring up a plot type that draws positions over time when the user wants to plot any of RA or Dec.

I don't think Time has anything special in this example. All parameters are INDEPENDENT.

You could also decide to see this as a multidimensional Sky Cube where the values are time, energy and theta.


msdemlei commented 3 years ago

On Mon, Apr 26, 2021 at 01:55:28AM -0700, Bonnarel wrote:

On 23/04/2021 at 08:50, msdemlei wrote:

(ndcube:Cube) {
  independentAxes: [@JDJ]
  dependentAxes: [@Jmag]
}

[...]

(this part is as for a normalised Gaia time series).

If you read the paper, I don't think we can consider this as several monocolor time series. Of course we have as many TimeSeries as we have [...]

Why not? And what else would it be instead? How would you expect a client to deal with that thing?

It is a complex TimeSeries with one independent Time (the one you want) and several complex "Parameters" made of Time and magnitude.

I'm afraid I still can't see what kind of code would consume an annotation saying about that much. Because you see, if there's no code to consume it, we shouldn't bother marking it up machine-readably (but that's just an excursion: I think there is a clear, machine-readable interpretation of this data; it's just that the table structure chosen is such that you need relational operations to pull it out, and thus my plea is to postpone this until we can at least reliably annotate actual ("normalized") tables).

All the times on a row are very close to each other just because they are part of the same "observation", where you rotate the filter wheel.

The data providers could choose to have one time and then magnitude and delta-t as dependent axes, true. Both choices are rational and perfectly annotatable without relational algebra (and in particular with my proposal).

What makes this table denormalised is that it is not a time series, it is several timeseries mixed together. I don't think it's surprising that such a thing cannot be annotated as a time series.

The "would be" above is because again this is highly de-normalised in that the table is (logically) a union of different (sets of) time series. You immediately notice that something is wrong with this table because sorting by Jmag, say, is physically meaningless. So, to make this a (set of) time series, you first need to do a relational selection.

We cannot say they are wrong; I think this is one of the prerequisites of the workshop.

If you read the paper, what they have done makes sense.

That's not the point. The point is that their table is not a time series, and trying to somehow pretend it is is going to blow up our standard, both in time (which we don't have, 20 years into the VO and still unable to say "this is ra, dec in ICRS") and complexity (which comes at a high price, as evidenced by the failure of STC1).

I'm still claiming that if our annotation tries to include relational algebra, we still won't be able to fix the urgent use cases five years from now (don't laugh -- people have felt over-confident in the DM business fifteen and ten years ago, and we still have nothing).

I was not talking about relational algebra (apart from selecting the source) on this one, but about trying to have a TS model able to manage such a use case.

Oh, but the operations you need to re-structure the tables are relational algebra, and once you start re-structuring tables in your annotation, you will re-discover all of Codd (1970): selection, projection, grouping, inner joins, left and right outer joins, complex conditions: You'll need it all, and we'll be busy for the next couple of years.

And given the experience with ADQL 1 (which was about the same thing: expressions of relational algebra written in XML) I expect in the end we will hate it and do what we did in ADQL 2: just use some subset of SQL as everyone else uses it.

If we don't try to map these things, will we not force authors and providers to shape their tables as WE want them to?

It's the other way round: Data providers are asking us: "How should we write our data such that topcat, astropy, and whatever else can optimally read them". And we should give them sound advice. Which is: Normalise your data, do not produce things with metadata that changes per row. That will make it a lot easier on everyone, them, their consumers, our implementors, and us as well.

If data providers see that machines will then understand their data a lot better, they won't feel "forced", they will feel well-advised. And for a good reason, because "avoid per-row metadata" is about the soundest advice you could give anyone writing tables.

Eventually being able to annotate legacy, de-normalised data so it can be processed with modern, VO-compliant clients is then a nice thing, but it's not nearly as urgent. Also, that effort becomes a lot more attractive when those DM-aware clients exist. Which they don't right now, and which they won't until we put out a nice, easily implementable standard that lets them do interesting things.

http://vizier.u-strasbg.fr/viz-bin/VizieR?-source=J/ApJ/790/L21&-to=3

All parameters have the same importance. It's an event list. Why should we not know that from the top?

(ds:Dataset) {
  productType: event
}

(ndcube:Cube) {
  independentAxes: [@Arrival]
  dependentAxes: [@Theta, @E, @RAJ2000, @DEJ2000]
}

I don't think Time has anything special in this example. All parameters are INDEPENDENT.

You could also decide to see this as a multidimensional Sky Cube where the values are time, energy and theta.

Ok -- given the context I had expected this to be a time series. You're right, physically, there's (almost) no point presenting it as one. It's an event list, and, frankly, I'd just annotate a position in there. What other annotation should a client use? What functionality would be enabled/triggered by that annotation?

Bonnarel commented 3 years ago

On 26/04/2021 at 15:48, msdemlei wrote:

I'm afraid I still can't see what kind of code would consume an annotation saying about that much. Because you see, if there's no code to consume it, we shouldn't bother marking it up machine-readably [...]

But the MapIntoVOT interpreter by Laurent can create JSON instances with that, at least.


What makes this table denormalised is that it is not a time series, it is several timeseries mixed together. I don't think it's surprising that such a thing cannot be annotated as a time series.

Ah, this is a matter of definition: I thought that TimeSeries were things where "something" (whatever it is) was varying with time.

If that is not the case and we are only dealing with "single-scalar" curves, then the problem is really simplified.

And the two views, entangled or not, are not that different in the simple case.

But real life shows more complex things, and a small extension of Mango allowed us to encompass such complex TimeSeries as this table 3 example.

The "would be" above is because again this is highly de-normalised in that the table is (logically) a union of different (sets of) time series. You immediately notice that something is wrong with this table because sorting by Jmag, say, is physically meaningless. So, to make this a (set of) time series, you first need to do a relational selection.

We cannot say they are wrong, I think this is one of the prerequisite of the workshop.

If you read the paper what they have done makes sense.

That's not the point. The point is that their table is not a time series, and trying to somehow pretend it is is going to blow up our standard, both in time (which we don't have, 20 years into the VO and still unable to say "this is ra, dec in ICRS") and complexity (which comes at a high price, as evidenced by the failure of STC1).

I'm still claiming that if our annotation tries to include relational algebra, we still won't be able to fix the urgent use cases five years from now (don't laugh -- people have felt over-confident in the DM business fifteen and ten years ago, and we still have nothing).

I was not about relational algebra (apart from selecetin the source) on this one but trying to have a TS model able to manage such use case

Oh, but the operations you need to re-structure the tables are relational algebra, and once you start re-structuring tables in your annotation, you will re-discover all of Codd (1970): selection, projection, grouping, inner joins, left and right outer joins, complex conditions: You'll need it all, and we'll be busy for the next couple of years.

And given the experience with ADQL 1 (which was about the same thing: expressions of relational algebra written in XML) I expect in the end we will hate it and do what we did in ADQL 2: just use some subset of SQL as everyone else uses it.

If we don't try to map these things will we force authors and providers to shape their tables as WE want them to do ?

It's the other way round: Data providers are asking us: "How should we write our data such that topcat, astropy, and whatever else can optimally read them". And we should give them sound advice. Which is: Normalise your data, do not produce things with metadata that changes per row. That will make it a lot easier on everyone, them, their consumers, our implementors, and us as well.

If data providers see that machines will then understand their data a lot better, they won't feel "forced", they will feel well-advised. And for a good reason, because "avoid per-row metadata" is about the soundest advice you could give anyone writing tables.

Eventually being able to annotate legacy, de-normalsed data so it can be processed with modern, VO-compliant clients is then a nice thing, but it's not nearly as urgent. Also, that effort becomes a lot more attractive when those DM-aware clients exist. Which they don't right now, and which they won't until we put out a nice, easily implementable standard that lets them do interesting things.

http://vizier.u-strasbg.fr/viz-bin/VizieR?-source=J/ApJ/790/L21&-to=3

. All parameters have the same importance. It's an event list. Why should we not know that from the top ?

(ds:Dataset) {
(ndcube:Cube) {
independentAxes: [@Arrival]
dependentAxes: [@Theta, @E, @RAJ2000, @DEJ2000]
}

I don't think Time has something special in this example. All parameters are INDEPENDANT,

You could also decide to see this as a  multidimensional Sky Cube where values are time, energy and teta

Ok -- given the context I had expected this to be a time series. You're right, physically, there's (almost) no point presenting it as one. It's an event list, and, frankly, I'd just annotate a position in there. What other annotation should a client use? What functionality would be enabled/triggered by that annotation?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ivoa/dm-usecases/issues/24#issuecomment-826849646, or unsubscribe https://github.com/notifications/unsubscribe-auth/AMP5LTCMHMWPXOOJWS6CCIDTKVVJZANCNFSM4ZXO3FWQ.

lmichel commented 3 years ago

Oh, but the operations you need to re-structure the tables are relational algebra, and once you start re-structuring tables in your annotation, you will re-discover all of Codd (1970):

We have just re-discovered use cases:

These 3 statements each correspond to a specific use case (the Gaia and ZTF time series, and the combined data). I assume there were good reasons to design these datasets as they are, and it seems fair to propose a solution that makes them interoperable. This solution is rather light, by the way (a couple of XML elements that do not break the mapping structure).

msdemlei commented 3 years ago

On Fri, Apr 23, 2021 at 08:37:26AM -0700, Laurent MICHEL wrote:

The @msdemlei client README makes the assumption that the VOTable is delivered with distinct annotations for 2 different versions of the same model (Coords). I do not think this case is the most likely, because the annotation process is a very tough job (ask @gilleslandais), and I doubt that data curators will duplicate their efforts to support multiple variants of the same model.

Perhaps not, but for instance in cases of "oh, our serialisation library supports both anyway" they may. And they will, in particular, if it gives them a tangible benefit ("Oh, we just have to add this instance declaration and our data still works with old TOPCATs").

But that, really, is a side show. The main point, and I'll not tire of stressing this, is that DMs can evolve independently. I'll try again: suppose we have phot2. Since timeseries1 contains references into phot, which we then no longer have, it will have to become timeseries2 even if nothing changes anywhere else. The thing I'm worrying most about is that in such a world, a timeseries1 client can't even find an entirely unchanged coords annotation any more.

In contrast, with isolated DMs, even if there is only phot2 (and not phot), a client that knows coords will still be able to work out which coordinates belong together and that they're in ICRS.

Given that the entanglement comes at such a high cost, it would have to bring us a great benefit. But, frankly, I'm not seeing even a minor one at this point.

The most likely situation is to have a client trying to put together (e.g. xmatch) data sets annotated with different versions of the same model.

  • How to cross match data annotated with CoordsV3 against data annotated with CoordsV4?
  • How to cross match data annotated with CubeV3 against data annotated with CubeV4?

This is a critical point that cannot be worked around just by using un-entangled models.

That assumes that the programmes just de-serialise your models and then work with that data structure. I really, really don't think that will be the case. I'm saying this as someone who really has tried keeping the STC1 data structure in his code.

You see, data structures are at the root of any non-trivial programme. It's hubris if we think we can make these choices for people.

What will in general happen is that people map our annotation into their internal data structures. That's what we ought to make easy.

And then the question just goes away: A programme "knows" a DM if it can map its annotation into its data structures. So, a programme that knows both coordsv3 and coordsv4 can fill its internal data structures from both annotations (and will, unless we mess it up, prefer coordsv4 if it can get it). And then it will work in its preferred way.
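
As an illustration of that mapping step, here is a sketch with invented dmtypes and a made-up internal structure (nothing here is normative):

```python
from dataclasses import dataclass

@dataclass
class SkyPosition:
    """The client's own internal structure -- not anything we define."""
    ra: str
    dec: str
    frame: str

def read_position(annotations):
    """Fill the internal structure from whichever coords version is
    present, preferring the newer one when both are annotated."""
    for dmtype in ("coordsv4:Position", "coordsv3:Position"):  # invented names
        for ann in annotations:
            if ann.get("dmtype") == dmtype:
                return SkyPosition(ann["ra"], ann["dec"], ann["frame"])
    return None

# A document annotated with both versions: the client silently prefers
# coordsv4 and never even notices that coordsv3 is also there.
annotations = [
    {"dmtype": "coordsv3:Position", "ra": "RAJ2000", "dec": "DEJ2000", "frame": "ICRS"},
    {"dmtype": "coordsv4:Position", "ra": "RAJ2000", "dec": "DEJ2000", "frame": "ICRS"},
]
print(read_position(annotations))
```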

I'm not sure what you mean by cross matching cubev3 and cubev4; what functionality do you have in mind for that?

lmichel commented 3 years ago

I understand your point of view and I do not underestimate your arguments in favor of using individual model elements. You ask the right question: what is the advantage of using integrated models (sorry, but entangled is a bit pejorative)?

Let me recap my points, already set out some time ago:

I'm not sure what you mean by cross matching cubev3 and cubev4; what functionality do you have in mind for that?

Cross processing (doing anything with both datasets together) would be more appropriate

msdemlei commented 3 years ago

On Wed, Apr 28, 2021 at 01:15:21AM -0700, Laurent MICHEL wrote:

I understand your point of view and I do not underestimate your arguments in favor of using individual model elements. You ask the right question: what is the advantage of using integrated models (sorry, but entangled is a bit pejorative)?

This is probably a good place to summarise the main arguments on this question in the run-up to the workshop, so let me point out why I think these benefits do not outweigh the high cost of having dependencies between the models:

  • A model integrating components from other models will ensure that all of those components are consistent with each other (vocabulary, roles at least) and that there is no risk of confusion (RFC validation).

Isolated models of course should have no dependencies between them, and thus there should be no question of them being consistent with each other. This being the real world, I admit we perhaps won't always get away with that, but I maintain it is prudent to at least try hard.

And sure, there are unavoidable dependencies that we will simply have to model as such, biting that bullet. But the design goal should be to minimise them, because no amount of RFC validation will cope with the interdependencies between 10 models when one starts to move.

I'd be happy to discuss this further on a concrete example in which you see a danger of inter-model inconsistencies.

  • Integrated models can describe complex datasets. I would be very sorry to have to tell some data provider: sorry, but your dataset is too complex, I cannot annotate it. Examples:

I think the reverse is true: If you have isolated models, you can annotate positions, photometry, provenance, whatever, in arbitrarily complex data that you simply cannot model interoperably (and it wouldn't make any sense either, since they're one-of-a-kind).

Still, your annotation will just work: Clients will be able to make out positions or photometry and do the right thing with them.

If, on the other hand, you just have a few "integrated" models (cube, source), and something doesn't quite fit, you're altogether out of luck, even if a, say, time in there would of course profit from an annotation as to what its time scale and reference position is.

That, incidentally, is what my minimal use cases should have expressed.

  • detections attached to sources
  • Same properties but with different frames
  • Multi-objects data tables

I still have not seen a single use case on a normalized table that isolated data models couldn't cope with (nb. the question of having relational transformations as part of the annotation is distinct from the question at hand here). So, after my last few paragraphs, I'd say the chances that you can add meaningful annotation to complex datasets are higher with isolated data models.

3- Instances of integrated data models can be shared among different peers, e.g. sending Mango serializations of individual catalogue rows by SAMP.

For one, you don't need integrated models for that -- just transmit whatever annotations you deem necessary. But then, I'm rather skeptical of this use case altogether, as it is perilously close to what the existing SAMP Mtype table.highlight.row does, and I think it's rarely a good idea to have two distinct mechanisms when the use cases are hardly different (if at all).

4- Integrating components from other models does not mean dissolving them in the host model. They remain usable as such, since they keep both their integrity and their dmtypes even within an integrated model. Your client strategy can be applied to integrated data models as well (a matter of XQuery). Somewhere in my code I have a search_element_by_type(dmtype) method able to retrieve your coords2 class in any arbitrarily nested model.

As long as that is true, I am almost happy. I am not entirely happy because we would still have the time bomb of the integrated models ticking in our system, waiting to blow the whole thing up when we do have to have a major version change in one of our data models.
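
For the record, here is a minimal sketch of what such a search_element_by_type(dmtype) lookup might look like, assuming the annotation tree is held as nested dicts and lists (the actual implementation referred to above may of course differ):

```python
def search_element_by_type(node, dmtype):
    """Depth-first search for the first element carrying the given
    dmtype in an arbitrarily nested annotation tree."""
    if isinstance(node, dict):
        if node.get("dmtype") == dmtype:
            return node
        for value in node.values():
            found = search_element_by_type(value, dmtype)
            if found is not None:
                return found
    elif isinstance(node, list):
        for item in node:
            found = search_element_by_type(item, dmtype)
            if found is not None:
                return found
    return None
```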

Hence, for the workshop participants' benefit, let me briefly recap a few of the advantages of having small-ish, isolated DMs:

(1) Lesson from STC-1: data providers can adopt DMs one at a time (say, Coords first, then Meas, then Dataset, then Cube) -- and they get a bit more functionality in some client for each such annotation. People will panic if they see DMs with 70 boxes, in particular if there's no clear benefit in digging in. They won't panic when there are just 5 boxes, and doing something with them already gives them some nice goodie in TOPCAT.

(2) Lesson from TAP validation: Look at slide 23 of the last euro-vo registry weather report: https://wiki.ivoa.net/internal/IVOA/InterOpNov2020Ops/20201118-Euro-VOResourcesValidationStatus.pdf -- essentially all TAP services in the VO are marked invalid (the yellow and red areas). Mind you: Almost all of these services work just fine when you go there with TOPCAT or pyVO. There's often just one little interoperability problem that nobody will stumble into in years that makes them fail.

An "integrated" DM has about the same complexity as TAP. There's no reason to believe it'll behave differently from TAP, and so you'll have a huge lake of invalid annotations where nobody can say if it's just a small detail that's wrong or if it's utter breakage.

So, let's plan instead such that valid instances are (a) humanly possible (i.e., in particular: small), (b) halfway stable, and (c) there's a small damage radius, i.e., a problem in one annotation doesn't make the whole thing invalid. Instead, a validator would report: "valid Coords, invalid Photometry, valid NDCube, invalid Dataset". And people would have a good idea what sort of functionality they can rely on and what's a little more shaky. (A sketch of such a per-model validator follows after this recap.)

(3) Separate evolvability (that's discussed in some depth elsewhere, and we ignore it at our successors' peril).

(4) Flexibility and Consistency: A client that knows how to deal with Coords annotation can do that whether these Coords are in a time series, a catalogue, or a SIAP response -- always in the same way. And you don't need anything else at all: if it's a position, you can slap Coords annotation on it, and it'll just work. I'm not saying this isn't possible at all with your integrated models, it's just that I've not yet been able to actually write code that does this using them (help is gratefully appreciated).

(5) Small, isolated models, I claim, are what most programmers want: let's give them Lego bricks rather than pre-assembled models. The Lego bricks are going to fit much better into their programmes.
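
Coming back to point (2), here is a sketch of the per-model validation reporting described there, with a stand-in validator and invented example instances:

```python
def validate(instance):
    """Stand-in for a real per-model validator; a real one would check
    roles and types against a single, small model document."""
    return bool(instance) and "dmtype" in instance

annotations = {                       # invented example instances
    "Coords": {"dmtype": "coords:Position"},
    "Photometry": {},                 # deliberately broken
    "NDCube": {"dmtype": "ndcube:Cube"},
    "Dataset": {"dmtype": "ds:Dataset"},
}

# Each model is validated on its own, so the damage radius stays
# small: one broken annotation does not invalidate the others.
for model, instance in annotations.items():
    status = "valid" if validate(instance) else "invalid"
    print(f"{status} {model}")
```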

lmichel commented 3 years ago

This is probably a good place to summarise...

sure

interdependencies between 10 models

Little exaggeration?

Our positions are not converging at all, there is no need to run a new discussion loop.

I would just like to repeat what I wrote 2 weeks ago: technically, the actual annotation scheme works for any model granularity, hence your proposal is really to ask the VO not to RECommend integrated models. This is what I do not agree with.

lmichel commented 3 years ago

Hence, for the workshop participants' benefit, let me briefly recap a few of the advantages of having small-ish, isolated DMs:

My answers

(1) Lesson from STC-1...

Not really applicable here.

(2) Lesson from TAP validation:

The comparison with TAP is unfair. TAP is a complete database infrastructure encompassing all VO fields: nothing to do with a simple measure container such as MANGO.

(3) Separate evolvability

That's discussed in some depth elsewhere

(4) Flexibility and Consistency:

I agree with you, our 2 approaches do work.... for the simplest cases. Have you tried to figure out (on paper at least) whether your code design could be applied to a MANGO mapping?

(5) Small, isolated models

As the impact of model changes has already been discussed many times here, I prefer to have a little fun with your Lego metaphor: when I was young, I spent a lot of time playing with Lego bricks. At that time Lego was mostly sold as boxes of bricks, but year after year the company has marketed more and more complex (entangled) objects (Star Wars ships, robots..) with growing success. Just to warn you against this sort of comparison :-)

msdemlei commented 3 years ago

Let's continue most of this at the workshop(s), but just one thing:

On Wed, May 05, 2021 at 07:48:39AM -0700, Laurent MICHEL wrote:

(4) Flexibility and Consistency:

I agree with you, our 2 approaches do work.... for the simplest cases. Have you tried to figure out (on paper at least) whether your code design could be applied to a MANGO mapping?

I believe most of MANGO isn't so far from what I'm proposing; as far as I can see, you only need to drop the top-level grouping, and you'll have your isolated annotations.

But it's much more productive to talk about use cases: so, in the run-up to the workshop it would be good to have one or two of these complex use cases that might not work with isolated models. Any chance you could give a sketch of them in the form of "A client wants to do X, and it needs Y and Z for that"?

When I was young, I spent a lot of time playing with Lego bricks. At that time Lego was mostly sold as boxes of bricks, but year after year the company has marketed more and more complex (entangled) objects (Star Wars ships, robots..) with growing success. Just to warn you against this sort of comparison :-)

Well, you got me there on the marketing. But then we don't have the advertising budget of Lego, and hence I still believe that if we want folks to buy our stuff, we'll have to follow the recipes Lego used back when they didn't have those budgets either...

lmichel commented 3 years ago

I believe most of MANGO isn't so far from what I'm proposing...

This is true as long as you do not have associated data. This is also why I keep repeating that our difference on this topic is a matter of XPath. For the workshop I'll especially insist on the complex data. I'm even more motivated for this since yesterday, when we had a long meeting with exoplanet people who are asking for the mapping of highly connected data and even for JSON serializations.

Well, you got me there on the marketing ..

:thumbsup: