Baseline use cases - Githubissues

ivoa / dm-usecases

The is repo gathers all the material to be used in the DM workshop 2020

The Unlicense

1 stars 3 forks source link

Baseline use cases #36

Open msdemlei opened 3 years ago

msdemlei commented 3 years ago

I've always assumed the following has been the (rough) consensus agenda behind the effort, but I'm starting to realise that that is not necessarily the case, so I'm trying to raise the following (IMHO) baseline use case as an explicit issue.

"Let a client figure out non-VOTable metadata for a column independently of whether they are in a time series, a spectrum, or some generic table."

Examples (all assuming "if present in the table", of course):

If I see an ra, figure out the corresponding dec
If I have an ra/dec, figure out the reference system, the reference position, the epoch
If I have a position, figure out an associated proper motion
If I have any sort of quantity, figure out the/a error (and, eventually its sort, perhaps one day distribution paraemters, etc)
If I have a flux or magnitude, figure out the band it's in, perhaps a zeropoint, etc.

Again, and that is important: Clients must not need to know a full data model (e.g., "timeseries") to do that. The rationale behind that is that client writers will curse us if they will have to re-implement "find an error for a value" again and again just because there's a new product type. And the data providers will curse us if they cannot annotate a value/error pair just because it's, say, a redshift and there's no "data model" for that.

I'll note here I will have a very hard time singing off to anything that does not show code that actually lets clients do these things and that will credibly make it to astropy and/or STIL.

I'll shamelessly link again to https://github.com/msdemlei/astropy, which is an astropy fork that can pull such information from annotation produced right now by DaCHS' time series code. An analogous astropy (or STIL, if you prefer) fork for the other proposals would really help to give me some confidence in them.

I'll freely admit that my fork clearly isn't ready for PR yet (it was hacked in 1 1/2 afternoons). But I claim, can be made ready for a PR with ~ a week of concentrated work, with write and programmatic annoation support possible in perhaps another week, and something like that would of course be enough to convince me of the other proposal's viability (as long as all the features proposed are exercised, which is particularly true for the SQL-in-XML part if you do insist on that).

Sorry for being a bit obstinate here, but I have apologised far too often that 20 years into the VO we still cannot give a reliable recipe for "what's the dec for this ra", and if we again fail to create that it would be a real shame.

After the introduction: Do people roughly agree these are use cases that our annotation, and the models, ought to support?

gilleslandais commented 3 years ago

I am agree that more simple is the output more it will be adopted by clients. However, I don’t think that just gather a quantity with its error is enough – in the example I sent previously (https://github.com/ivoa/dm-usecases/wiki/mango) , I listed some examples of tables having columns with associated error, flags, ref..

why Mango ? For me, the most important is that Mango is capable to expose measures – measure which can be composed by a column and associated columns like error, flags, etc. (and each measure even if it is a collection of columns is defined with a couple UCD+semantic – easily understandable by clients). It is well adapted for VizieR tables – enough generic also. Is it also adapted for other data-centers ?

With Mango, you can annotate with generic measure or you can use a DataModel (in my examples , I use the both: for position and photometry I use DM, and for other quantities GenericMeasure). For me the question is more: what are the (trusted) DM expected in Mango ?

Note that I am not a fan of “everything in DM” because my feeling is that they are more complex to manage/parse – but a minimum why not ? (may be needed for photometry and always with the condition that nothing is required)

About complex structure (like timeseries): for me it is a bonus – possible with Mango!

You have an astropy version to parse vodmlinstance ? That is a good new - It could be use to demonstrate the benefit of the current work – could we imagine a mango extension possible :-) (limited in a first step to the Mango core) ?

msdemlei commented 3 years ago

On Fri, Apr 16, 2021 at 09:25:12AM -0700, gilleslandais wrote:

I am agree that more simple is the output more it will be adopted by clients. However, I don’t think that just gather a quantity with its error is enough – in the example I sent previously (https://github.com/ivoa/dm-usecases/wiki/mango) , I listed some examples of tables having columns with associated error, flags, ref..

Yeah, but as I wrote there: we need concrete use cases to figure out if and how to annotate such columns. Annotation is there for clients, and if they can't do anything with the annotation, it's worse than useless (because it may lead to confusion when, at a later time, an actual use case is identified; cf. caproles).

Let me try for these flag-like columns:

(1) Error-like flags (e.g., "uncertain"). These should probably be dealt with in Meas; doing anything machine-readably with them might, however, be so hard that it may not be worth it.

(2) For anything else, the only real use case I can imagine would be "When showing a subset of columns, show and hide this set of columns as one". This sounds like "presentation-oriented" annotation, for which we so far have to DM. Perhaps there should be one?

What other cases are there for these groups? What purpose would the machine-readable annotation have for them?

Oh, and also note that for VizieR-internal use (where interoperable clients can just look the other way) you can always add private annotation like, perhaps:

<INSTANCE dmrole="viz-pres:ColGroup">
  <ATTRIBUTE dmrole="header" value="All about V mag"/>
  <ATTRIBUTE dmrole="group_leader" ref="vmag"/>
  <ATTRIBUTE dmrole="group_members">
    <COLLECTION>
      <ITEM ref="vmag_quality"/>
      <ITEM ref="vmag_source"/>
      <ITEM ref="vmag_err"/>
      <ITEM ref="vmag_flag"/>
    </COLLECTION>
  </ATTRIBUTE>
</INSTANCE>

why Mango ? For me, the most important is that Mango is capable to expose measures – measure which can be composed by a column and associated columns like error, flags, etc. (and each measure even if it is a collection of columns is defined with a couple UCD+semantic – easily understandable by clients). It is well adapted for VizieR tables – enough generic also. Is it also adapted for other data-centers ?

Well, if, as you say, machines should be aware of these things when processing a measurement, then I argue whatever it is should go into Meas -- if it's needed interoperably at all.

With Mango, you can annotate with generic measure or you can use a DataModel (in my examples , I use the both: for position and photometry I use DM, and for other quantities GenericMeasure).

Yes -- except I still don't understand what you win by saying "this is a special measurement that can have photometry metadata" over just attaching the PhotDM metadata when it's there to a normal Measurement. My usual question: what can a machine do with this extra annotation that it cannot do with just the PhotDM annotation?

For me the question is more: what are the (trusted) DM expected in Mango ?

You mean: what DMs can a client rely on? I am totally sure that no client can ever rely on any particular annotation in everything it sees.

As I point out in https://github.com/msdemlei/astropy (look at the README), it's a lot better if clients look for annotation they need when they need it (and fail in a defined way when it's not there). And of course you can convey the notion of, say, time series in that scheme rather plainly (look at the time-series, md-proposal).

About complex structure (like timeseries): for me it is a bonus – possible with Mango!

...as with anything else currently discussed.

You have an astropy version to parse vodmlinstance ? That is a good

Well, not the gruesome 25-element proposal, but something with six new elements (MODEL and children should go, too: they're just INSTANCE-s, too). And yes, that's what https://github.com/msdemlei/astropy is. Again, see its README for how this intended to be used.

new - It could be use to demonstrate the benefit of the current work – could we imagine a mango extension possible :-) (limited in a first step to the Mango core) ?

Since I have doubts there is such a benefit, I'd love to look into this. What kind of functionality would you be interested in?

lmichel commented 3 years ago

I believe that the use-case agenda hasn't changed:

Showing whether MCT/PhotDM cover our need
Adopting a mapping strategy
Adopting a mapping syntax

Honestly, IMO putting forward a library prototype does not help that much for the discussion.

I won't rehash my argument in this post but I'll ask the true question which relates to the exact scope of the Markus's proposal.

The annotation block starts with a list of the models on which data are mapped.

If that list contains Meas and Coord, clients will understand they can find both Meas and Coord instances in the mapping block.
If that list contains Meas, Coord and Cube, clients will understand they can find Meas , Coord and Cube instances in the mapping.
- They are is free to ignore classes of one of these models (likely Cube for the Markus's client)
- The syntax is clean enough to easily retrieve any object by type or role or both (e.g. looking for POSITION instances within a CUBE). It's a matter of XPath.

This clearly shows that technically both approaches can live together. I come to the conclusion that the real question asked by @msdemlei is to say whether the IVOA have to get ride of integrated models (CUBE, MANGO,..) or not.

If YES: this must be set as a headline of the workshop.
If NO: let's consider that both are valid and let's refine the syntax to facilitate the job for all. This question is not only a matter of annotation

msdemlei commented 3 years ago

On Mon, Apr 19, 2021 at 05:57:02AM -0700, Laurent MICHEL wrote:

I believe that the use-case agenda hasn't changed:

Showing whether MCT/PhotDM cover our need

Adopting a mapping strategy

Adopting a mapping syntax

Honestly, IMO putting forward a library prototype does not help that much for the discussion.

Well, it would for me; as I said, I cannot see at this point how ordinary clients will consume these complex data models in the VO (which is a slow and dirty place of long-lived services and clients) to satisfy these simple use cases.

But if I get shown plausible ways how they can do that with a reasonable effort, I'll stop whining immediately.

By the way, writing such a thing is effort well spent: After all, we're doing all this for the consumers, and seeing how they'll do the consuming in common cases is very informative all around (let me also mention that getting an impression of the ease of implementation is a good thing, too).

The annotation block starts with a list of the models on which data are mapped.

If that list contains Meas and Coord, clients will understand they can find both Meas and Coord instances in the mapping block.

If that list contains Meas, Coord and Cube, clients will understand they can find Meas , Coord and Cube instances in the mapping.

Oh... you mean, clients should go for the MODEL declarations? Frankly, I doubt that's a good idea, because a single model may contain all kinds of classes, and a model declaration as such doesn't tell you which of them might be used (since I've just played with it today, see ProvDM: what would a client learn from knowing there's ProvDM annotation in a VOTable?).

Conversely, I'd be interested in your opinions about the API I'm proposing over at https://github.com/msdemlei/astropy (get_annotations(dmtype), iter_annotations()). Are you saying something like that won't work for MANGO? Or just isn't suitable?

They are is free to ignore classes of one of these models (likely Cube for the Markus's client)

Well, I'd put it the other way round: They'll look at the annotation they need for a given task; and of course "my" client would look at cube annotation, too, when figuring out what to plot against what (say).

This clearly shows that technically both approaches can live together. I come to the conclusion that the real question asked by @msdemlei is to say whether the IVOA have to get ride of integrated models (CUBE, MANGO,..) or not.

This is related to the question I'm asking, yes. But I'd not worry about God models (note: Cube doesn't need to be one; if it just points to the dependent and independent axes and leaves the rest to other DMs, it'll be a very compact and tidy DM) if there is a clear way in which to satisify these basic use cases in spite of them, and with a clear vision of what happens if we do have a major version change in one of our fundamental DMs.

If YES: this must be set as a headline of the workshop.

If NO: let's consider that both are valid and let's refine the syntax to facilitate the job for all. This question is not only a matter of annotation

I also suspect that entangled DMs cannot be fixed in the annotation, which is where the thread last October came from.

But perhaps that is just me -- which is why some sample implementation would help me a lot. If entangled models can be done in a way that don't impact client operations for the use cases presented, there's not much I'm worried about (but I'm still rather convinced programmers like Lego bricks better than pre-assembled models).

So, yes, I would propose "Bricks or Death Stars" as the workshop's headline.

mcdittmar commented 3 years ago

Markus,

Have you looked at the implementations done so far? I feel that a lot of these questions are addressed there.

Again, and that is important: Clients must not need to know a full data model (e.g., "timeseries") to do that. The rationale behind that is that client writers will curse us if they will have to re-implement "find an error for a value" again and again just because there's a new product type. And the data providers will curse us if they cannot annotate a value/error pair just because it's, say, a redshift and there's no "data model" for that.

There is no 're-implement "find an error for a value" again and again'. With the model re-use, if you understand meas:Measure, you know where the errors are in any model that uses it. These examples become rather too trivial to be interesting...

Examples (all assuming "if present in the table", of course): If I see an ra, figure out the corresponding dec these are part of the same instance, so if you've found the instance containing RA, you already have the DEC. pos = doc.find_instances(LonLatPoint)[0] ra = pos.longitude dec = pos.latitude

If I have an ra/dec, figure out the reference system, the reference position, the epoch frame = pos.coord_sys.frame.space_ref_frame epoch = pos.coord_sys.frame.equinox

If I have a position, figure out an associated proper motion I know Astropy combines the position and proper motion elements together in the same object.. So far, the models define no specific association between them. With 'Mango', you could have a Source with Position and ProperMotion as associated properties. In the rama package, it converts meas:Position to astropy SkyCoord whenever possible, so if there is an associated proper motion element as well, it could be incorporated into that instance.

If I have any sort of quantity, figure out the/a error (and, eventually its sort, perhaps one day distribution paraemters, etc) I'm assuming you don't mean the ivoa:Quantity.. which is not something that would be grabbed individually prop = doc.find_instance() err_type = type( prop.error.stat_error )

If I have a flux or magnitude, figure out the band it's in, perhaps a zeropoint, etc. This is a bit complicated by the fact that PhotDM is not tied into the other models, so there is no coordinate system which includes photometry metadata. That was a target of the SpectralDM work. Mango has a stub in place to work with.. prop = doc.find_instance(Photometry) band = prop.coord.coord_sys.frame.name zeropoint = prop.coord.coord_sys.frame.zeroPointFlux

Re: Prototype library..

Omar's rama package is a prototype library, available on GIT, uses standard VOTable parser to interpret the VOTable, and custom parser for the Annotation..
it auto-converts Position and Time instances to astropy SkyCoord and Time objects when possible.
presents an API to the user (see above commands; and any of the example implementation pages)
- as I mentioned in your other thread, has a similar front-end to yours.
- often following the model class structure
- but can be extended with helpers for convenience.
  - eg implemented in rama: conversion to astropy, plotters, shortcuts from cube axes -> data
I've always expected it to become publicly available once the model/annotation questions are resolved.. the way things are going, probably as a PyVO package.

One of the questions I've had with the 'just search for what you are interested in' approach is illustrated in the standard_properties case:

Table with Master Source properties
Table with Detections (also Source instances) If I open that file and do sources = doc.find_instances(Source) I get ALL the Source instances.. master and detections.. there is very little that I can reasonably do with that. Same for many of the base elements pos = doc.find_instances(Position) returns all the Source positions, Detection positions, Target Position, possibly the Reference Position of the satellite, etc.. any Position in the file. The approach seems too loose for anything but the simplest cases.

msdemlei commented 3 years ago

On Tue, Apr 20, 2021 at 03:22:57PM -0700, Mark Cresitello-Dittmar wrote:

Have you looked at the implementations done so far? I feel that a lot of these questions are addressed there.

Presumably not nearly the latest ones, more like Mango of a year ago and Omar's code as of Shanghai or so, and I have to admit I have lost track of what goes on where.

Perhaps you could point us to the current version(s) of the packages you have mentioned? Having a concise overview over where to look (in which branches where applicable) is a good thing with a view to the workshop, too. Of course, I'd be particularly grateful if brief docs on solving the mentioned use cases (which clearly are not trivial, as evinced by the fact that we still can't do them 20 years in the VO) could easily be located.

Bonnarel commented 3 years ago

@msdemlei wrote

This is related to the question I'm asking, yes. But I'd not worry about God models (note: Cube doesn't need to be one; if it just points to the dependent and independent axes and leaves the rest to other DMs, it'll be a very compact and tidy DM) if there is a clear way in which to satisify these basic use cases in spite of them, and with a clear vision of what happens if we do have a major version change in one of our fundamental DMs.

Cube is much more than that, as far as I remember. Dependent and independent axes fit well with TimeSeries and Spectrum dataproduct types, but in ND Cubes all but one axes are independant and in event list everything is independant. Important and different question : are the axes sparsed or regularly sampled ? event list are sparsed, TimeSeries may be, ND cubes are regular, etc...

If YES: this must be set as a headline of the workshop. - If NO: let's consider that both are valid and let's refine the syntax to facilitate the job for all. This question is not only a matter of annotation I also suspect that entangled DMs cannot be fixed in the annotation, which is where the thread last October came from. But perhaps that is just me -- which is why some sample implementation would help me a lot. If entangled models can be done in a way that don't impact client operations for the use cases presented, there's not much I'm worried about (but I'm still rather convinced programmers like Lego bricks better than pre-assembled models). So, yes, I would propose "Bricks or Death Stars" as the workshop's headline.

Of course if you change one of the models the annotation will change and clients will have to adapt. But this is true wether the models are entangled or not. Or there is something I do not understand

mcdittmar commented 3 years ago

On Wed, Apr 21, 2021 at 3:57 AM msdemlei @.***> wrote:

On Tue, Apr 20, 2021 at 03:22:57PM -0700, Mark Cresitello-Dittmar wrote:

Have you looked at the implementations done so far? I feel that a lot of these questions are addressed there.

Presumably not nearly the latest ones, more like Mango of a year ago and Omar's code as of Shanghai or so, and I have to admit I have lost track of what goes on where.

I was referring to these workshop cases...

Perhaps you could point us to the current version(s) of the packages you have mentioned? Having a concise overview over where to look (in which branches where applicable) is a good thing with a view to the workshop, too.

Each of my implementation pages have a summary of the packages used, with pointers. e.g., https://github.com/ivoa/dm-usecases/tree/main/usecases/time-series/mcd-implementation

Of course, I'd be particularly grateful if brief docs on solving the mentioned use cases (which clearly are not trivial, as evinced by the fact that we still can't do them 20 years in the VO) could easily be located.

Perhaps my point is that with the data properly annotated to the current models, these operations become trivial. With the 'Death Star' approach, there is no 'figuring out' where something is, you simply follow the path (modulo API helpers). With the 'Bricks' approach, you have Coordinate A using FIELD x.. you have to find Measure B also using FIELD x.

I'd be happy to put a page together that goes through that set of actions specifically, but I'm sure each is covered within the various cases (module position-propermotion maybe).

msdemlei commented 3 years ago

On Wed, Apr 21, 2021 at 06:37:48AM -0700, Mark Cresitello-Dittmar wrote:

Of course, I'd be particularly grateful if brief docs on solving the mentioned use cases (which clearly are not trivial, as evinced by the fact that we still can't do them 20 years in the VO) could easily be located.

Perhaps my point is that with the data properly annotated to the current models, these operations become trivial. With the 'Death Star' approach, there is no 'figuring out' where something is, you simply follow the path (modulo API helpers).

But that's different each time depending on whether you have a time series, a catalogue, or yet something else, right? So, a client that doesn't know SparseCube but just PhotDM would be out of luck when trying to work out the, say, band of the magnitudes in

https://github.com/ivoa/dm-usecases/blob/main/usecases/time-series/mcd-implementation/gavo_summary.md

right?

It is this kind of "new code per high-level DM" thing that I'd really love to avoid, because it will be hard sell on our clients (I'll mention again I'm only here because we failed to sell STC1). And it'll be an even tougher sell once we change the first basic DM and all these high-level DMs start to break.

I'd be happy to put a page together that goes through that set of actions specifically, but I'm sure each is covered within the various cases (module position-propermotion maybe).

Yeah, that would help me a lot. Just putting up the various steps I'll have to do until the code on the gavo_summary.md actually runs would give me a much better idea of what all this entails.

But, frankly, what I'm really after is code that satisfies the precise use cases in the original issue: "use annotation X" if it's there, where X is Meas, Coord, or Phot -- and do that regardless of what "sort" of data one is dealing with.

msdemlei commented 3 years ago

On Wed, Apr 21, 2021 at 06:13:10AM -0700, Bonnarel wrote:

@msdemlei wrote Of course if you change one of the models the annotation will change and clients will have to adapt. But this is true wether the models are entangled or not. Or there is something I do not understand

The question, again, is: When you change, say, Meas to Meas2, will this only break uses of Meas, or will it take Coord, Dataset, Cube, Spectrum and everything else with it?

With entangled models, Meas2 will immediately mean Coord2, Dataset2, Cube2, Spectrum2, and so on, so a legacy client will no longer understand any annotation at all. It won't, for instance, find the reference system for a position, although if it found it and boldly interpreted it it would actually work. Client developers will not be amused.

With isolated models, a client still can figure out that ra/dec is in ICRS even after a full Meas2 migration. It will still work fine for anything that doesn't require figuring out the errors. Overall: much better use experience in many common cases (and graceful failure in the others).

Oh, and of course it's simple to provide Meas and Meas2 annotation at the same time, and it's straightforward to write clients to "use the best one"; see https://github.com/msdemlei/astropy#choosing-the-most-expressive-annotation for how clients would do that.

Sure, you could have multiple annotations with entangled models, too -- but then you'll have a tag soup explosion, because you will have to repeat all annotations in all the God blocks, most of them identical except for the DM names. Annotators will now be amused.

mcdittmar commented 3 years ago

On Thu, Apr 22, 2021 at 3:56 AM msdemlei @.***> wrote:

On Wed, Apr 21, 2021 at 06:37:48AM -0700, Mark Cresitello-Dittmar wrote:

Of course, I'd be particularly grateful if brief docs on solving the mentioned use cases (which clearly are not trivial, as evinced by the fact that we still can't do them 20 years in the VO) could easily be located.

Perhaps my point is that with the data properly annotated to the current models, these operations become trivial. With the 'Death Star' approach, there is no 'figuring out' where something is, you simply follow the path (modulo API helpers).

But that's different each time depending on whether you have a time series, a catalogue, or yet something else, right? So, a client that doesn't know SparseCube but just PhotDM would be out of luck when trying to work out the, say, band of the magnitudes in

https://github.com/ivoa/dm-usecases/blob/main/usecases/time-series/mcd-implementation/gavo_summary.md

right?

No..

If you are starting at the top (SparseCube vs Mango), the path to the Meas would, of course be different. If you only understand Meas and PhotDM, you can ( if using the rama package )

mags = doc.find_instances( Photometry )  << find the photometry Measure
band = mags[0].coord.coord_sys.frame.filter.name << band name of the

PhotometryFilter

assuming photDM:PhotometryFilter if off the CoordFrame.

This is always true, regardless of the data product in which the Photometry resides. You can do this because you know it is a Photometry type Measure, and therefore, the PhotometryFilter resides in the associated CoordFrame.

If you only understand PhotDM (and not Meas), not sure how you found 'mag', but.. you can still find all PhotometryFilters

filters = doc.find_instances( PhotometryFilter ) band = filters.name

If there is only 1 filter, that must apply to your mag, otherwise you have some troubles.

msdemlei commented 3 years ago

On Thu, Apr 22, 2021 at 07:21:19AM -0700, Mark Cresitello-Dittmar wrote:

On Thu, Apr 22, 2021 at 3:56 AM msdemlei @.***> wrote:

On Wed, Apr 21, 2021 at 06:37:48AM -0700, Mark Cresitello-Dittmar wrote:

If you are starting at the top (SparseCube vs Mango), the path to the Meas would, of course be different. If you only understand Meas and PhotDM, you can ( if using the rama package )
mags = doc.find_instances( Photometry )  << find the photometry Measure
band = mags[0].coord.coord_sys.frame.filter.name << band name of the
PhotometryFilter

assuming photDM:PhotometryFilter if off the CoordFrame.

Can I see this in a complete script? I'm wondering at the moment where the Photometry symbol comes from.

If you only understand PhotDM (and not Meas), not sure how you found 'mag', but.. you can still find all PhotometryFilters

filters = doc.find_instances( PhotometryFilter ) band = filters.name

If there is only 1 filter, that must apply to your mag, otherwise you have some troubles.

I hope you'll understand that this statement unnerves me a bit given we're talking about the solution to our 20-years-old coordinates troubles here...

So, why not attach the PhotometryFilter directly to the magnitude? It's clearly more robust and works just as well, no?

       -- Markus

lmichel commented 3 years ago

@msdemlei

But if I get shown plausible ways how they can do that with a reasonable effort, I'll stop whining immediately.

I spent last 2 days on an implementation aiming at satisfying your request . This implementation, based on Mango, can be used at different model friendships levels. 1- Doing nothing with the model 2- Just take the frames/filters in the globals, as you do. 3- Picking MCT instances by exploring the map of the MANGO parameters as I suggested for you. 4- Get the binding between model leaves and fields (e.g. pos + error is mapped on columns 2, 3, 9). That way, clients can read the data with their usual APIs and provide on demand data counterpart onto the model. This is close to what you are doing as well. 5- Directly getting data rows as model instances. I know that this way to proceed won't replace what is working very well so far, but I bet it will be attractive when data will get more complex (e.g. multiple objects in one table, joint data tables)

lmichel commented 3 years ago

@msdemlei

Conversely, I'd be interested in your opinions about the API I'm proposing over at https://github.com/msdemlei/astropy

My opinion is that we roughly agree on HOW to consume annotated data

having getters (selectors indeed) parametrized with model elements (or mapping keys).
I presented such API in JAVA in a TDIG session at Victoria (AFAR)

but we definitely disagree on WHAT to consume.

e.g. Cube instances vs sparse sets of properties

mcdittmar commented 3 years ago

if you only understand Meas and PhotDM, you can ( if using the rama package )
mags = doc.find_instances( Photometry ) << find the photometry Measure band = mags[0].coord.coord_sys.frame.filter.name << band name of the PhotometryFilter

assuming photDM:PhotometryFilter if off the CoordFrame.

Can I see this in a complete script? I'm wondering at the moment where the Photometry symbol comes from.

The Standard Properties case: Annotated CSC file The file has several PhotometryFilter bands, and FIELD 'col15' annotated as a meas:Photometry instance from EB1. NOTE: These PhotometryCoordSys elements point to the PhotometryFIlter instances as per the models.

  <INSTANCE dmtype="mango:coordinates.PhotometryCoordSys" ID="_photsys_EB1">
  <INSTANCE dmtype="mango:coordinates.PhotometryCoordSys" ID="_photsys_EB2">
  <INSTANCE dmtype="mango:coordinates.PhotometryCoordSys" ID="_photsys_EB3">
  <INSTANCE dmtype="mango:coordinates.PhotometryCoordSys" ID="_photsys_EB4">

          <INSTANCE dmtype="mango:measures.Photometry">
            <ATTRIBUTE dmrole="mango:measures.Photometry.coord">
              <INSTANCE dmtype="mango:coordinates.PhotometryCoord">
                <ATTRIBUTE dmrole="mango:coordinates.PhotometryCoord.luminosity">
                 <COLUMN dmtype="ivoa:RealQuantity" ref="col15"/>
                </ATTRIBUTE>
                <REFERENCE dmrole="coords:Coordinate.coordSys">
                  <IDREF>_photsys_EB1</IDREF>
                </REFERENCE>
              </INSTANCE>
            </ATTRIBUTE>
          </INSTANCE>

Python Script:

#!/usr/bin/env python
from rama.reader import Reader
from rama.reader.votable import Votable
from rama.models.mango import Photometry

infile = "ivoa_csc2_example_annotated.vot"
doc = Reader( Votable(infile) )

flux = doc.find_instances(Photometry)[0]
band = flux.coord.coord_sys.frame.name

print("Flux: {} [band: {}]".format( str(flux.coord.luminosity[0]), band) )

Produces: Flux: 1.7599881487057e-14 erg/s/cm^2 [band: CHANDRA/ACIS.broad]

Summary: The rama package interprets the annotation and generates python classes. As far as accessing the instances and using them goes, it makes NO DIFFERENCE, if the Photometry data is in a TimeSeries, Source, or are stand-alone. The package finds the nodes annotating the class and returns the instances.

If you only understand PhotDM (and not Meas), not sure how you found 'mag', but.. you can still find all PhotometryFilters filters = doc.find_instances( PhotometryFilter ) band = filters.name If there is only 1 filter, that must apply to your mag, otherwise you have some troubles.

I hope you'll understand that this statement unnerves me a bit given we're talking about the solution to our 20-years-old coordinates troubles here...

What is unnerving? The model associates the 'col15' flux Measure with the proper PhotometryFilter. With that relation defined, finding the band becomes trivial, without it there is. no way to know which of the 4 bands is relevant.. you really can't do it.

So, why not attach the PhotometryFilter directly to the magnitude? It's clearly more robust and works just as well, no?

That could be done, but the models try organize the concepts. The PhotometryFilter is not a direct property of the Photometry Measure, but rather associated metadata on the environment which produced the value. This sort of information is relegated to the associated coordinate Frame in all Measures. I could certainly see adding a helper method to the Photometry class which short-cuts this for users.

msdemlei commented 3 years ago

On Fri, Apr 23, 2021 at 10:18:23AM -0700, Mark Cresitello-Dittmar wrote:

Python Script:

#!/usr/bin/env python
from rama.reader import Reader
from rama.reader.votable import Votable
from rama.models.mango import Photometry

Ah, that's where the Photometry symbol comes from. So... that's generated code? I'll frankly say that anything requiring code generation gets large demerits in my book, and I think in many other peoples' books.

So, why not attach the PhotometryFilter directly to the magnitude? It's clearly more robust and works just as well, no?

That could be done, but the models try organize the concepts. The PhotometryFilter is not a direct property of the Photometry Measure, but rather associated metadata on the environment which produced the value. This sort of information is relegated to the associated coordinate Frame in all Measures. I could certainly see adding a helper method to the Photometry class which short-cuts this for users.

If "organize the concepts" actually makes things more difficult (and may even require adding helper methods), perhaps that's another indicator that we should think again whether this "organize the concepts" is actually useful. So: Do we get a benefit (which would be: "Things we can do that we could not otherwise do") for the cost of having to have a lot of extra classes, needing new classes whenever new sorts of measurements come up (although many existing classes would already work) and having an extra indirection?

Standard reminder: We'll have to sell this to the community. Enabling things is what will make them buy this. Making things more complicated just because we feel it's more beautiful will make them frown.

msdemlei commented 3 years ago

On Fri, Apr 23, 2021 at 07:54:38AM -0700, Laurent MICHEL wrote:

But if I get shown plausible ways how they can do that with a reasonable effort, I'll stop whining immediately.

I spent last 2 days on an implementation aiming at satisfying your request . This implementation, based on Mango, can be used at different model friendships levels. 1- Doing nothing with the model 2- Just take the frames/filters in the globals, as you do. 3- Picking MCT instances by exploring the map of the MANGO parameters as I suggested for you. 4- Get the binding between model leaves and fields (e.g. pos + error is mapped on columns 2, 3, 9). That way, clients can read the data with their usual APIs and provide on demand data counterpart onto the model. This is close to what you are doing as well. 5- Directly getting data rows as model instances. I know that this to proceed won't replace what is working very well so far, but I bet it will be attractive when when data will get more complex (e.g. multiple objects in one table, joint data tables)

Hm -- what you show doesn't quite solve what I think we most dearly need: I have a column (say, the user picked it in a plot widget), and now I simply want to figure out what's the reference frame it it's part of a position, or the bandpass if it's photometry.

I'd like to try this myself, though -- could you briefly point at what code I need to run this?

lmichel commented 3 years ago

mango_browser.get_parameters() returns a dictionary of all parameters with their frames. This dictionary embeds FIELD identifiers of the mapped columns (id = FIELD id, index = FIELD position)

see

mango_browser.get_data(limit=1) returns an object (among others things) binding the column numbers with the keys of the above dictionary.

So that you get connections between model parameters and VOTable columns.

ex: https://github.com/ivoa/modelinstanceinvot-code/blob/main/python/client/demo/ivoa_csc2_example.py

mcdittmar commented 3 years ago

On Fri, Apr 23, 2021 at 10:18:23AM -0700, Mark Cresitello-Dittmar wrote: Python Script:
(snip)

Ah, that's where the Photometry symbol comes from. So... that's generated code? I'll frankly say that anything requiring code generation gets large demerits in my book, and I think in many other peoples' books.

I don't see the relevance of the code origin to this discussion. The system in place has proved very useful for keeping the code current with the model changes. There is no requirement for using code generators.

You asked to see the script which gets the filter name from the magnitude.. What is the corresponding thread in your implementation?

With your proposal of a single Measurement type and uncoupled models, what makes the connection between the 'flux' Measurement and the corresponding Filter?
I see in your TS example the PhotCal instance has added an attribute to the PhotCal object for 'value' pointing to the flux column.. which is not part of the photDM spec (values are not part of PhotCal). I think we hit on this before.

So, why not attach the PhotometryFilter directly to the magnitude?

It occurred to me today that you probably were not advocating this approach, but merely being critical of where we DID decide to put the filters. To do as this suggests would mean generating an extension to Measurement which adds the reference to the PhotometryFilters.. This would go against both of your base assertions that 1 Measurement is sufficient, and that models should be decoupled.

If "organize the concepts" actually makes things more difficult (and may even require adding helper methods), perhaps that's another indicator that we should think again whether this "organize the concepts" is actually useful. So: Do we get a benefit (which would be: "Things we can do that we could not otherwise do")

well, one can trivially find the Photometry data and its corresponding Filter metadata..

"which clearly are not trivial, as evinced by the fact that we still can't do them 20 years in the VO"

msdemlei commented 3 years ago

On Mon, May 03, 2021 at 09:43:55PM -0700, Mark Cresitello-Dittmar wrote:

On Fri, Apr 23, 2021 at 10:18:23AM -0700, Mark Cresitello-Dittmar wrote: Ah, that's where the Photometry symbol comes from. So... that's generated code? I'll frankly say that anything requiring code generation gets large demerits in my book, and I think in many other peoples' books.

I don't see the relevance of the code origin to this discussion. The system in place has proved very useful for keeping the code current with the model changes. There is no requirement for using code generators.

If that is true, that's reassuring. How would the client code look like without the generated code?

You asked to see the script which gets the filter name from the magnitude.. What is the corresponding thread in your implementation?

Since it's the same logic as https://github.com/msdemlei/astropy#getting-an-error-for-a-column, I've not added it in the README, but it's (assuming you want to figure out the band name for a column col):

try:
  ann = col.get_annotations("phot:PhotCal")[0].filterIdentifier
except IndexError:
  raise Exception(f"No photometry annotation on {col}")

With your proposal of a single Measurement type and uncoupled models, what makes the connection between the 'flux' Measurement and the corresponding Filter?

The value reference in phot:PhotCal.

I see in your TS example the PhotCal instance has added an attribute to the PhotCal object for 'value' pointing to the flux column.. which is not part of the photDM spec (values are not part of PhotCal). I think we hit on this before.

Yeah, that needs to be fixed; some of our DMs have to be taught to talk about what they annotate. But if we want to de-entangle them, some minor adjustments will be necessary anyway.

So, why not attach the PhotometryFilter directly to the magnitude?

It occurred to me today that you probably were not advocating this approach, but merely being critical of where we DID decide to put

Ummm... of course I'd like to attach the filter to the column (or param) it applies to. Why would I not?

the filters. To do as this suggests would mean generating an extension to Measurement which adds the reference to the PhotometryFilters.. This would go against both of your base assertions that 1 Measurement is sufficient, and that models should be decoupled.

Right. That's one reason to directly annotate the column with the PhotCal information.

If "organize the concepts" actually makes things more difficult (and may even require adding helper methods), perhaps that's another indicator that we should think again whether this "organize the concepts" is actually useful. So: Do we get a benefit (which would be: "Things we can do that we could not otherwise do")

well, one can trivially find the Photometry data and its corresponding Filter metadata..

Isn't what I show above about as trivial?

In think you find it's not: Building a large and arguably fragile tree (network?) of models to hide what complexity is unavoidable will, I expect, backfire. All my programming practice has taught me that getting such abstractions right is really hard. Let's try how far we get without having to introduce them.

mcdittmar commented 3 years ago

From #15: May 4, 2021 - Markus wrote:

With just one Measurement class (or perhaps a few when we add proper distributions), with the API of https://github.com/msdemlei/astropy, all TOPCAT would need to do is:
ann = col.get_annotations("meas:Measurement")
if ann:
 associated_error = ann.naive_error
(or it would try a few attributes it knows how to plot).

With current Meas, it will, as far as I can see, have to do something like
MEAS_CLASSES = ["meas:GenericMeasure", "meas:Time",
    "meas:Position", "meas:Velocity", "meas:ProperMotion"]
# I'm leaving out Polarisation because it really doesn't belong here

for class_name in MEAS_CLASSES:
  ann = col.get_annotations(class_name)
  if ann:
    associated_error = ann.naive_error
And, worse, each time we invent a new Measure subclass, it will have to amend MEAS_CLASS.

That's a high price to pay; it would be worth paying if we got a major benefit from it. But I can't even see a minor one.

This is the 'find all measures' case. With rama package, since these are all children of meas:Measure:

doc = Reader( Votable(infile) )
measures = doc.find_instances(Measure)

for item in measures:
    print("item type == {}, has Error == {}".format(str(type(item)), "FALSE" if item.error is None else "TRUE" ))

produces item type == <class 'rama.models.measurements.Time'>, has Error == FALSE item type == <class 'rama.models.measurements.GenericMeasure'>, has Error == FALSE

I'll note that this is not finding the Measures defined under the mango model, (Photometry, HardnessRatio), but I'd consider this a bug on the package.

mcdittmar commented 3 years ago

If that is true, that's reassuring. How would the client code look like without the generated code?

I have no idea.. My expectation is that the rama package will become production ready once we've settled on models and annotation. At that point, clients can use it, or write their own parser, or use someone else's package. Given your reaction to the code genesis, maybe you can consider it a 'worst case experience' for clients

You asked to see the script which gets the filter name from the magnitude.. What is the corresponding thread in your implementation?

Since it's the same logic as https://github.com/msdemlei/astropy#getting-an-error-for-a-column, I've not added it in the README, but it's (assuming you want to figure out the band name for a column col):
 try:
    ann = col.get_annotations("phot:PhotCal")[0].filterIdentifier 
 except IndexError:
    raise Exception(f"No photometry annotation on {col}")

The case is different in that the PhotCal and Measure are in different models.

given a 'flux' Measure, how do you find the corresponding filter band? What you show merely finds the first PhotCal instance, right?

With your proposal of a single Measurement type and uncoupled models, what makes the connection between the 'flux' Measurement and the corresponding Filter?

The value reference in phot:PhotCal.

I see in your TS example the PhotCal instance has added an attribute to the PhotCal object for 'value' pointing to the flux column.. which is not part of the photDM spec (values are not part of PhotCal). I think we hit on this before.

Yeah, that needs to be fixed; some of our DMs have to be taught to talk about what they annotate. But if we want to de-entangle them, some minor adjustments will be necessary anyway.

With your proposal, you can only accomplish this task because you've made an adhoc change to the PhotDM model which is conceptually wrong. Photometric Filters conceptually stand on their own, they do not know anything about the data that was taken using them. Many flux/mag columns can use the same filter. And presumably, many other sorts of objects can reference a PhotCal instance.

So, why not attach the PhotometryFilter directly to the magnitude?

It occurred to me today that you probably were not advocating this approach, but merely being critical of where we DID decide to put

Ummm... of course I'd like to attach the filter to the column (or param) it applies to. Why would I not?

Attaching it to the column is not the same as attaching it to the magnitude (Measure)

lmichel commented 3 years ago

I'm not expert in Rama package but I guess that the generated code is basically object classes matching the model elements with appropriate accessors (setter and getter). This works fine once your object is properly set. My concern about this approach is to know how properly setting these instances. In the archival data world, you have to consider that some VOTable data won't ever match exactly what generated code is expecting: e.g.

Numerical value given as string
positions given in unexpected format.

It is hard to imagine that generated code will be able to tackle with all possible cases. In these condition, I think that a library based on generated code should be used behind some hand-made wrapper.

For this reason I prefer to allow clients to store annotations as dictionnaries (Python) and to let them to picking data there with appropriate selectors (see here).

msdemlei commented 3 years ago

On Tue, May 04, 2021 at 09:52:05AM -0700, Mark Cresitello-Dittmar wrote:

Since it's the same logic as https://github.com/msdemlei/astropy#getting-an-error-for-a-column, I've not added it in the README, but it's (assuming you want to figure out the band name for a column col):
 try:
    ann = col.get_annotations("phot:PhotCal")[0].filterIdentifier 
 except IndexError:
    raise Exception(f"No photometry annotation on {col}")
The case is different in that the PhotCal and Measure are in different models.

given a 'flux' Measure, how do you find the corresponding filter band? What you show merely finds the first PhotCal instance, right?

This is another case where entangling DMs is pernicious: Photometry shouldn't have anything to do with errors (i.e., Meas). The whole problem disappears when you drop the notion that there should be something like a FluxMeasure (or whatever, I can't find it in the 2020-04-13 PR).

When you drop classes and things work even better, to mq that's a clear indication that the class shouldn't be there.

Yeah, that needs to be fixed; some of our DMs have to be taught to talk about what they annotate. But if we want to de-entangle them, some minor adjustments will be necessary anyway.

With your proposal, you can only accomplish this task because you've made an adhoc change to the PhotDM model which is

Well, yes, this thread is about changes to our models to make them work better (and more likely to be adopted).

conceptually wrong. Photometric Filters conceptually stand on their own, they do not know anything about the data that was taken using them. Many flux/mag columns can use the same filter. And presumably, many other sorts of objects can reference a PhotCal instance.

That's how the model works right now, and I argue that we shouldn't be doing it that way.

You see, it's conceptually as valid to say "there's photometry metadata attached to this column" -- that in reality there actually is a filter (as in the concrete thing that you put somewhere into your optical path) somewhere that has some relation to the photometry metadata doesn't mean it makes operational sense to model that particular artefact. Not in principle, and here in particular because between that filter and the flux you have in a column there's a lot of additional stuff happening.

Concrete example, as for instance in SDSS: you have PSF and aperture photometry. You could perhaps shave off a few elements by sharing the "z filter that was in the optical path" between the PSF and aperture photometry annotation, but there's a lot of complexity involved in that that's a lot more expensive in implemenation than just keeping that piece of information (and perhaps a few additional infos characterising it) local to both the PSF and aperture annotations.

So, again, I argue you're paying a high price of a small benefit.

Ummm... of course I'd like to attach the filter to the column (or param) it applies to. Why would I not?

Attaching it to the column is not the same as attaching it to the magnitude (Measure)

No, it's not, because there's an additional indirection, and you're entangling Meas and Phot. Suppose for a moment that there is -- as I claim -- a price for that entangling, and there's a price for the extra Meas class: having it on the column saves us from having to pay both.

msdemlei commented 3 years ago

On Tue, May 04, 2021 at 08:25:35AM -0700, Mark Cresitello-Dittmar wrote:

From #15: May 4, 2021 - Markus wrote:
With just one Measurement class (or perhaps a few when we add proper distributions), with the API of https://github.com/msdemlei/astropy, all TOPCAT would need to do is:
ann = col.get_annotations("meas:Measurement")
if ann:
 associated_error = ann.naive_error
That's a high price to pay; it would be worth paying if we got a major benefit from it. But I can't even see a minor one.
This is the 'find all measures' case.

Perhaps in a somewhat roundabout sense; what TOPCAT would really want to do in this use case is "Find a Measurement instance on the column I'm about to plot", which arguably can do with a shortcut over "enumerate all measurements and see if I can find one on my column".

With rama package, since these are all children of meas:Measure:

Yeah, that's the problem: You need a complex package (with, as you say, bugs of its own) that needs to access and parse VO-DML.

Without the per-physics classes all you need is a simple, model-indepentent, extension to a VOTable parser (it's a 400 lines diff in astropy/io/votable/tree.py of https://github.com/msdemlei/astropy) and have it right there, without having to deal with VO-DML as such (which, don't get me wrong, still is great, and validators will need to deal with it; it's just that normal clients shouldn't have to).

I'm here because I'm convinced this sort of simplicity will eventually decide whether after we're done here we'll have everyday clients consuming our annotations -- or whether we'll go on as it's been for the past 15 years, with people ignoring our DMs and there's still no usable way to automatically plot error bars (or have epoch sliders without guessing).

mcdittmar commented 3 years ago

I'm not expert in Rama package but I guess that the generated code is basically object classes matching the model elements with appropriate accessors (setter and getter). In the archival data world, you have to consider that some VOTable data won't ever match exactly what generated code is expecting: e.g.

Numerical value given as string

positions given in unexpected format.

Lets not side-track this issue with a critique of the Rama package.

the auto-generated model classes are merely end-points presented to the user; instantiated with the find_instances() command.
there is 'regular' parser code which interprets the annotation and VOTable content
- uses AstroPy QTable to interpret the VOTable data content.
  - which conveniently creates AstroPy Quantity types for the data, along with all its Unit handling features.
  - I did hit a bug in QTable, where it could not understand RA/DEC in HMS/DMS notation.. turns out this was reported and fixed a couple weeks ago, and is slated for a future release.
- it is at this level that the data inconsistency handling occurs

The community has had a LOT of questions about how compatible the models are with AstroPy, one of the main goals of the implementation was to leverage AstroPy as much as possible. So it uses QTable, Quantity, has converters for SkyCoord and Time.

lmichel commented 3 years ago

Thank you for these clarifications.

mcdittmar commented 3 years ago

On Tue, May 04, 2021 at 09:52:05AM -0700, Mark Cresitello-Dittmar wrote: > Since it's the same logic as https://github.com/msdemlei/astropy#getting-an-error-for-a-column, I've not added it in the README, but it's (assuming you want to figure out the band name for a column col):
try: 
  ann = col.get_annotations("phot:PhotCal")[0].filterIdentifier 
except IndexError: 
  raise Exception(f"No photometry annotation on {col}") 
The case is different in that the PhotCal and Measure are in different models.

given a 'flux' Measure, how do you find the corresponding filter band? What you show merely finds the first PhotCal instance, right?
This is another case where entangling DMs is pernicious: Photometry shouldn't have anything to do with errors (i.e., Meas). The whole problem disappears when you drop the notion that there should be something like a FluxMeasure (or whatever, I can't find it in the 2020-04-13 PR). When you drop classes and things work even better, to mq that's a clear indication that the class shouldn't be there.

The Measure containing 'flux' does have errors. In your proposal, this would be a Measure annotating the appropriate columns 'flux', 'flux_err'.
1. Your use case is: 'If I have a flux or magnitude, figure out the band it's in, perhaps a zeropoint, etc.'
2. so, given that you've found one of several annotations for a 'flux' measure; what is your thread for finding the corresponding band? All you've shown is that you can find a PhotCal, not necessarily the correct one.
I'm not sure Photometry has been a component in any of the Meas/Coord drops.. it was in Spectral and an obvious candidate for expansion from the basics since it has an association with the PhotometryFilter.
What qualifies Photometry to be a special type (rather than being handled by GenericMeasure) is that there is associated metadata.. namely the PhotometryFilter. The specialized model class tells the provider and client what the expected associated metadata is, and where to find it. And that is the crux of different approaches.

conceptually wrong. Photometric Filters conceptually stand on their own, they do not know anything about the data that was taken using them. Many flux/mag columns can use the same filter. And presumably, many other sorts of objects can reference a PhotCal instance.

That's how the model works right now, and I argue that we shouldn't be doing it that way. You see, it's conceptually as valid to say "there's photometry metadata attached to this column" -- that in reality there actually is a filter (as in the concrete thing that you put somewhere into your optical path) somewhere that has some relation to the photometry metadata doesn't mean it makes operational sense to model that particular artefact.

Everyone who has worked IVOA models to date has taken the current approach. That is a LOT of history by a lot of people. The models should define the entities involved and their relations. They need to support several different uses, from DAL protocols, to Applications, and Serializations. Your approach focuses solely on the Serialization aspect, has no formal model backing, and has only been applied to very simple data structures and cases. It is a VERY big ask to expect the working group to change its entire approach based on this.

Ummm... of course I'd like to attach the filter to the column (or param) it applies to. Why would I not?

Attaching it to the column is not the same as attaching it to the magnitude (Measure)

No, it's not, because there's an additional indirection, and you're entangling Meas and Phot. Suppose for a moment that there is -- as I claim -- a price for that entangling, and there's a price for the extra Meas class: having it on the column saves us from having to pay both.

I'm not entangling Meas and Phot, I'm defining an association (in the model) between flux data and photometry metadata. "column" is not a model construct, it is a serialization construct (see above).

These last bits are off the issue topic, and back on the old circular debate, so I think I'll try to leave it here.

mcdittmar commented 3 years ago

This is the 'find all measures' case.

Perhaps in a somewhat roundabout sense; what TOPCAT would really want to do in this use case is "Find a Measurement instance on the column I'm about to plot", which arguably can do with a shortcut over "enumerate all measurements and see if I can find one on my column".

With rama package, since these are all children of meas:Measure:

Yeah, that's the problem: You need a complex package (with, as you say, bugs of its own) that needs to access and parse VO-DML. Without the per-physics classes all you need is a simple, model-indepentent, extension to a VOTable parser (it's a 400 lines diff in astropy/io/votable/tree.py of https://github.com/msdemlei/astropy) and have it right there, without having to deal with VO-DML as such (which, don't get me wrong, still is great, and validators will need to deal with it; it's just that normal clients shouldn't have to). I'm here because I'm convinced this sort of simplicity will eventually decide whether after we're done here we'll have everyday clients consuming our annotations -- or whether we'll go on as it's been for the past 15 years, with people ignoring our DMs and there's still no usable way to automatically plot error bars (or have epoch sliders without guessing).

Markus.. this isn't the problem. I'm showing how this would be done IF someone were using the rama package. This does not set any requirements on anyone else to use the same approach. If TOPCAT has this job to do, and the two options are available, they'd select the best approach for their requirements.

FWIW: the rama package uses astropy.io.votable in the parser code, though not sure which pieces at the moment.

msdemlei commented 3 years ago

On Wed, May 05, 2021 at 09:08:35AM -0700, Mark Cresitello-Dittmar wrote:

This is another case where entangling DMs is pernicious: Photometry shouldn't have anything to do with errors (i.e., Meas). The whole problem disappears when you drop the notion that there should be something like a FluxMeasure (or whatever, I can't find it in the 2020-04-13 PR). When you drop classes and things work even better, to mq that's a clear indication that the class shouldn't be there.

The Measure containing 'flux' does have errors. In your proposal, this would be a Measure annotating the appropriate columns 'flux', 'flux_err'.

Yes.

1. Your use case is: 'If I have a flux or magnitude, figure out
the band it's in, perhaps a zeropoint, etc.'

Yes -- which, in my world, is unrelated to the flux or magnitude also having errors.

2. so, given that you've found one of several annotations for a
'flux' measure; what is your thread for finding the
corresponding band?  All you've shown is that you can find a
PhotCal, not necessarily the correct one.

No. When you say:

table["mag_g"].get_annotations("phot:PhotCal")

you get the annotation(s) exactly for mag_g and for nothing else. No need to guess.

What qualifies Photometry to be a special type (rather than being handled by GenericMeasure) is that there is associated metadata.. namely the PhotometryFilter. The specialized model class tells the provider and client what the expected associated metadata is, and where to find it. And that is the crux of different approaches.

Perhaps, but I still wonder what motivates you to prefer this to simply associating that extra metadata to the annotated column itself. What sort of client would prefer to first go to a measurement annotation and then to the PhotCal rather than go to the PhotCal directly? Why would data providers want to have that intermediate class?

Is it really just that you'd like to model Filter as some sort of external thing? If so, one could of course make the filter a reference in the Photometry annotation, and again, there would be no need for cross-model references.

That's how the model works right now, and I argue that we shouldn't be doing it that way. You see, it's conceptually as valid to say "there's photometry metadata attached to this column" -- that in reality there actually is a filter (as in the concrete thing that you put somewhere into your optical path) somewhere that has some relation to the photometry metadata doesn't mean it makes operational sense to model that particular artefact.

Everyone who has worked IVOA models to date has taken the current approach. That is a LOT of history by a lot of people. The models

I hate to be blunt, but: that has given us essentially zero uptake. I wouldn't be here if we had a way to solve these basic uses cases, which DM should have provided by (at least) VOTable 1.2 in 2009.

I'm not blaming anyone, and I've been part of the failure myself, but: If something hasn't worked out for more than 10 years, wouldn't you agree it's an indication that we've been doing it wrong?

Applications, and Serializations. Your approach focuses solely on the Serialization aspect, has no formal model backing, and has only

What would count as "formal model backing"?

been applied to very simple data structures and cases. It is a VERY big ask to expect the working group to change its entire approach based on this.

The "very simple data structures" argument has been levelled a few times, but nobody has pointed me towards a case where disentangled models really work worse than entangled ones.

Sure, there is the case of severely denormalised tables that I, well, refuse to do, but that is completely independent of model architecture and a question of how complex we want our annotation to be.

So... what kind of complex data are you concerned about?

I'm not entangling Meas and Phot, I'm defining an association (in the model) between flux data and photometry metadata.

Well, call it what you want, but you have a reference from Meas to Phot if you say "In the PhotometryMeasure, there is a filterDef that points to photdm:PhotCal", right? And with that, we're back to the old thing that going to Photdm2 will require an incompatible change to Meas, too, which is what I, perhaps somewhat derogatorily, call entangling (I'll happily use some other term if you prefer).

mcdittmar commented 3 years ago

The Measure containing 'flux' does have errors. In your proposal, this would be a Measure annotating the appropriate columns 'flux', 'flux_err'.

Yes.

Your use case is: 'If I have a flux or magnitude, figure out the band it's in, perhaps a zeropoint, etc.'

Yes -- which, in my world, is unrelated to the flux or magnitude also having errors.

so, given that you've found one of several annotations for a 'flux' measure; what is your thread for finding the corresponding band? All you've shown is that you can find a PhotCal, not necessarily the correct one.

No. When you say: table["mag_g"].get_annotations("phot:PhotCal") you get the annotation(s) exactly for mag_g and for nothing else. No need to guess.

Ah.. now I see where 'col' comes from..

ann = col.get_annotations("phot:PhotCal")[0].filterIdentifier

It might be good to add a similar line in your README.rst as well to show what your starting point is. Question: table["mag_g"] returns the FIELD? and get_annotations() returns the PhotCal block associated with the same FIELD.

That makes sense... thanks.

mcdittmar commented 3 years ago

I hate to be blunt, but: that has given us essentially zero uptake. I wouldn't be here if we had a way to solve these basic uses cases, which DM should have provided by (at least) VOTable 1.2 in 2009. I'm not blaming anyone, and I've been part of the failure myself, but: If something hasn't worked out for more than 10 years, wouldn't you agree it's an indication that we've been doing it wrong?

Yes.. and the work that's been done with VODML, the revamp of these models, the annotation syntax instead of UTypes has all been directed at resolving those problems.

Applications, and Serializations. Your approach focuses solely on the Serialization aspect, has no formal model backing, and has only What would count as "formal model backing"?

Datamodels, written out, with consideration of their use cases. We know you've changed the models (Meas, Coords, PhotDM).. how? what do these new models look like? PhotDM, for one, is a REC.. what are the ramifications of adding the data to PhotCal? What other objects would need to change?

The "very simple data structures" argument has been levelled a few times, but nobody has pointed me towards a case where disentangled models really work worse than entangled ones. Sure, there is the case of severely denormalised tables that I, well, refuse to do, but that is completely independent of model architecture and a question of how complex we want our annotation to be. So... what kind of complex data are you concerned about?

These workshop cases have several levels of complexity in the data structure. Implementing these with your proposal would go a long way to supporting your argument. Just with the TimeSeries case we now have:

the simple TimeSeries that you provided: 1 Table = 1 LightCurve
ZTF TimeSeries : 1 Table = n LightCurves (one per source in the field-of-view)
GAIA multiband: 1 Table = m*n LightCurves (one per band per source) with compact structure

So far, we haven't seen that the decoupled models approach works at all in these other cases.

msdemlei commented 3 years ago

On Thu, May 06, 2021 at 08:04:39AM -0700, Mark Cresitello-Dittmar wrote:

What would count as "formal model backing"?

Datamodels, written out, with consideration of their use cases.

Since I can't stand the existing tooling for VO-DML, and writing something that works nicely for me is a bit of a chore: Would it suffice if I have, say, python class statements?

refuse to do, but that is completely independent of model architecture and a question of how complex we want our annotation to be. So... what kind of complex data are you concerned about?

These workshop cases have several levels of complexity in the data structure. Implementing these with your proposal would go a long way to supporting your argument. Just with the TimeSeries case we now have:

the simple TimeSeries that you provided: 1 Table = 1 LightCurve

ZTF TimeSeries : 1 Table = n LightCurves (one per source in the field-of-view)

GAIA multiband: 1 Table = m*n LightCurves (one per band per source) with compact structure

So far, we haven't seen that the decoupled models approach works at all in these other cases.

As I said, the Gaia serialisation is so denormalised that it's simply no table, and the DPAC folks will happily change their serialisation format if we tell them what to do and what to avoid. Actually, Gaia CU5 (which produces these time series) unsurprisingly uses properly normalised tables internally.

Now, we can add some normalisation layer (preferably SQL-based) for denormalised, non-tabular data, but that's neither annotation nor modelling, and as that's complex and difficult, it's prudent to not try this on the first go after quite a few failures. Conversely: Try to make a spec work for broken things you you'll get a broken spec.

The ZTF thing I can't find in the dm-usecases, but again this sounds more like a normalisation issue; if you deliver the equivalent of a

SELECT * FROM
  ts_1
  UNION
  ts_2
  UNION
  ts3

it stands to reason that pulling these tables apart again is best done using relational algebra (which produced them) rather than writing your model to account for this kind of thing.

More importantly, though: You can still annotate photometry, errors and positions on this, even if you can't do a time series annotation (because it isn't a time series). And that's the big bonus of isolated DMs: they can be produced and consumed whenever something approriate is in view, not only if "the whole thing is a time series" (and regardless of how many bells and whistles you attach to either your models or your annotation syntax, there will always be plenty of tables that look just like a time series but in the end aren't quite).

msdemlei commented 3 years ago

On Thu, May 06, 2021 at 07:29:11AM -0700, Mark Cresitello-Dittmar wrote:

No. When you say: table["mag_g"].get_annotations("phot:PhotCal") you get the annotation(s) exactly for mag_g and for nothing else. No need to guess.

Ah.. now I see where 'col' comes from..
ann = col.get_annotations("phot:PhotCal")[0].filterIdentifier 
It might be good to add a similar line in your README.rst as well to show what your starting point is. Question: table["mag_g"] returns the FIELD? and get_annotations() returns the PhotCal block associated with the same FIELD.

Yes on both. I'll see if I can make the README a bit more explicit.

lmichel commented 3 years ago

The ZTF thing I can't find in the dm-usecases

here it is

GROUPING, JOINING and FILTERING rows are basic operations that do not need a complex algebra. These are 3 statements setup to resolve well identified cases: 1- data of different objects mixed in one table (if you dislike ZTF, Gilles showed up Vizier examples). 2- data spread over multiple tables. IMO it is wise to consider now that some data providers will soon take advantage of the multi-table support in VOTable to publish complex data set (with 1-n relations)

lmichel commented 3 years ago

Saying that DM is a 10 years failure is an overstatement. Suggesting that the reason for this lack of success (I do prefer) is the development of integrated models is (huummmm) biased.

STC1.33 was not an integrated model, however it has been only used for as STS-S (great idea before the MOCs) and as a documentation base. The Markus proposal to serialize it didn't succeed. I don't know why.

The main reason for this situation is more likely to be searched in the lack of enthusiasm from both providers and developers to work with annotated VOTables. I take my part of responsibility for not having succeeded to dramatically change this situation. This workshop is however a good step forward.

I won't rewrite the history here, but I just hope we will find an good (*) exit way.

(*) anything different from a definitive let's get ride of this modeling stuff

msdemlei commented 3 years ago

On Fri, May 07, 2021 at 01:28:30AM -0700, Laurent MICHEL wrote:

The ZTF thing I can't find in the dm-usecases

here it is

Ok -- except for the transformation necessary to actually make this a time series (i.e., splitting it up per oid) and which I keep maintaining is beyond what annotation should do), I don't see a particular complication -- this basically works like my gaia time series example, except that you have two independent axes, so you'd have two cube groups, somewhat like this (again, SIL shorthand):

(ndcube:Cube) {
    independent_axes: [hjd]
    dependent_axes: [mag]
}

(ndcube:Cube) {
    independent_axes: [mjd]
    dependent_axes: [mag]
}

Well, and the photCal wouldn't be immediate:

(phot:PhotCal) {
    filterIdentifier: @filtercode
    zeroPointFlux: @magzp
    magnitudeSystem: ???
    effectiveWavelength: ???
    value: @mag
}

Or am I missing something?

GROUPING, JOINING and FILTERING rows are basic operations that do not need a complex algebra.

Well, things become complex when you consider the various conditions you need to express here, think about inner and outer joins, etc -- and, really, that is an algebra. Be that as it may: fortuantely annotation can be cleanly separated from structural modification, and we really should use this opportunity. Let's not encumber takeup of the basic features by requiring our adopters to re-implement a significant part of a SQL engine.

Once we've got people hooked, selling them structural transformations (in whatever way) is going to be much easier.

These are 3 statements setup to resolve well identified cases: 1- data of different objects mixed in one table (if you dislike ZTF, Gilles showed up Vizier examples). 2- data spread over multiple tables. IMO it is wise to consider now that some data providers will soon take advantage of the multi-table support in VOTable to publish complex data set (with 1-n relations)

Oh, case 2 is actually sane and non-legacy. It can be handled just fine with just annotation (saying something like "x is a foreign key in y").

lmichel commented 3 years ago

Well, things become complex when you consider the various conditions

The conditions are perfectly set in the spec.

Oh, case 2 is actually sane and non-legacy. It can be handled just fine with just annotation (saying something like "x is a foreign key in y").

Namely a JOIN