ivoa-std / DataLink

DataLink standard (DAL)

Corresponding row identifier in links response table #88

Closed mbtaylor closed 1 year ago

mbtaylor commented 1 year ago

I would like to be able to identify "corresponding" rows in the DataLink tables associated with different rows of the same catalogue.

Example: a user looks at a DataLink table for a row from the Gaia DR3 source catalogue. They choose to plot data from the row corresponding to the XP Sampled Spectrum. They then choose a different source catalogue row, and are likely to want the XP Sampled Spectrum again. But there is currently no reliable mechanism for the client to identify the row in the second DL table (if any) that's "the same as" or "corresponding to" the row the user selected in the first DL table. At present I'm hacking round this in topcat by matching parts of the semantics and description columns. I identified this issue in my Datalink Feedback presentation at the Victoria 2018 Interop (see the slide marked 7/8 "Row Correspondance"), but I don't think it's made it into a github issue till now.

So I would like to introduce a new optional column in the links response table (I don't know what's the best name - "type_code"?) into which services can place a suitable identifier (meaningful only within that service). In my Gaia DR3 example, all XP Sampled spectra could have the value "1", all RVS spectra "2", etc.

msdemlei commented 1 year ago

This sounds reasonable to me, although I'm not so wild about the term "type_code", which combines two of the most overloaded terms in IT. Actually, I can top this: resource_type_code.

More seriously, what this should be is an identifier to build correspondences between rows in different datalink responses from the same service. So, in a way, having "correspondence" in the name would be nice, except that's too many chars to type and too many letters to print in table headers and the like. In particular, when one would write "correspondence_association". Perhaps "corr_assoc"? Or "assoc_id"?

"cross" is nice and short, and since it's an identifier: "cross_id"? But that would sound a bit as if it were a true foreign key, which this is not (mainly because there are many tables, and they're all peers).

In the end, what I'm least unhappy with at this point would probably be:

column name: local_semantics
type: text
UCD: meta.id.assoc
description: An identifier that allows clients to associate rows from different datalink documents on the same service with each other.

I've briefly thought about making this an int, but that would greatly increase the risk of creating collisions between local_semantics of different services. These aren't bad in theory (they're local identifiers, after all), but in practice there will be bugs, and these will make clients try to correlate responses from different services. Having "bp spectrum" rather than "1" here will give some resilience here.

A very relevant point, though: I think to make this work as Mark imagines it, we have to require that within each document, a local_semantics value must not occur twice, as otherwise a client would still be confused about which row to choose.
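For illustration, here is a minimal sketch of how a client might use such a column to carry a user's selection from one links table to another. It assumes each row of a links response has been parsed into a dict keyed by column name; the function name `find_corresponding`, the sample values "xp_sampled" and "rvs", and the example.org URLs are all hypothetical and not part of any specification.

```python
from typing import Dict, List, Optional

Row = Dict[str, str]  # one row of a links response, keyed by column name


def find_corresponding(selected: Row, other_table: List[Row]) -> Optional[Row]:
    """Return the row of other_table matching selected's local_semantics.

    Returns None if the optional column is missing or empty, if there is
    no match, or if the match is ambiguous (more than one candidate).
    """
    key = selected.get("local_semantics")
    if not key:
        return None
    matches = [r for r in other_table if r.get("local_semantics") == key]
    return matches[0] if len(matches) == 1 else None


# Hypothetical links tables for two rows of the same source catalogue.
links_3023 = [
    {"ID": "3023", "access_url": "https://example.org/xp/3023",
     "semantics": "#auxiliary", "local_semantics": "xp_sampled"},
    {"ID": "3023", "access_url": "https://example.org/rvs/3023",
     "semantics": "#auxiliary", "local_semantics": "rvs"},
]
links_3302 = [
    {"ID": "3302", "access_url": "https://example.org/xp/3302",
     "semantics": "#auxiliary", "local_semantics": "xp_sampled"},
    {"ID": "3302", "access_url": "https://example.org/preview/3302",
     "semantics": "#preview"},  # no local_semantics given for this row
]

# The user picked the XP sampled spectrum for 3023, then moved to 3302.
match = find_corresponding(links_3023[0], links_3302)
```

Returning None in the ambiguous case is a design choice of this sketch: it reflects the concern that a repeated value leaves the client guessing which row to choose.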

Well: Should we take this to the DAL list?

mbtaylor commented 1 year ago

Markus, thanks for comments. local_semantics is OK by me. I'm happy to use a MUST-be-unique-per-table value and a text value as, and for the reasons, you suggest; however since this is going to be a best-efforts affair (the column is optional after all) a SHOULD-be-unique-per-table value and any datatype preferred by data provider would be OK by me too.

Procedurally: should this go to the DAL list or is a github issue the right place for the discussion? Should I have started off on the DAL list (I wasn't sure). I don't have strong feelings. A call for DAL chair/vice-chair (@jd-au / @gmantele)?

Bonnarel commented 1 year ago

I think you could post this to the DAL list, Mark: it is more visible than GitHub discussions (unless James/Gregory object, of course).

jd-au commented 1 year ago

No objections to it going to the list - others may find it useful and/or have suggestions.

mbtaylor commented 1 year ago

Done: http://mail.ivoa.net/pipermail/dal/2022-August/008599.html

Bonnarel commented 1 year ago

Dear Mark, Markus,

On 25/07/2022 at 15:56, msdemlei wrote:

> column name: local_semantics
> type: text
> UCD: meta.id.assoc
> description: An identifier that allows clients to associate rows from different datalink documents on the same service with each other.

Sounds good. +1

> I've briefly thought about making this an int, but that would greatly increase the risk of creating collisions between local_semantics of different services. These aren't bad in theory (they're local identifiers, after all), but in practice there will be bugs, and these will make clients try to correlate responses from different services. Having "bp spectrum" rather than "1" here will give some resilience here.

+1

> A very relevant point, though: I think to make this work as Mark imagines it, we have to require that within each document, a local_semantics value must not occur twice, as otherwise a client would still be confused about which row to choose.

This point I don't understand.

a) The {links} response document may be valid for more than a single ID (Solar System and EPN-TAP have use cases for that), so the whole document is too broad a scope for the constraint.

b) Even if we restrict the constraint to one local_semantics value for each ID, I'm not sure. Why should we have only a single XP spectrum for each ID?

Cheers

François

msdemlei commented 1 year ago

Hi François,

On Tue, Oct 11, 2022 at 01:43:03AM -0700, François Bonnarel wrote:

>> A very relevant point, though: I think to make this work as Mark imagines it, we have to require that within each document, a local_semantics value must not occur twice, as otherwise a client would still be confused about which row to choose.
>
> This point I don't understand.
>
> a) the {links} response document may be valid for more than a single ID (Solar System and EPN-TAP have use cases for that). So the document is too much.

True; it would be per id rather than per document.

> b) even if we restrict the constraint to one local_semantics value for each ID, I'm not sure. Why should we have one single XP spectrum for each ID?

Well, Mark wants to know which link to display when TOPCAT jumps from id to id. For instance, if a user looks at "RP spectrum" of star 3023, it stands to reason they'll also want to view the "RP spectrum" when they change to star 3302. When there are two "RP spectra" in the datalink response for star 3302, he's back to guesswork, and the proposal's goal is to remedy that.

I think Mark is more relaxed about this than I am, and so I won't insist on "Gah! You're invalid if you have two local_semantics!" if people feel that's too strong. But I think it would be wise to at least recommend ("should", i.e., validators would be entitled to raise warnings) having unique local_semantics per id.
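As a rough sketch of what such a validator warning could look like: the check below groups rows by ID and reports any local_semantics value that occurs more than once for the same ID. The function name `duplicate_local_semantics`, the dict-per-row representation, and the sample values are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, List, Tuple


def duplicate_local_semantics(rows: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return sorted (ID, local_semantics) pairs that occur more than once.

    Rows without a local_semantics value are ignored, since the column
    is optional.
    """
    counts = Counter(
        (r["ID"], r["local_semantics"])
        for r in rows
        if r.get("local_semantics")
    )
    return sorted(pair for pair, n in counts.items() if n > 1)


# Hypothetical links response covering two IDs.
rows = [
    {"ID": "3023", "local_semantics": "rp_spectrum"},
    {"ID": "3023", "local_semantics": "bp_spectrum"},
    {"ID": "3302", "local_semantics": "rp_spectrum"},
    {"ID": "3302", "local_semantics": "rp_spectrum"},  # repeated for this ID
    {"ID": "3302"},                                    # column left empty
]

# A "should" rule maps to warnings rather than errors.
warnings = [
    f"local_semantics '{sem}' is not unique for ID '{id_}'"
    for id_, sem in duplicate_local_semantics(rows)
]
```

The same value appearing under different IDs (here "rp_spectrum" for both 3023 and 3302) is not flagged, in line with scoping the rule per id rather than per document.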

-- Markus

mbtaylor commented 1 year ago

Addressed by PR #97