ivoa-std / DataLink

DataLink standard (DAL)

clarify use of multiple ID values and overflow; resolves #45 #49

Closed pdowler closed 3 years ago

pdowler commented 3 years ago

clarify that all links for an ID must be in consecutive rows of the table

overflow also clarified in DALI: https://github.com/ivoa-std/DALI/pull/8

msdemlei commented 3 years ago

I fully agree with the change of wording on limits; posing the limit in terms of input IDs produces the intended effect much more elegantly than what we previously had.

I'm less sure about the new "consecutive rows" requirement. While I believe this is rarely an issue in practice (although I'd have to check the DaCHS code; it could be that all #proc links are grouped at the end right now), it certainly has the potential to invalidate existing services, and in any case it's an extra constraint that I think should be backed by a clear rationale.

So: why do you (or someone else?) want this, and what use case does this enable or facilitate?

pdowler commented 3 years ago

The "group all links for an ID" requirement came up in a recent DAL telecon (@Zarquan) and was thought to be something implied or intended but not explicitly stated. The use case is that a client submits multiple ID values to a links endpoint and then takes actions while it iterates through the results. Once it sees that the next ID in the result is different, it knows it has all the links for the current ID and can finish that chunk of work before moving on. Without this requirement, the client would have to read the entire result into memory to check whether there are more links for a specific ID.
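The consumption pattern described above can be sketched as follows; a minimal illustration assuming rows arrive as `(id, access_url, semantics)` tuples (the row layout and helper names are hypothetical, not from any real client library):

```python
# Sketch of streaming consumption of a multi-ID DataLink response,
# assuming all links for one ID occupy consecutive rows. The row layout
# (id, access_url, semantics) is illustrative only.
from itertools import groupby
from operator import itemgetter

def process_links(rows):
    """Finish each ID's chunk of work as soon as the ID changes."""
    results = []
    for cur_id, links in groupby(rows, key=itemgetter(0)):
        # only the current ID's links are materialized at once
        results.append((cur_id, [url for _, url, _ in links]))
    return results

rows = [
    ("ivo://x/a", "http://example.org/a.fits", "#this"),
    ("ivo://x/a", "http://example.org/a-preview.png", "#preview"),
    ("ivo://x/b", "http://example.org/b.fits", "#this"),
]
grouped = process_links(rows)
```

Note that `itertools.groupby` only groups adjacent equal keys, which is exactly why the "consecutive rows" requirement matters: without it, the client would need a dict keyed by ID and the whole result in memory.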

In the past, we have come to regret designs that do not permit streaming (the special NULL value in VOTable led to a per-row null mechanism being added, TAP services cannot generally stream FITS tables because they have to know the row count, etc), so this is more of a general design choice to enable streaming.

I see streaming as a valuable feature that will make DataLink suitable for more kinds of problems and usage scales. However, I agree that it could make existing DataLink services not forwards compatible. That means clients would have to know if they are talking to a 1.0 links endpoint and take precautions... we could canvass people who have implemented DataLink to see if this is a problem, but the fact that they are probably not registered is something to consider.

We could pull this one thing out and put forward an erratum -- that would give a formal process for impact/feedback, and if it is accepted it won't be a version-specific thing to worry about in the future. This sort of matches the impetus to add this change to the text.

pdowler commented 3 years ago

In the CADC web portal, users query and can select datasets for download (checkboxes). The next stage is to call the links endpoint for each selected row and pick the appropriate links from there. If you call links with one ID at a time, you have a lot of extra overhead compared to calling it with multiple ID values in a batch; there is a point after which increasing the batch size further no longer helps -- so for the client there is an optimal batch size and that size is larger than the amount of data I like to keep in memory at one time, hence a design that allows streaming and finite/fixed memory use.
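The batching just described can be sketched as a simple chunking helper; the batch size here is a made-up number and the actual call to the links endpoint is omitted:

```python
# Illustrative sketch: split the user's selected IDs into fixed-size
# batches so each call to the links endpoint carries multiple ID values.
# The batch size would be tuned per deployment; 50 is arbitrary.
def batches(ids, size=50):
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

chunks = list(batches(["a", "b", "c", "d", "e"], size=2))
```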

In our server side implementation, I also stream the links output so naturally all the links for an ID value are grouped together in the output. That code only needs to keep one ID's worth of output in memory at one time so while you can always DoS it with many calls (without an async mode, that's a problem for the deployer to solve and not my code) it can't be easily killed with a few large calls.
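That server-side shape can be sketched as a generator; `find_links` is a hypothetical stand-in for whatever per-ID lookup the service actually does:

```python
# Sketch of a streaming links endpoint: only one ID's worth of rows is
# held in memory at a time, and rows for an ID are naturally consecutive
# in the output. find_links is an assumed per-ID lookup, not a real API.
def find_links(ident):
    # stand-in for a database query or file scan
    return [(ident, f"http://example.org/data/{ident}", "#this")]

def stream_links(ids):
    for ident in ids:
        rows = find_links(ident)  # one ID's worth of output
        yield from rows           # emitted before moving to the next ID

out = list(stream_links(["a", "b"]))
```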

Of course, I can still do things this way without that "consecutive" requirement, but then my DataLink client code will be relying on behaviour of my own service. yuck.

msdemlei commented 3 years ago

On Tue, Nov 03, 2020 at 09:52:13AM -0800, Patrick Dowler wrote:

> overhead compared to calling it with multiple ID values in a batch; there is a point after which increasing the batch size further no longer helps -- so for the client there is an optimal batch size and that size is larger than the amount of data I like to keep in memory at one time, hence a design that allows streaming and finite/fixed memory use.

So, what if you said something like: "To facilitate consumption of large datalink results in streaming mode, all links for a single ID value MUST be served in consecutive rows in the output."? I, for one, would be happy with this level of implicit use case binding.

The changelog then should say something like:

"Potentially breaking change: we now require links for a single ID to be consecutive in the output table. Since this is very natural in implementation, we believe all existing services are already compliant, so we do not see this as a reason for a new major version."

Or so. Of course, that's not really true, because DaCHS currently doesn't work that way -- as I had suspected, it will return the rows in some happy disorder because for its implementation, it wasn't natural at all to keep them grouped. But then, of course, DaCHS' implementation grew out of the original getData proposal (that, in particular, didn't have multi-ID to begin with) and thus probably hasn't chosen the most natural approach for datalink.

Anyway, I'm not sure that any DaCHS service has ever served a "production" multi-ID request, and if I put in a change now, by the time such requests might come in, most deployers will probably have upgraded.

Just one question: What do you do with the service blocks if you stream? You'll have to accumulate these anyway, no?
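One way to square that with streaming (a hypothetical sketch, not how any particular implementation works): stream the table rows but accumulate the distinct service descriptors referenced via `service_def`, writing them out after the table closes -- there are typically far fewer descriptors than rows, so buffering them is cheap:

```python
# Hypothetical sketch: rows are streamed, but the service-descriptor
# blocks referenced by service_def are collected and emitted after the
# table. `descriptors` maps service_def IDs to pre-built descriptor
# text; all names and the output markup here are illustrative.
def stream_with_descriptors(rows, descriptors):
    out = ["<TABLE>"]
    seen = []
    for ident, service_def in rows:
        out.append(f"row: {ident} {service_def or ''}".strip())
        if service_def and service_def not in seen:
            seen.append(service_def)  # usually few distinct descriptors
    out.append("</TABLE>")
    out.extend(descriptors[sd] for sd in seen)  # appended after the table
    return out

doc = stream_with_descriptors(
    [("a", "proc"), ("b", None)],
    {"proc": "<RESOURCE ID='proc'/>"},
)
```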

pdowler commented 3 years ago

Changing language to tie to streaming use case.