New table API endpoints for lower level table access

knabar commented 2 weeks ago

This PR adds two new endpoints for lower level table access:

For querying: /webgateway/table/123/rows/ takes a query argument in the usual format and returns matching row numbers. No actual data is returned. Limited paging support by always returning up to MAX_TABLE_SLICE_SIZE rows and taking a start argument from where to start the search. rowCount and end are returned to easily detect the end of results or need for additional calls (with new start set to previous end).
For data retrieval: /webgateway/table/123/slice/ takes numeric lists of rows and columns and performs a slice. Data is returned in columnar format (vs. rows in the current table API). The number of requested rows times columns cannot exceed MAX_TABLE_SLICE_SIZE, which defaults to 1 million. Both GET and POST are supported to allow for large numbers of rows and columns (exceeding the possible query string length). For retrieval of consecutive rows or columns, ranges can be specified as e.g. 1-5 instead of 1,2,3,4,5.

Examples:

/webgateway/table/123/rows/?query=object%3E100000&start=8470

{"rows":[8470,8471,8472,8473,8474,8475,8476],"meta":{"rowCount":8477,"start":8470,"end":8477}}

/webgateway/table/123/slice/?rows=8005-8010&columns=0-1

{"columns":[[114391,114392,114393,114394,114395,114396],[NaN,NaN,NaN,NaN,NaN,NaN]],"meta":{"columns":["object","name"],"rowCount":8477}}

Notes:

Added a way to supply additional formatting options to json.dumps, allowing e.g. for removing whitespace from the output
Unrelated to this PR, since NaNs are allowed in the output but not supported by plain JSON, clients can use JSON5 for parsing

will-moore commented 1 week ago

It looks like rowCount and end are based on the whole table rather than the results of the query: e.g. on merge-ci with webgateway/table/22909/rows/?query=(Count>12)%26(Count<23) I'm seeing:

{
"rows": [
  1,
  3
],
"meta": {
  "rowCount": 13,
  "start": 0,
  "end": 13
}
}

Is that expected?

knabar commented 1 week ago

@will-moore Yes, that is expected.

The total number of results is not available, since the query is not run on the whole table, but only from start to start+MAX_TABLE_SLICE_SIZE.

Figuring out the total is left to the client, by adding up the number of returned rows for each call. For tables with less than MAX_TABLE_SLICE_SIZE rows, it'll be just one call anyway.

end is not actually the number of rows of the table, but the row where querying ended, so the client can easily continue querying by passing in the end of the first query as the start of the second. end==rowCount indicates that there is nothing else to query.

start is there to indicate at which row the querying started, which is redundant, since the client would know anyway, but perhaps there are situations where it is useful.

will-moore commented 1 week ago

Everything is working just fine for /slice when I use the query correctly, so I tried playing about a bit (Not all these need to be fixed/changed)... If I make a mistake, the error handling is varied:

E.g. If I make an invalid query like this: /slice/?rows=3-1&columns=oops I get a nice "error": "Need to specify comma-separated list of rows and columns".

If I forget what needs to be in the query e.g. /slice/?rows=3-1 I get a less friendly:

"message": "'NoneType' object has no attribute 'split'",
"stacktrace": "Traceback (most recent call last):\n  File \"/home/omero/workspace/OMERO-web/.venv3/lib64/python3.9/site-packages/omeroweb/webgateway/views.py\", line 1447, in wrap\n    rv = f(request, *args, **kwargs)\n  File \"/home/omero/workspace/OMERO-web/.venv3/lib64/python3.9/site-packages/omeroweb/webgateway/views.py\", line 3586, in perform_slice\n    for item in source.get(\"columns\").split(\",\")\nAttributeError: 'NoneType' object has no attribute 'split'\n"

if I use the range the wrong way around I get the whole table (regardless of the number of rows and columns):

?rows=3-1&columns=4-3

If I go out of range,

/slice/?rows=3-1&columns=1-100

I get "error": "Error slicing table". Don't know how hard/expensive it is to check this before trying the slicing to give a more helpful message?

knabar commented 1 week ago

@will-moore added better error handling and checks

knabar commented 1 week ago

Made some more convenience changes:

Added a collapse query string argument for /webgateway/table/123/rows/ that collapses sequential row numbers into the same format that is supported by the /slice/ call, so the results can be passed back in easily while significantly reducing the amount of data transferred:
```
{
"rows": [
2,
3,
"5-7",
9,
10,
"12-20",
"22-27",
"33-35"
],
"meta": {}
}
```
Added additional items to the returned metadata, including columnCount and maxCells to give more information to the client on how to range check requests before submitting them, and partialCount, which is the number of matching rows returned from the /rows/ call. Note that this is not necessarily the number of matches in the whole table, which is why I called it partialCount, but open to suggestions for better names.
```
"meta": {
"partialCount": 25,
"rowCount": 14336,
"columnCount": 105,
"start": 0,
"end": 14336,
"maxCells": 1000000
}
```

chris-allan commented 1 week ago

I don't think having collapse is a good precedent to set. Looping over the results in pure Python is going to perform wildly differently depending on the result set so knowing whether to use collapse is not something that is easy to do.

Edit: Furthermore, I think it's a usability downgrade. The client then would need to decompress the result set of getWhereList() in order to know which rows actually match or use it with slice(). If you want contiguous slices then exposing read() is better anyway as it just takes the column numbers and a start stop.

knabar commented 5 days ago

Removed the collapse option after some performance testing that did not show a worthwhile improvement

ome / omero-web

New table API endpoints for lower level table access #564