Revise model retrieval api for scalability

karlcz commented 7 years ago

The current idiom is that apps like Chaise end up retrieving the entire catalog model via GET /ermrest/catalog/N/schema so they can use model-aware techniques which require knowledge of relationships between tables.

However, as catalog models get more complex, this request becomes slower and the response larger. This is due to several factors:

The actual model in the service is represented as a graph of many small Python objects in our own classes.
The serialization is a two-step process: conversion to basic dict, list, and scalar types, followed by JSON serialization of those generic types. This involves allocation and recursion over many objects, all the way down to the column-level granularity of the model (and the per-column annotations and ACLs).
The pre-JSON conversion includes computing custom values per client (rights summaries etc.), and the cost of this also scales with graph size.
Due to these client-specific values, the response can be neither precomputed nor cached and reused across clients.
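
To make the cost factors above concrete, here is a minimal sketch of a two-step serializer with per-client rights. The class names, the `prejson` method, the `rights` field, and the ACL layout are all assumptions for illustration and are not taken from the actual service code.

```python
import json


class Column:
    """Hypothetical stand-in for one small model object."""

    def __init__(self, name, acls):
        self.name = name
        self.acls = acls  # assumed layout, e.g. {"select": ["*"], "update": ["curators"]}

    def prejson(self, client_roles):
        # Step 1: convert to plain dict/list/scalar types.
        # The rights summary depends on the requesting client's roles,
        # so this work is repeated for every column on every request.
        rights = {
            access: "*" in allowed or bool(client_roles & set(allowed))
            for access, allowed in self.acls.items()
        }
        return {"name": self.name, "acls": self.acls, "rights": rights}


class Table:
    """Hypothetical stand-in for a table holding many columns."""

    def __init__(self, name, columns):
        self.name = name
        self.columns = columns

    def prejson(self, client_roles):
        # Recursion reaches every column of the table.
        return {
            "table_name": self.name,
            "column_definitions": [c.prejson(client_roles) for c in self.columns],
        }


def serialize(table, client_roles):
    # Step 2: generic JSON serialization of the plain structure.
    return json.dumps(table.prejson(client_roles), sort_keys=True)
```

In this sketch, two clients with different roles get different documents from the same model object, which is why a single cached response could not be shared between them.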

The most obvious approaches to improving costs are:

Revise model document syntax to be more compact
- Requires an upgrade/transition story to deal with client compatibility
Try to reduce runtime cost of current serializer to defer the problem
- No change for client compatibility
- Not clear how limited the impact on internal service code might be, nor how feasible this is
Redesign the API so that clients can retrieve a subset of the model in practice
- Requires an upgrade/transition story to deal with client compatibility
- Might get too baroque to try to address the kinds of subsets needed by real clients

A hybrid approach for the last item above would be to just augment the current API with a list of inbound references for each table. A client could then walk the graph via a chain of much cheaper calls; with HTTP/2, such a chain may not be too costly. This is possible because the current API already supports retrieving individual table documents instead of all schemas at once.

Get central table's description (exists now)
Get description for each distinct table listed in central table's outbound foreign keys (exists now)
Get description for each distinct table listed in central table's inbound foreign keys (new feature)
Get description for each distinct table listed in central table's alternative tables annotation (exists now)
Recursively repeat the above to some depth to handle UX denormalization requirements
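
The steps above can be sketched as a depth-limited, breadth-first walk. This is only an illustration: `fetch_table` is a caller-supplied function (in a real client it would issue one GET per table document), and the `outbound`, `inbound`, and `alternatives` keys are a hypothetical, simplified document layout rather than the real table document format.

```python
from collections import deque


def neighbors(doc):
    """Collect (schema, table) pairs referenced by one table description.

    Assumes a simplified, hypothetical layout where each of the three keys
    holds a list of [schema, table] pairs.
    """
    found = []
    for field in ("outbound", "inbound", "alternatives"):
        found.extend(tuple(ref) for ref in doc.get(field, []))
    return found


def walk_model(fetch_table, start, max_depth=2):
    """Fetch the central table, then related tables out to max_depth hops.

    fetch_table(schema, table) -> dict is supplied by the caller.
    Returns {(schema, table): description}, fetching each table once.
    """
    seen = {start: fetch_table(*start)}
    queue = deque([(start, 0)])
    while queue:
        key, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand tables at the depth limit
        for neighbor in neighbors(seen[key]):
            if neighbor not in seen:
                seen[neighbor] = fetch_table(*neighbor)
                queue.append((neighbor, depth + 1))
    return seen
```

A client would call something like `walk_model(fetch, ("schema", "central_table"), max_depth=2)`; the depth limit is what bounds the number of requests, so it would need to be chosen to match the UX denormalization requirements.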

karlcz commented 7 years ago

@hongsudt @robes

karlcz commented 6 years ago

Just a note for any future return to this topic: any attempt to discover the correct sparse subset of the model for a given Chaise application instance would also require interpreting the new pseudo-column annotations, which may pull in more tables that are further from the main table being displayed.

informatics-isi-edu / ermrest

Revise model retrieval api for scalability #174