[RFC] Dictionaries support

vasiliy-t commented 5 years ago

Motivating example

An application deals with users. Users have attributes, each attribute has a type, and there is a dictionary table (~10k records) linking the attribute type to an attribute name localization table and an attribute groups table. Users are stored on vshard storages and accessed through vshard routers.
There is a REST API /profile endpoint returning user profile info, including grouped attributes with localized names.

Problem statement

This example illustrates a use case for a dictionary. Dictionaries hold few records, they are required to process almost any request, their data is defined by the user rather than by configuration, and after some point the data changes rarely.

There are several ways to deal with dictionaries in a sharded cluster:

- Shard the dictionary records and fetch them on the router with additional requests to the storages. To mitigate the extra network round trips, they can be cached on the router for a short period of time, but this requires additional caching and expiration logic for dictionaries and some cache warm-up time.
- Provide each storage instance with its own full copy of the dictionaries, so each storage can get the required data locally.
- Set up a standalone Tarantool instance to handle dictionaries. However, dictionaries are obviously the most accessed data in this case, so one instance may not be sufficient, and this option also requires additional network round trips.
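To make the "additional caching and expiration logic" of the first option concrete, here is a minimal sketch in plain Lua. Everything in it is hypothetical and not part of vshard: `new_dict_cache`, the `fetch` callback (which stands in for the extra request from the router to a storage), and the injectable `now` clock are names made up for illustration.

```lua
-- Hypothetical router-side cache; not part of vshard.
-- `fetch` stands in for the extra request to a storage,
-- `now` for a clock function (defaults to os.time).
local function new_dict_cache(fetch, ttl, now)
    now = now or os.time
    local entries = {}
    local cache = {}

    -- Return a cached record, refetching it once it is older than ttl.
    function cache.get(key)
        local entry = entries[key]
        local t = now()
        if entry == nil or t - entry.loaded_at >= ttl then
            entry = {value = fetch(key), loaded_at = t}
            entries[key] = entry
        end
        return entry.value
    end

    -- Drop one key, or the whole cache when key is nil.
    function cache.invalidate(key)
        if key == nil then
            entries = {}
        else
            entries[key] = nil
        end
    end

    return cache
end

-- Usage with a fake clock and a fetch function that counts its calls.
local clock, fetches = 0, 0
local cache = new_dict_cache(function(attr_type)
    fetches = fetches + 1
    return 'name-of-' .. attr_type
end, 10, function() return clock end)

cache.get('height')   -- miss: goes to the "storage"
cache.get('height')   -- hit: served from the router-local cache
clock = 11
cache.get('height')   -- expired: fetched again
print(fetches)
```

Even this small version shows the costs named above: the first request for every key pays the round trip (warm-up), and a changed record can be served stale for up to `ttl` seconds.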

Storing a copy of the dictionaries on each storage seems to be the best option, but it requires additional application logic: when a new instance is set up, the dictionaries must be in place before the node starts processing requests, and dictionary updates must be processed consistently on each instance.
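The two requirements above (no requests before the dictionaries are loaded, and consistent updates) could look roughly like the following plain-Lua sketch. It is only an illustration under assumptions: `new_local_dict` and its methods are hypothetical names, and the "accept only the next version" rule is one possible way to keep instances consistent, not something the RFC prescribes.

```lua
-- Hypothetical per-storage dictionary copy; all names are illustrative.
local function new_local_dict()
    local dict = {ready = false, version = 0, records = {}}

    -- Install a full snapshot, e.g. when a new instance is set up.
    function dict.load(snapshot, version)
        dict.records = {}
        for k, v in pairs(snapshot) do
            dict.records[k] = v
        end
        dict.version = version
        dict.ready = true
    end

    -- Apply one update. Only the next version is accepted, so every
    -- instance applies the same updates in the same order.
    function dict.apply(version, key, value)
        if not dict.ready or version ~= dict.version + 1 then
            return false
        end
        dict.records[key] = value
        dict.version = version
        return true
    end

    -- Refuse lookups until the dictionary has been loaded.
    function dict.lookup(key)
        if not dict.ready then
            return nil, 'dictionary is not loaded yet'
        end
        return dict.records[key]
    end

    return dict
end

local dict = new_local_dict()
print(dict.lookup('height'))              -- refused: nothing loaded yet
dict.load({height = 'Height'}, 1)
print(dict.apply(2, 'weight', 'Weight'))  -- accepted: next version
print(dict.apply(4, 'age', 'Age'))        -- rejected: version 3 is missing
```

A rejected update signals that the instance has fallen behind and needs the missing updates or a fresh snapshot before it can continue.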

This seems like a pretty common use case, and it seems reasonable to implement dictionary support directly in vshard.

Gerold103 commented 5 years ago

First, the text is too big, full of your application details, and hard to understand. Please rephrase what you want in more general terms. Second, I am sure that such 'lua-sharding' is not a common need, nor something that cannot be implemented on top of the current vshard in an application. Vshard shards buckets consisting of tuples from spaces, not application-specific or language-specific in-memory data.

Gerold103 commented 5 years ago

Just for the record: I would have understood an idea to additionally shard arbitrary user data, but I did not understand the text in the first comment, nor why it should shard only dictionaries. My proposal would look like this: I provide the user with an interface, a set of hooks that vshard calls when it tries to reshard. The user implements this interface so as to return an iterator from which vshard fetches data and transfers it. On the destination storage another user hook is called, which applies the data. This would make resharding independent of the type of data. An example of an interface to register your iterators:

```lua
--
-- Register a custom sharded storage. Can be different from space.
-- @a storage is an object having methods:
--
-- * storage.iterator(bucket_id)
--       Get an iterator object for a specified bucket and having
--       method next(), returning a next object in this bucket of
--       this storage.
--
-- * storage.store(bucket_id, object)
--       Store an object, transferred from a remote storage.
--
-- * storage.gc(bucket_id)
--       Remove content of a specified bucket.
--
function vshard.storage.register_custom(name, storage)
    -- ...
end
```
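For illustration, an object satisfying the proposed interface could be as small as the in-memory storage below. This is a sketch only: `register_custom` is a proposal in this thread, not an existing vshard function, so the example does not call it. `new_memory_storage` and `transfer_bucket` are hypothetical helpers, the latter standing in for what vshard would do internally when moving a bucket.

```lua
-- In-memory storage implementing the proposed interface.
-- Objects are kept per bucket: data[bucket_id] = {object, ...}.
local function new_memory_storage()
    local data = {}
    local storage = {}

    -- Iterator whose next() returns objects one by one, then nil.
    function storage.iterator(bucket_id)
        local objects = data[bucket_id] or {}
        local pos = 0
        return {
            next = function()
                pos = pos + 1
                return objects[pos]
            end,
        }
    end

    -- Store an object received from a remote storage.
    function storage.store(bucket_id, object)
        data[bucket_id] = data[bucket_id] or {}
        table.insert(data[bucket_id], object)
    end

    -- Remove the content of the bucket.
    function storage.gc(bucket_id)
        data[bucket_id] = nil
    end

    return storage
end

-- Hypothetical stand-in for the resharding step: drain the source
-- iterator, store every object on the destination, then clean up.
local function transfer_bucket(src, dst, bucket_id)
    local it = src.iterator(bucket_id)
    local moved = 0
    local object = it.next()
    while object ~= nil do
        dst.store(bucket_id, object)
        moved = moved + 1
        object = it.next()
    end
    src.gc(bucket_id)
    return moved
end

local src, dst = new_memory_storage(), new_memory_storage()
src.store(1, {id = 10})
src.store(1, {id = 11})
src.store(2, {id = 20})
print(transfer_bucket(src, dst, 1))  -- number of objects moved from bucket 1
```

Because the transfer loop touches only `iterator`, `store` and `gc`, it does not care whether the objects are tuples, Lua tables or anything else, which is the point of the proposal.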

Gerold103 commented 5 years ago

After a verbal discussion it turned out that the 'dictionary table' here is a space that should be fully stored on each instance in the cluster. In fact, this is a feature request for https://github.com/tarantool/tarantool/issues/3982. If it is urgent, this issue can be solved without core support via a special cluster-wide bucket.

tarantool / vshard

[RFC] Dictionaries support #172
