ispras / lingvodoc-react

Apache License 2.0

Rewrite sync module in optimal way #974

Open al-indigo opened 1 year ago

al-indigo commented 1 year ago

We have a module that is responsible for cross-site sync. For historical reasons, that module (https://github.com/ispras/lingvodoc/blame/heavy_refactor/lingvodoc/views/v2/sync.py) works in batch mode: it grabs all the changes that we need to sync and sends them in a single request/transaction.

That will not work anymore (at all), since it consumes an enormous amount of memory. We need to do it another way. The main points are:

  1. The code should split the sync process into reasonable (and configurable) parts. Instead of grabbing all the dictionaries with all their contents, we should sync in reasonable parts: the dictionaries themselves, then batches of lex entries, etc., in configurable batch sizes (100 per sync request, for example).
  2. The tasks should be implemented as our async tasks (which can be executed by Celery or by our forking mechanism).
  3. The sync tasks should cache the sync result in a local cache. If the main server has successfully consumed an update (or given one to us), we should mark the synced data in Redis instead of re-acquiring the state of the main server.
  4. The sync tasks should form a linear queue. No lex entries should be sent to the main site until their parent objects (e.g. dictionaries, perspectives, fields) have actually been created. Their existence should be proved by 2.
  5. The whole login process should use the external IAM service: Keycloak. We should authenticate and verify authorization for objects through our central Keycloak instance. @princessfruittt should eventually come up with a way to provide a sequence of client_ids on this side.
  6. (possibly hard) If it's possible to prioritize these sync transactions, they must have the highest priority. I haven't found a way to do it through SQLAlchemy yet.
  7. (possibly hard) If there is a way to use near-binary protocols for data transmission/serialization/deserialization, it makes sense to use it here. I know about protobuf/gRPC/HTTP/2, but I'm not sure whether there is something more appropriate here.
  8. The web interface for this feature already exists. Maybe it needs to be revised (but I'm OK with the latest version, except that it is synchronous). After changes 1-5 it obviously should be async.
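
To make points 1, 3 and 4 a bit more concrete, here is a rough sketch in plain Python. It is not taken from the existing sync.py. Every name in it (`SYNC_ORDER`, `plan_batches`, `run_sync`, the `send` callback, the `(client_id, object_id)` pairs) is made up for illustration, and an in-memory `set` stands in for the Redis cache. In a real version `send` would be the request to the main site, and each batch would run as one of our async tasks.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Set, Tuple

# Hypothetical identifier shape: a (client_id, object_id) pair.
ObjectId = Tuple[int, int]
Batch = Tuple[str, List[ObjectId]]

# Assumed dependency order: parents go first, lex entries last (point 4).
SYNC_ORDER = ("dictionary", "perspective", "field", "lexical_entry")


def chunked(items: Iterable[ObjectId], size: int) -> Iterator[List[ObjectId]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def plan_batches(pending: Dict[str, List[ObjectId]],
                 batch_size: int = 100) -> List[Batch]:
    """Build a linear queue of batches, parents before children (points 1, 4)."""
    plan: List[Batch] = []
    for kind in SYNC_ORDER:
        for batch in chunked(pending.get(kind, []), batch_size):
            plan.append((kind, batch))
    return plan


def run_sync(plan: List[Batch],
             send: Callable[[str, List[ObjectId]], bool],
             synced: Set[Tuple[str, ObjectId]]) -> bool:
    """Send batches in order and stop at the first failure.

    `synced` is a stand-in for the local cache from point 3; with Redis it
    could be a set updated with SADD and checked with SISMEMBER.
    """
    for kind, batch in plan:
        # Skip objects the main site has already confirmed.
        todo = [oid for oid in batch if (kind, oid) not in synced]
        if not todo:
            continue
        if not send(kind, todo):
            # Nothing after this batch is sent, so children never
            # overtake parents that are not confirmed yet.
            return False
        synced.update((kind, oid) for oid in todo)
    return True
```

Because confirmed objects are remembered in `synced`, calling `run_sync` again after a failure resumes from the first unconfirmed batch instead of asking the main server for its state again.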

al-indigo commented 1 year ago

This issue has low priority for now (let's look at it at tomorrow's meeting, though), but it is pretty important in itself and seems to be time-consuming, so I've pinned it for a while.