spraakbanken / karp-backend

Karp backend
MIT License

Data consistency: queue tasks in transaction #277

Open majsan opened 3 months ago

majsan commented 3 months ago

We use MariaDB for persistence and Elasticsearch (ES) for search. When we save something in MariaDB, we cannot be sure that it will also be added to ES, since ES might be down or fail to respond for some reason. Currently we have no way to even detect when that happens (though in theory it should be visible in the logs).

To solve this, add a new table for queuing tasks to be done in the background. A task can be adding, deleting or updating an entry in ES. Tasks are inserted in the same transaction as the corresponding add, delete or update in MariaDB. A background process reads the tasks table and performs the tasks. When a task has completed, the process removes its row from the table.
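A minimal sketch of the "same transaction" idea, under stated assumptions: `sqlite3` stands in for MariaDB so the example is self-contained, and the table names `entries` and `es_tasks`, their columns, and the function names are hypothetical, not taken from karp-backend.

```python
import json
import sqlite3


def init_db(conn):
    """Create the (hypothetical) entries table and the task queue table."""
    with conn:
        conn.execute(
            "CREATE TABLE entries (id TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )
        conn.execute(
            "CREATE TABLE es_tasks ("
            " task_id INTEGER PRIMARY KEY AUTOINCREMENT,"  # gives a processing order
            " op TEXT NOT NULL,"                           # 'index' or 'delete'
            " entry_id TEXT NOT NULL,"
            " payload TEXT)"
        )


def save_entry(conn, entry_id, body):
    """Write the entry and queue its ES task in one transaction."""
    payload = json.dumps(body)
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute(
            "INSERT OR REPLACE INTO entries (id, body) VALUES (?, ?)",
            (entry_id, payload),
        )
        conn.execute(
            "INSERT INTO es_tasks (op, entry_id, payload) VALUES ('index', ?, ?)",
            (entry_id, payload),
        )


def delete_entry(conn, entry_id):
    """Delete the entry and queue the matching ES delete in one transaction."""
    with conn:
        conn.execute("DELETE FROM entries WHERE id = ?", (entry_id,))
        conn.execute(
            "INSERT INTO es_tasks (op, entry_id, payload) VALUES ('delete', ?, NULL)",
            (entry_id,),
        )
```

Because both statements share one transaction, either the entry and its task are both stored or neither is, so a committed change can no longer go missing from the queue.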

What can still happen is that the background worker fails to remove the row from the tasks table even though the task succeeded. Because of this, it is important that the tasks are idempotent, i.e. can be run again without breaking data consistency. The order of the tasks also matters, for example if the same entry is edited again before the worker has processed the first edit.
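A rough sketch of a worker with these two properties, again under assumptions: `sqlite3` stands in for MariaDB, a plain `dict` stands in for the ES index, and the `es_tasks` schema and function names are hypothetical. Tasks are applied in `task_id` order, and each operation overwrites or removes by id, so replaying a task whose row was never removed leaves the index unchanged.

```python
import json
import sqlite3

# Hypothetical task table; task_id defines the processing order.
SCHEMA = (
    "CREATE TABLE es_tasks ("
    " task_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " op TEXT NOT NULL,"
    " entry_id TEXT NOT NULL,"
    " payload TEXT)"
)


def apply_task(index, op, entry_id, payload):
    """Apply one task to the index. Both operations are idempotent."""
    if op == "index":
        index[entry_id] = json.loads(payload)  # overwrite by id
    elif op == "delete":
        index.pop(entry_id, None)  # deleting a missing document is a no-op
    else:
        raise ValueError(f"unknown op: {op}")


def run_worker_once(conn, index):
    """Process all queued tasks in insertion order; return how many were run.

    Any exception stops the loop, so a later task never overtakes an
    earlier one that is still in the table.
    """
    rows = conn.execute(
        "SELECT task_id, op, entry_id, payload FROM es_tasks ORDER BY task_id"
    ).fetchall()
    for task_id, op, entry_id, payload in rows:
        apply_task(index, op, entry_id, payload)
        with conn:  # remove the row only after the task succeeded
            conn.execute("DELETE FROM es_tasks WHERE task_id = ?", (task_id,))
    return len(rows)
```

If the worker crashes between applying a task and deleting its row, the next run starts again from that same task and then continues in order, which ends in the same index state.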

nick8325 commented 3 months ago

We also need some way of handling plugins. E.g. in the places repository we use the link plugin to fetch info from the municipalities repository. Then if we e.g. update the entry for Göteborgs kommun we also need to reindex all places that refer to Göteborgs kommun.

nick8325 commented 3 months ago

In particular, here are some things that ought to work:

  1. If we update or delete the entry for Göteborgs kommun, then we should reindex the entry for Göteborg.
  2. Suppose that to begin with we have an entry for Göteborg but no entry for Göteborgs kommun (so the link plugin returns no data). If we add an entry for Göteborgs kommun, we should reindex the entry for Göteborg.
  3. If we update the resource config for the municipalities resource, then we should reindex the places resource too.

I had been thinking about keeping track of when one entry depends on another (e.g. Göteborg depends on Göteborgs kommun), but that doesn't work in case 2 since there's no existing entry to depend on. I guess in this case we could run an Elasticsearch query to find out which places to update - give me all places that refer to Göteborgs kommun (or rather municipality = 1480)? Not sure how this would work in general.
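As a sketch of what such a lookup could look like: the request body below uses the standard Elasticsearch `term` query, but the field name `municipality`, the resource name `places`, the `reindex` task shape and both function names are assumptions for illustration. No client call is made here; the response is assumed to have the usual `hits.hits` structure of a search response.

```python
def referring_entries_query(field, value):
    """Build a search body matching entries whose `field` equals `value`.

    Note: Elasticsearch returns only 10 hits by default, so a real
    implementation would need to page through all results.
    """
    return {
        "query": {"term": {field: value}},
        "_source": False,  # only the ids are needed to queue reindexing
    }


def reindex_tasks_from_response(resource, response):
    """Turn a search response into one (hypothetical) reindex task per hit."""
    return [
        {"op": "reindex", "resource": resource, "entry_id": hit["_id"]}
        for hit in response["hits"]["hits"]
    ]


# Usage: find all places that refer to municipality 1480.
body = referring_entries_query("municipality", 1480)
```

This also covers case 2, as long as the place entry stores the municipality code itself, since the query matches on the stored code rather than on linked data.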

majsan commented 3 months ago

Using Elasticsearch to find all references is probably enough, and that lookup has to be a task in the task table as well, created in the same transaction as the original edit.

majsan commented 3 months ago

Since the linking is done in a plugin, I guess the plugin should be responsible for this behavior. Not sure how to trigger it though, since the resource being referenced doesn't know anything about the plugin.