wardle / hermes

A library and microservice implementing the health and care terminology SNOMED CT with support for cross-maps, inference, fast full-text search, autocompletion, compositional grammar and the expression constraint language.
Eclipse Public License 2.0
177 stars 21 forks source link

If indexing occurs in-place, then results during indexing may be incorrect #64

Closed wardle closed 7 months ago

wardle commented 7 months ago

The usual and suggested approach is to build immutable data files which are then used in production systems as-is.

There has been some interest in simpler deployments in being able to update-in-place. That means a file-based index is updated while it is being used by another process.

Both LMDB and Apache Lucene would support in-place updates. However, the current approach importing into the key-value store adopts a two-phase approach, in which entities such as descriptions and relationships are updated, and then all indexes are deleted and re-built. It is therefore possible that processes reading the index while it is being written read incorrect information.

LMDB serialises writes, and reads can run in parallel with a write transaction.

Options:

  1. Ignore the issue and recommend that users should not update in place
  2. During import, continue the 2-step write/index process but write values and then delete and re-index in a single write transaction. Readers will then always see 'correct' information
  3. Re-write import so that each write is indexed.

Option (1) works for me, but limits use for some users who want to update-in-place. Option (2) could easily be implemented by either using a parent transaction and nested r/w transactions or reworking to simply use one transaction for both. Option (3) is more complex to get right, because we'd have to incrementally write the index entries based on effectiveTime and active parameters.

The problem with option 3 can be illustrated with an example. Firstly an import occurs that imports a relationship. For each relationship, we create a parent index source--type--group--destination and a child index destination--type--group--source. Using these indices, we can then rapidly determine the child or parent concepts given a concept.

Imagine in one release there is an active relationship. Thus we create the appropriate indices. Now imagine we import two releases one after another but there is no guarantee that newer items will be imported after older items. As such, we may import an older item - this should be ignored if there is already a newer relationship. If not, and the relationship is active, we should create the required index entries. If the relationship is inactive, then we need to delete any existing index items for that relationship. The problem is that a different relationship id may be created for the same source-type-group-destination - so the issue then is, as the relationship index does not track the relationship id, only source--type--group--destination. We may end up removing an entry from the index because a relationship is inactive but the index should exist because that tuple was defined in a new active relationship. The fix for that is to include the relationship id in the tuple, at the end, but that a) increases index size, and, b) potential causes duplicates.

The simpler option is to delete all index entries and recreate in a single transaction. However, this does potentially mean that an update-in-place database will increase in size as there is more churn during index creation. A progressive approach to indexing, albeit with slightly more complex logic will likely allow lmdb to avoid creating additional empty pages as most index entries will not actually be deleted. This could be used for the refset index, which does include the UUID of the reference set, making it safe to index progressively.

The answer is therefore to benchmark indexing and assess database file size growth during index updates. A simple method would be build a test suite to import synthetic data and index, and then re-import the same data and re-index. Firstly, we'd expect data integrity from concurrent reads so that, for example, at no point does a request for the relationships of a concept return no results. Secondly, we'd expect minimal database file growth. Concurrent reads may lock pages and so force some growth, and that will be unavoidable.