openownership / register

A demonstration transnational register of beneficial ownership data from the UK, Denmark, Slovakia and Armenia
https://register.openownership.org
GNU Affero General Public License v3.0
18 stars 3 forks

Register v2: Combine other sources from SK and DK in BODS Register V2 #164

Open spacesnottabs opened 1 year ago

spacesnottabs commented 1 year ago

Currently when transforming, each of the different BODS v2 sources (PSC, DK, and SK) all get transformed using a different Elasticsearch index.

This is intentional, as it is necessary to output the statements coming purely from each source.

However, the goal of the Register should be to allow the information from the different sources to be combined, as the existing register does for relationships - see Xerox Limited for an example: https://register.openownership.org/entities/59b978b167e4ebf340844101/graph

The natural person owner information comes from the SK datasource, while the company relationship information comes from the PSC datasource.

There are a few options for this:

tiredpixel commented 1 year ago

I've spent some time looking into how the Elasticsearch indexes, the document schema, and the register-v2 search functionality are implemented.

In particular, I focussed on bods_v2_dk_prod100, bods_v2_psc_prod100, and bods_v2_sk_prod100, since there didn't appear to be corresponding DK or SK indexes for bods_v2_psc_prod2.

The three indexes for DK, PSC, and SK appear to have the same schema.

Storing data in multiple Elasticsearch indexes is very common, particularly for time series data. Although using separate indexes for multiple datasources is a little unusual, it amounts to the same thing here, and I foresee no problem with it.

There shouldn't be any need to do this dynamically with multiple queries. Similarly, there shouldn't be any need to create a separate stage or separate index combining the data to accomplish this task's goal. Rather, it's possible to address an Elasticsearch query against an index using wildcards. Alternatively, the magic _all index could be used; however, I wouldn't recommend this, since it would make it much harder to segment multiple imports (e.g. current prod100 vs prod2 suffixes), and there's no need.
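To make the wildcard approach concrete, here is a minimal Ruby sketch of building such a search request. The index pattern and field names follow the examples in this thread; the build_search_request helper is hypothetical, not the repository's actual code, and with elasticsearch-ruby the resulting hash would simply be passed to client.search:

```ruby
require 'json'

# Wildcard pattern matching bods_v2_dk_prod100, bods_v2_psc_prod100,
# bods_v2_sk_prod100, etc., while excluding other import suffixes.
INDEX_PATTERN = 'bods_v2_*_prod100'

# Hypothetical helper: builds a multi-index search request resembling
# the name/names.fullName query shown in this thread.
def build_search_request(term)
  {
    index: INDEX_PATTERN,
    body: {
      query: {
        bool: {
          should: [
            { match_phrase: { name: { query: term, slop: 50 } } },
            { nested: {
                path: 'names',
                query: { match_phrase: { 'names.fullName' => { query: term, slop: 50 } } }
              } }
          ],
          minimum_should_match: 1
        }
      }
    }
  }
end

request = build_search_request('tesco stores')
puts JSON.pretty_generate(request)
# With the elasticsearch-ruby client, roughly: client.search(**request)
```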

By way of example, consider Tesco Stores. Searching for tesco stores in register-v2 reports 8 results but shows 2 rows (presumably as a result of deduplication):

[Screenshot from 2023-07-03 16-15-01]

Debugging shows this is generated by the Elasticsearch query:

    {
      "bool": {
        "must": [
          {
            "bool": {
              "should": [
                { "match_phrase": { "name": { "query": "tesco stores", "slop": 50 } } },
                {
                  "nested": {
                    "path": "names",
                    "query": {
                      "bool": {
                        "must": [
                          { "match_phrase": { "names.fullName": { "query": "tesco stores", "slop": 50 } } }
                        ]
                      }
                    }
                  }
                },
                { "match": { "company_number": { "query": "tesco stores" } } }
              ],
              "minimum_should_match": 1,
              "filter": []
            }
          }
        ]
      }
    }

I haven't looked into why slop is set for each.

I also haven't looked into why there is a match on company_number, since I wasn't able to find this field in the Elasticsearch indexes.

Inspecting the Elasticsearch indexes shows that there are indeed 8 hits for tesco stores in PSC, and an additional 1 hit in SK. There are no hits in DK.

Not being entirely certain of all the effects of overriding BODS_INDEX in register-v2 to point to bods_v2_psc_prod100, and wanting to understand more about how the search functionality currently works, I selectively overrode the search function of the RegisterSourcesBods::Repositories::BodsStatementRepository class in the register-sources-bods repo, rather than changing the env var directly (if that is safe to do, perhaps it can be done instead):

diff --git a/lib/register_sources_bods/repositories/bods_statement_repository.rb b/lib/register_sources_bods/repositories/bods_statement_repository.rb
index a474e88..ecf6a44 100755
--- a/lib/register_sources_bods/repositories/bods_statement_repository.rb
+++ b/lib/register_sources_bods/repositories/bods_statement_repository.rb
@@ -290,6 +293,10 @@ module RegisterSourcesBods
       end

       def search(query, aggs: nil, page: 1, per_page: 10)
+        puts "s" * 80
+        index = 'bods_v2_sk_prod100'
+        puts index
+        puts query.to_json
         if (page.to_i >= 1) && (per_page.to_i >= 1)
           page = page.to_i
           per_page = per_page.to_i

Doing so and searching for tesco stores yields no results, despite them being found in Elasticsearch. This is because results are filtered post-search by the get_bulk function of the same RegisterSourcesBods::Repositories::BodsStatementRepository class, so here I overrode the index there too:

diff --git a/lib/register_sources_bods/repositories/bods_statement_repository.rb b/lib/register_sources_bods/repositories/bods_statement_repository.rb
index a474e88..ecf6a44 100755
--- a/lib/register_sources_bods/repositories/bods_statement_repository.rb
+++ b/lib/register_sources_bods/repositories/bods_statement_repository.rb
@@ -50,6 +50,9 @@ module RegisterSourcesBods
       end

       def get_bulk(statement_ids)
+        puts "b" * 80
+        index = 'bods_v2_sk_prod100'
+        puts index
         return [] unless statement_ids && !statement_ids.empty?

         process_results(

The post-search filter rather surprised me, as it seemed to result in multiple calls to the same indexes, when in fact the results were already present in the first query. However, I didn't look into the rationale for this.
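The filtering effect observed above can be sketched in miniature. The method names here are hypothetical stand-ins for the two-phase flow (search for hits, then re-fetch statements by ID), not the repository's actual implementation:

```ruby
# Hypothetical stand-in: extract statement IDs from search hits.
def search_statement_ids(docs)
  docs.map { |d| d[:statementID] }
end

# Hypothetical stand-in for the post-search bulk fetch. If it reads from
# a narrower index than the search did, hits found by the search are
# silently dropped here -- the behaviour observed in the experiment above.
def get_bulk(ids, store)
  return [] if ids.nil? || ids.empty?
  store.values_at(*ids).compact
end

store = { 's1' => { name: 'Tesco Stores Limited' } }   # pretend SK-only index
hits  = [{ statementID: 's1' }, { statementID: 's2' }] # s2 found in another index
results = get_bulk(search_statement_ids(hits), store)
puts results.length # only s1 survives the post-search filter
```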

Searching for tesco stores in this SK index yields 1 result and 1 row, as we would hope:

[Screenshot from 2023-07-03 16-34-53]

This validates that the SK index can be read without difficulty using the existing code.

However, since Elasticsearch also supports the wildcard format, we can in fact make this change instead:

diff --git a/lib/register_sources_bods/repositories/bods_statement_repository.rb b/lib/register_sources_bods/repositories/bods_statement_repository.rb
index a474e88..83900d8 100755
--- a/lib/register_sources_bods/repositories/bods_statement_repository.rb
+++ b/lib/register_sources_bods/repositories/bods_statement_repository.rb
@@ -50,6 +50,9 @@ module RegisterSourcesBods
       end

       def get_bulk(statement_ids)
+        puts "b" * 80
+        index = 'bods_v2_*_prod100'
+        puts index
         return [] unless statement_ids && !statement_ids.empty?

         process_results(
@@ -290,6 +293,10 @@ module RegisterSourcesBods
       end

       def search(query, aggs: nil, page: 1, per_page: 10)
+        puts "s" * 80
+        index = 'bods_v2_*_prod100'
+        puts index
+        puts query.to_json
         if (page.to_i >= 1) && (per_page.to_i >= 1)
           page = page.to_i
           per_page = per_page.to_i

Searching for tesco stores reports 9 results and shows 3 rows:

[Screenshot from 2023-07-03 16-38-18]

Defining an index pattern on bods_v2_*_prod100 within Kibana and using it to search:

name:"tesco stores" or names:{ fullName:"tesco stores" }

corresponds to Elasticsearch query:

    {
      "bool": {
        "must": [
          {
            "bool": {
              "should": [
                {
                  "bool": {
                    "should": [
                      { "match_phrase": { "name": "tesco stores" } }
                    ],
                    "minimum_should_match": 1
                  }
                },
                {
                  "nested": {
                    "path": "names",
                    "query": {
                      "bool": {
                        "should": [
                          { "match_phrase": { "names.fullName": "tesco stores" } }
                        ],
                        "minimum_should_match": 1
                      }
                    },
                    "score_mode": "none"
                  }
                }
              ],
              "minimum_should_match": 1
            }
          }
        ],
        "filter": [],
        "should": [],
        "must_not": []
      }
    }

which is almost the same query as register-v2 uses. Formatting the documents into columns shows:

[Screenshot from 2023-07-03 16-40-24]

Thus, the results found using the wildcard syntax appear to match the data within the multiple Elasticsearch indexes.

Considering this example, I would suggest that register-v2 (and possibly also register-sources-bods, if needed) be reviewed and updated to ensure that BODS_INDEX is safe to set to any index, with or without wildcards. register-v2 would then no longer be tied to a particular index, only querying indexes matching a particular pattern. To accomplish this, it would have to be ensured that nothing in register-v2 writes or manipulates data directly in the indexes. Existing importers populating the indexes, however, would continue writing to specific indexes as they do currently.
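As a sketch of what "safe to set to any index" could look like, reads could resolve BODS_INDEX (the env var named in this thread) with wildcards allowed, while writes refuse wildcard patterns. The guard logic below is illustrative only, under the assumption that the repositories are reviewed as suggested:

```ruby
# Read path: any index name or wildcard pattern is acceptable.
def read_index
  ENV.fetch('BODS_INDEX', 'bods_v2_*_prod100')
end

# Write path (importers): refuse wildcard patterns, so data is only ever
# written to a specific, concrete index.
def write_index(name)
  raise ArgumentError, "refusing to write to wildcard pattern: #{name}" if name.include?('*')
  name
end

puts read_index                          # e.g. bods_v2_*_prod100
puts write_index('bods_v2_sk_prod100')   # concrete index: allowed
begin
  write_index('bods_v2_*_prod100')       # wildcard: rejected
rescue ArgumentError => e
  puts e.message
end
```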

If the end goal is to be able to search across multiple datasources at once, this should be sufficient. If, however, the goal is to do something more than this, that will have to be considered additionally.

It may be useful to consider whether Elasticsearch index templates might be of use to us, in the future, particularly with regards to indexing. However, I consider this out-of-scope for this task.

StephenAbbott commented 1 year ago

Thank you for the detailed explanation, @tiredpixel. You're correct that the end goal is to search across multiple datasources at once.

How might using Elasticsearch index templates help in future?

tiredpixel commented 1 year ago

You're welcome, @StephenAbbott. In that case, it looks like we can proceed with https://github.com/openownership/register-v2/pull/11 , once @spacesnottabs has confirmed whether the errors are related. Then we will need to apply the config change to Heroku and restart the app.

Elasticsearch indexes (or indices, as is the Elastic-accepted plural) can be schema-less, similar to other document stores such as MongoDB. But Elasticsearch indexes are not really the same as MongoDB indexes; in many ways, they're more similar to MongoDB collections. In MongoDB, collections cannot be efficiently searched without additional indexes (roughly speaking). But in Elasticsearch, the documents can be efficiently searched by default, because of dynamic mappings. To more efficiently describe the schema of Elasticsearch documents within an index, these mappings can be declared explicitly. This is what happens in RegisterSourcesPsc::Services::EsIndexCreator#create_es_index, for example. But these mappings are currently being stored on each index.

This is fine, but since the schemas are the same between different datasources (i.e. PSC, DK, SK), those mappings are also the same (this is what makes the use of BODS_INDEX with a wildcard possible).

In Elasticsearch, index templates can be used to group common metadata about multiple indexes. Similar to the setting of BODS_INDEX with a wildcard, an index pattern can be defined (e.g. bods_v2_*_prod100, or even bods_v2_*, since the schema is the same across not only multiple datasources but also multiple import versions), and then additional properties declared. These are inherited automatically by all matching indexes (more complex inheritance logic can be resolved through a precedence system).

So, instead of declaring mappings each time an index is declared, it's possible to declare them only once (e.g. for BODS v2), and then simply import the documents into their respective indexes. These could be not only for multiple versions, but also from multiple datasources; it doesn't matter to Elasticsearch, all of them would have the mappings present automatically, and no per-index maintenance or preparation would be needed. It also has the nice effect of standardising the schema for that particular set of indexes, and making it clear without having to check for drift in schemas (although note that such drift wouldn't be an error anyway from Elasticsearch's view, similar to in MongoDB).
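As a concrete sketch, an index template body could look roughly like the following. The template name, pattern, and mapping fields are illustrative (loosely following the statement fields queried in this thread), not a definitive BODS v2 mapping:

```ruby
require 'json'

# Illustrative index template: any index created later whose name matches
# index_patterns inherits these mappings automatically, so no per-index
# mapping setup is needed for new datasources or import versions.
template = {
  index_patterns: ['bods_v2_*'],
  template: {
    mappings: {
      properties: {
        statementID:   { type: 'keyword' },
        statementType: { type: 'keyword' },
        name:          { type: 'text' },
        names: {
          type: 'nested',
          properties: { fullName: { type: 'text' } }
        }
      }
    }
  }
}

puts JSON.pretty_generate(template)
# With elasticsearch-ruby, this would be registered roughly as:
#   client.indices.put_index_template(name: 'bods_v2', body: template)
```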

tiredpixel commented 1 year ago

@StephenAbbott, what is the status of this ticket? Has it passed testing? Is it able to be marked as completed?

tiredpixel commented 1 year ago

@StephenAbbott, @spacesnottabs, if I understand correctly, the related fixes were made as part of the v2 release. Is it now possible to verify the final testing for this ticket? Thanks.

StephenAbbott commented 1 year ago

@tiredpixel There's an associated issue - https://github.com/openownership/register/issues/181 - which came up through testing and which needs to be worked on before we can close this ticket.

tiredpixel commented 8 months ago

Just noting that QC of this is still blocked by #181 , as far as I'm aware.