openownership / register

A demonstration transnational register of beneficial ownership data from the UK, Denmark, Slovakia and Armenia
https://register.openownership.org
GNU Affero General Public License v3.0
18 stars 3 forks source link

Unreconciled entities when OpenCorporates ID case mismatched #189

Open tiredpixel opened 1 year ago

tiredpixel commented 1 year ago

There are two entities for brewdog plc:

These are expected to be reconciled into a single statement.

@StephenAbbott suggests that this might be because the OpenCorporates ID case is mismatched. This appears to be correct.

Looking into the Elasticsearch documents show that these entities correspond to:

However, https://opencorporates.com/companies/gb/Sc311560 redirects to https://opencorporates.com/companies/gb/SC311560 , confirming the expectation that the OpenCorporates IDs are not case-sensitive. This is supported by the fact that UK Scottish company numbers are not case-sensitive.

Fix by capitalising company numbers used to construct OpenCorporates IDs and URIs. It is likely acceptable to do this for all OpenCorporates IDs, not just UK Scottish ones.

tiredpixel commented 1 year ago

@spacesnottabs, I'm unsure about whether the linked PRs are sufficient, and whether it will be necessary to write a migration for existing Elasticsearch data. What do you think?

I'm also not sure about whether register sources-bods also has to be updated. I think not?

tiredpixel commented 1 year ago

@spacesnottabs, I've merged https://github.com/openownership/register-sources-bods/pull/36 , and updating the locks has been taken care of elsewhere. I'd still value your opinion on whether you think that that patch will be sufficient, etc. Thanks!

tiredpixel commented 1 year ago

I caught up with @spacesnottabs about this today. The conclusion is that the linked PRs are not sufficient. Also needing to be done is:

Estimate: 6h

tiredpixel commented 12 months ago

migrate existing BODS data in Elasticsearch

With the latest extensions and fixes, the documents themselves in the BODS indexes in Elasticsearch appear to have migrated successfully. The SC vs Sc issue now appears to be fixed. However, they still haven't been merged together. Hopefully, this will be sorted in the next reconciliation. Otherwise, more investigation will be needed.

consider existing raw data

A brief look at SC311560 indicates that this issue doesn't appear to be present in the raw records themselves.

consider response coming back directly from OpenCorporates API in Register

A brief look at SC311560 indicates that this issue doesn't appear to be present in the API response coming back from OpenCorporates.

tiredpixel commented 11 months ago

Just recording that after discussion with @StephenAbbott, the decision was made to wait until the next monthly bulk data import before verifying whether this issue is fully fixed.

tiredpixel commented 11 months ago

The monthly bulk data import hasn't fixed this. So further work is needed to investigate what can be done to ensure the statements get combined properly.

tiredpixel commented 11 months ago

Migration M20131205Resolve was created, to allow subsets of statements to be forced to resolve freshly. This was run for GB SC311560, and although statements were updated, the still weren't ultimately merged. This is despite ending up with the same name, same address, same OpenCorporates ID and case.

Another issue found along the way was with jurisdiction code, which suddenly flipped from gb to GB to one record after being resolved against OpenCorporates. This specific additional issue has been fixed. No occurrences of this issue in the rest of the data was found, so it must have been related to just those statements when the latest resolve was run.

However, I think I might have found out what's going on. It's pretty squirrelly, if so. My new theory is (considering GB SC311560):

If this theory is correct, the potential consequences are vast. Although migration could initially be scoped to ensure only certain affected statements are republished, the change to statement ID construction could potentially lead to almost every non-replaced statement getting republished.

Despite this, the following course of action suggests itself (initially scoped just to GB SC311560 for safety and in an attempt to test the theory):

During or by the end of that, the situation should be much clearer.

tiredpixel commented 11 months ago

I'm continuing to test this theory, but already it's looking like it's not quite correct.

After much journeying down the rabbit holes, I think this line is key:

https://github.com/openownership/register-sources-bods/blob/main/lib/register_sources_bods/services/pending_records.rb#L81

I'm still trying to work out how, though, and whether the multiple Register IDs indeed need to be added to the grouping array.

tiredpixel commented 11 months ago

I think I've found another issue. replacesStatements is excluded from the list of fields used to generate statementID. At first glance, this would seem to make sense, since it is a form of metadata about the statement, like publicationDetails. However, this seems to prevent the replaces statements logic from running in cases where statements are identical.

What to do about this isn't clear, however, since including replacesStatements would lead to constant republishing, and break all sorts of existing logic.

So this raises the question: Assuming two statements are identical, even with respect to identifiers (not the case for GB SC311560 because of Register IDs being different—but assume everything is the same), can those statements ever be merged with existing code? How could a new Statement C be published to replace both Statement B and Statement A after the interesting details having been merged, if the statement ID generated for all 3 would be the same (since they differ only in fields ignored for the hash)? The answer to this isn't clear to me, at present.

https://github.com/openownership/register-sources-bods/blob/main/lib/register_sources_bods/id_generators/entity_statement.rb#L17 https://github.com/openownership/register-sources-bods/blob/main/lib/register_sources_bods/id_generators/person_statement.rb#L17

tiredpixel commented 11 months ago

Consider this specific example:

{
  "statementID": "13424564937546064062",
  "statementType": "entityStatement",
  "statementDate": "2023-06-19",
  "isComponent": false,
  "entityType": "registeredEntity",
  "name": "BREWDOG PLC",
  "incorporatedInJurisdiction": {
    "name": "United Kingdom of Great Britain and Northern Ireland",
    "code": "GB"
  },
  "identifiers": [
    {
      "id": "SC311560",
      "scheme": "GB-COH",
      "schemeName": "Companies House"
    },
    {
      "id": "213800DAEV1T2UOHJE09",
      "scheme": "XI-LEI",
      "schemeName": "Global Legal Entity Identifier Index",
      "uri": "https://search.gleif.org/#/record/213800DAEV1T2UOHJE09"
    },
    {
      "id": "https://opencorporates.com/companies/gb/SC311560",
      "schemeName": "OpenCorporates",
      "uri": "https://opencorporates.com/companies/gb/SC311560"
    },
    {
      "id": "/entities/5457047613357657206",
      "schemeName": "OpenOwnership Register",
      "uri": "/entities/5457047613357657206"
    }
  ],
  "foundingDate": "2006-11-07",
  "addresses": [
    {
      "type": "registered",
      "address": "Brewdog, Balmacassie Commercial Park, Ellon, Aberdeenshire, AB41 8BX",
      "country": "GB"
    }
  ],
  "replacesStatements": [
    "8048965460103644359"
  ],
  "publicationDetails": {
    "publicationDate": "2023-09-09",
    "bodsVersion": "0.2",
    "license": "https://register.openownership.org/terms-and-conditions",
    "publisher": {
      "name": "OpenOwnership Register",
      "url": "https://register.openownership.org"
    }
  }
}
{
  "statementID": "1184629158496398812",
  "statementType": "entityStatement",
  "statementDate": "2023-06-19",
  "isComponent": false,
  "entityType": "registeredEntity",
  "name": "BREWDOG PLC",
  "incorporatedInJurisdiction": {
    "name": "United Kingdom of Great Britain and Northern Ireland",
    "code": "GB"
  },
  "identifiers": [
    {
      "id": "SC311560",
      "scheme": "GB-COH",
      "schemeName": "Companies House"
    },
    {
      "id": "213800DAEV1T2UOHJE09",
      "scheme": "XI-LEI",
      "schemeName": "Global Legal Entity Identifier Index",
      "uri": "https://search.gleif.org/#/record/213800DAEV1T2UOHJE09"
    },
    {
      "id": "https://opencorporates.com/companies/gb/SC311560",
      "schemeName": "OpenCorporates",
      "uri": "https://opencorporates.com/companies/gb/SC311560"
    },
    {
      "id": "/entities/5457047613357657206",
      "schemeName": "OpenOwnership Register",
      "uri": "/entities/5457047613357657206"
    },
    {
      "id": "/entities/4505511616172734357",
      "schemeName": "OpenOwnership Register",
      "uri": "/entities/4505511616172734357"
    }
  ],
  "foundingDate": "2006-11-07",
  "addresses": [
    {
      "type": "registered",
      "address": "Brewdog, Balmacassie Commercial Park, Ellon, Aberdeenshire, AB41 8BX",
      "country": "GB"
    }
  ],
  "replacesStatements": [
    "10587309494827194305"
  ],
  "publicationDetails": {
    "publicationDate": "2023-12-05",
    "bodsVersion": "0.2",
    "license": "https://register.openownership.org/terms-and-conditions",
    "publisher": {
      "name": "OpenOwnership Register",
      "url": "https://register.openownership.org"
    }
  }
}

These are two separate chains for GB SC311560. We wish them to be joined, somehow.

Here is the diff:

2c2
<   "statementID": "13424564937546064062",
---
>   "statementID": "1184629158496398812",
32a33,37
>     },
>     {
>       "id": "/entities/4505511616172734357",
>       "schemeName": "OpenOwnership Register",
>       "uri": "/entities/4505511616172734357"
44c49
<     "8048965460103644359"
---
>     "10587309494827194305"
47c52
<     "publicationDate": "2023-09-09",
---
>     "publicationDate": "2023-12-05",

That is, the only differences between these two unreplaced statements for the same entity are:

  1. a different statementID
  2. an extra Register ID in identifiers
  3. a different replacesStatements
  4. a different publicationDetails.publicationDate

(1) is obvious; these are different statements. (3) is how it connects to each separate chain. (4) doesn't matter, since it is ignored when generating a statement ID—and rightly so, because otherwise each time a new statement were published, it would update the date, and that would cause another statement to be published, ad infinitum.

(2) is the actual difference. Why it's different, is another story. But let's assume that we add this extra Register ID to 13424564937546064062. The fact that it's a Register ID doesn't matter at all, here; it would be the same story for an LEI or some other identifier. So, the additional identifier is added, which means a new statement needs to be published, with replacesStatements: [13424564937546064062], with the latest identifiers included.

But how to create a new statementID? The only difference for the hash will be the new identifiers. That creates a different statementID to be sure, so this would be possible were 13424564937546064062 the only statement. But in fact we already have 1184629158496398812, which already contains this additional identifier. So this new statement and the other chain's statement will have the same ID—which is not only ignored on republishing in Register, but in fact against BODS 0.2, since statements are only published once, and must have a new ID if there's a new statement.

We might think we can get around this somehow, such as by creating a new statement to merge both chains, containing the new identifiers. Good, something like this should be possible (even though I'm not sure whether there's existing logic in Register to handle this properly). But here we run into a similar problem: we would have all the identifiers, and publish replacesStatements: [13424564937546064062, 1184629158496398812]. But again since replacesStatements is excluded from the hash when statement IDs are generated, the hash will be the same as that for 1184629158496398812. Since that exists already, again it's not possible to publish that new statement.