Open tiredpixel opened 1 year ago
@spacesnottabs, I'm unsure about whether the linked PRs are sufficient, and whether it will be necessary to write a migration for existing Elasticsearch data. What do you think?
I'm also not sure about whether register sources-bods also has to be updated. I think not?
@spacesnottabs, I've merged https://github.com/openownership/register-sources-bods/pull/36 , and updating the locks has been taken care of elsewhere. I'd still value your opinion on whether you think that that patch will be sufficient, etc. Thanks!
I caught up with @spacesnottabs about this today. The conclusion is that the linked PRs are not sufficient. Also needing to be done is:
Estimate: 6h
With the latest extensions and fixes, the documents themselves in the BODS indexes in Elasticsearch appear to have migrated successfully. The SC
vs Sc
issue now appears to be fixed. However, they still haven't been merged together. Hopefully, this will be sorted in the next reconciliation. Otherwise, more investigation will be needed.
A brief look at SC311560
indicates that this issue doesn't appear to be present in the raw records themselves.
A brief look at SC311560
indicates that this issue doesn't appear to be present in the API response coming back from OpenCorporates.
Just recording that after discussion with @StephenAbbott, the decision was made to wait until the next monthly bulk data import before verifying whether this issue is fully fixed.
The monthly bulk data import hasn't fixed this. So further work is needed to investigate what can be done to ensure the statements get combined properly.
Migration M20131205Resolve
was created, to allow subsets of statements to be forced to resolve freshly. This was run for GB SC311560, and although statements were updated, the still weren't ultimately merged. This is despite ending up with the same name, same address, same OpenCorporates ID and case.
Another issue found along the way was with jurisdiction code, which suddenly flipped from gb
to GB
to one record after being resolved against OpenCorporates. This specific additional issue has been fixed. No occurrences of this issue in the rest of the data was found, so it must have been related to just those statements when the latest resolve was run.
However, I think I might have found out what's going on. It's pretty squirrelly, if so. My new theory is (considering GB SC311560):
RegisterSourcesBods::IdGenerators
do not exclude Register identifiers when calculating Entity or Person statements IDs. This is in contrast to statement IDs themselves, statement dates, publication details, etc.If this theory is correct, the potential consequences are vast. Although migration could initially be scoped to ensure only certain affected statements are republished, the change to statement ID construction could potentially lead to almost every non-replaced statement getting republished.
Despite this, the following course of action suggests itself (initially scoped just to GB SC311560 for safety and in an attempt to test the theory):
RegisterSourcesBods::IdGenerators
EntityStatement
to exclude Register IDsRegisterSourcesBods::IdGenerators
PersonStatement
to exclude Register IDsM20131205Resolve
migration to re-resolve select statementsDuring or by the end of that, the situation should be much clearer.
I'm continuing to test this theory, but already it's looking like it's not quite correct.
RegisterSourcesBods::Services::PendingRecords
process
.After much journeying down the rabbit holes, I think this line is key:
I'm still trying to work out how, though, and whether the multiple Register IDs indeed need to be added to the grouping array.
I think I've found another issue. replacesStatements
is excluded from the list of fields used to generate statementID
. At first glance, this would seem to make sense, since it is a form of metadata about the statement, like publicationDetails
. However, this seems to prevent the replaces statements logic from running in cases where statements are identical.
What to do about this isn't clear, however, since including replacesStatements
would lead to constant republishing, and break all sorts of existing logic.
So this raises the question: Assuming two statements are identical, even with respect to identifiers (not the case for GB SC311560 because of Register IDs being different—but assume everything is the same), can those statements ever be merged with existing code? How could a new Statement C be published to replace both Statement B and Statement A after the interesting details having been merged, if the statement ID generated for all 3 would be the same (since they differ only in fields ignored for the hash)? The answer to this isn't clear to me, at present.
https://github.com/openownership/register-sources-bods/blob/main/lib/register_sources_bods/id_generators/entity_statement.rb#L17 https://github.com/openownership/register-sources-bods/blob/main/lib/register_sources_bods/id_generators/person_statement.rb#L17
Consider this specific example:
{
"statementID": "13424564937546064062",
"statementType": "entityStatement",
"statementDate": "2023-06-19",
"isComponent": false,
"entityType": "registeredEntity",
"name": "BREWDOG PLC",
"incorporatedInJurisdiction": {
"name": "United Kingdom of Great Britain and Northern Ireland",
"code": "GB"
},
"identifiers": [
{
"id": "SC311560",
"scheme": "GB-COH",
"schemeName": "Companies House"
},
{
"id": "213800DAEV1T2UOHJE09",
"scheme": "XI-LEI",
"schemeName": "Global Legal Entity Identifier Index",
"uri": "https://search.gleif.org/#/record/213800DAEV1T2UOHJE09"
},
{
"id": "https://opencorporates.com/companies/gb/SC311560",
"schemeName": "OpenCorporates",
"uri": "https://opencorporates.com/companies/gb/SC311560"
},
{
"id": "/entities/5457047613357657206",
"schemeName": "OpenOwnership Register",
"uri": "/entities/5457047613357657206"
}
],
"foundingDate": "2006-11-07",
"addresses": [
{
"type": "registered",
"address": "Brewdog, Balmacassie Commercial Park, Ellon, Aberdeenshire, AB41 8BX",
"country": "GB"
}
],
"replacesStatements": [
"8048965460103644359"
],
"publicationDetails": {
"publicationDate": "2023-09-09",
"bodsVersion": "0.2",
"license": "https://register.openownership.org/terms-and-conditions",
"publisher": {
"name": "OpenOwnership Register",
"url": "https://register.openownership.org"
}
}
}
{
"statementID": "1184629158496398812",
"statementType": "entityStatement",
"statementDate": "2023-06-19",
"isComponent": false,
"entityType": "registeredEntity",
"name": "BREWDOG PLC",
"incorporatedInJurisdiction": {
"name": "United Kingdom of Great Britain and Northern Ireland",
"code": "GB"
},
"identifiers": [
{
"id": "SC311560",
"scheme": "GB-COH",
"schemeName": "Companies House"
},
{
"id": "213800DAEV1T2UOHJE09",
"scheme": "XI-LEI",
"schemeName": "Global Legal Entity Identifier Index",
"uri": "https://search.gleif.org/#/record/213800DAEV1T2UOHJE09"
},
{
"id": "https://opencorporates.com/companies/gb/SC311560",
"schemeName": "OpenCorporates",
"uri": "https://opencorporates.com/companies/gb/SC311560"
},
{
"id": "/entities/5457047613357657206",
"schemeName": "OpenOwnership Register",
"uri": "/entities/5457047613357657206"
},
{
"id": "/entities/4505511616172734357",
"schemeName": "OpenOwnership Register",
"uri": "/entities/4505511616172734357"
}
],
"foundingDate": "2006-11-07",
"addresses": [
{
"type": "registered",
"address": "Brewdog, Balmacassie Commercial Park, Ellon, Aberdeenshire, AB41 8BX",
"country": "GB"
}
],
"replacesStatements": [
"10587309494827194305"
],
"publicationDetails": {
"publicationDate": "2023-12-05",
"bodsVersion": "0.2",
"license": "https://register.openownership.org/terms-and-conditions",
"publisher": {
"name": "OpenOwnership Register",
"url": "https://register.openownership.org"
}
}
}
These are two separate chains for GB SC311560. We wish them to be joined, somehow.
Here is the diff:
2c2
< "statementID": "13424564937546064062",
---
> "statementID": "1184629158496398812",
32a33,37
> },
> {
> "id": "/entities/4505511616172734357",
> "schemeName": "OpenOwnership Register",
> "uri": "/entities/4505511616172734357"
44c49
< "8048965460103644359"
---
> "10587309494827194305"
47c52
< "publicationDate": "2023-09-09",
---
> "publicationDate": "2023-12-05",
That is, the only differences between these two unreplaced statements for the same entity are:
statementID
identifiers
replacesStatements
publicationDetails.publicationDate
(1) is obvious; these are different statements. (3) is how it connects to each separate chain. (4) doesn't matter, since it is ignored when generating a statement ID—and rightly so, because otherwise each time a new statement were published, it would update the date, and that would cause another statement to be published, ad infinitum.
(2) is the actual difference. Why it's different, is another story. But let's assume that we add this extra Register ID to 13424564937546064062
. The fact that it's a Register ID doesn't matter at all, here; it would be the same story for an LEI or some other identifier. So, the additional identifier is added, which means a new statement needs to be published, with replacesStatements: [13424564937546064062]
, with the latest identifiers included.
But how to create a new statementID
? The only difference for the hash will be the new identifiers
. That creates a different statementID
to be sure, so this would be possible were 13424564937546064062
the only statement. But in fact we already have 1184629158496398812
, which already contains this additional identifier. So this new statement and the other chain's statement will have the same ID—which is not only ignored on republishing in Register, but in fact against BODS 0.2, since statements are only published once, and must have a new ID if there's a new statement.
We might think we can get around this somehow, such as by creating a new statement to merge both chains, containing the new identifiers. Good, something like this should be possible (even though I'm not sure whether there's existing logic in Register to handle this properly). But here we run into a similar problem: we would have all the identifiers, and publish replacesStatements: [13424564937546064062, 1184629158496398812]
. But again since replacesStatements
is excluded from the hash when statement IDs are generated, the hash will be the same as that for 1184629158496398812
. Since that exists already, again it's not possible to publish that new statement.
There are two entities for
brewdog plc
:These are expected to be reconciled into a single statement.
@StephenAbbott suggests that this might be because the OpenCorporates ID case is mismatched. This appears to be correct.
Looking into the Elasticsearch documents show that these entities correspond to:
5457047613357657206
statementID=8048965460103644359
identifiers|schemeName=OpenCorporates
id=https://opencorporates.com/companies/gb/SC311560
4505511616172734357
statementID=4021538957744985257
identifiers|schemeName=OpenCorporates
id=https://opencorporates.com/companies/gb/Sc311560
However, https://opencorporates.com/companies/gb/Sc311560 redirects to https://opencorporates.com/companies/gb/SC311560 , confirming the expectation that the OpenCorporates IDs are not case-sensitive. This is supported by the fact that UK Scottish company numbers are not case-sensitive.
Fix by capitalising company numbers used to construct OpenCorporates IDs and URIs. It is likely acceptable to do this for all OpenCorporates IDs, not just UK Scottish ones.