Open timgdavies opened 7 years ago
We should presumably allow original research as provenance. This might itself reference other sources/provenances without modelling them all.
For example, someone might write a paper summarising their research about Mrs Smith's corporate interests. This might include a novel analytic approach based on signals in tweets, or something.
In that case perhaps we should mandate or offer the use of DOIs as these "guarantee" that a publication cannot be changed after it has been published.
Just a hand-wavy thought.
@timgdavies Can you share your evaluation / analysis of PROV-O? I wonder if maybe there is a way to use it as 'PROV-O-lite'. In JSON-LD representations, it's often possible to reduce the complexity by having a lot of blank nodes (... not to get too far into implementation details).
In terms of other models for provenance, I remember OpenCorporates having a nice approach, but I couldn't immediately find the documentation. @sebbacon ?
A standard that might be worth looking at here is the Carnegie Museum of Art Digital Provenance Standard (and their GitHub). It is relatively simple (compared to PROV) but allows complex statements and looks like it handles uncertainty well. It would also probably be a good match for the original research case, @sebbacon, as it allows provenance to run along the continuum from all-structured data to a free-text description.
There is a lightweight version of PROV called PAV that may also be a good fit.
Based on the last working group call, the approach taken in Open Ownership, and the goal of keeping things simple until we have reason for more complexity, I've removed `provenanceStatements` and added a simpler `Source` object which can attach at multiple levels.
This is documented on the master branch here.
This leaves some issues to be firmed up in future, around how 'types' of source should be defined in codelists, and how the identifier of the parties referenced as making source assertions should be represented (we suggest a URI right now), but I think those will belong in separate issues.
Re-opening this because I think there are grounds for a rethink based on #45.
Adjusting the provenance model may help us get around the competing use cases of users who want correct data versus those who want original data. Our current model of provenance is relatively simple, tying a source and an asserting party to a statement about an ownership relationship, a legal entity or a natural person. One option is to move further towards the subtleties of the PROV data model, and allow statements that are derived from, or revisions of, the original documentation.
I think this goes back to @jpmckinney's point about 'PROV-O lite'. Essentially, the use case I see is that a third party can publish a `jurisdiction` field that says `GB` and then point a `derivedFrom` field back to a statement based on an original field that says `england`.
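A minimal sketch of that use case in Python, assuming each statement carries an `id` and that `derivedFrom` holds the id of the statement it was derived from. The ids and the `follow_derivation` helper are hypothetical; only `jurisdiction` and `derivedFrom` come from the discussion above.

```python
# Hypothetical statements keyed by id; only `jurisdiction` and `derivedFrom`
# are taken from the discussion, everything else is illustrative.
statements = {
    "original-1": {"id": "original-1", "jurisdiction": "england"},
    "derived-1": {
        "id": "derived-1",
        "jurisdiction": "GB",
        "derivedFrom": "original-1",
    },
}


def follow_derivation(statement_id, statements):
    """Return the chain of statements from the given one back to its origin."""
    chain = []
    seen = set()
    current = statement_id
    while current is not None:
        if current in seen:
            raise ValueError(f"derivation cycle at {current!r}")
        seen.add(current)
        statement = statements[current]
        chain.append(statement)
        # A statement with no `derivedFrom` is treated as the original.
        current = statement.get("derivedFrom")
    return chain


chain = follow_derivation("derived-1", statements)
print([s["jurisdiction"] for s in chain])
```

Under this sketch, a user who wants corrected data reads the first element of the chain, and a user who wants the original data reads the last.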
From discussion with Jits today - we might want to look at an annotation based approach.
The W3C Annotations model works using the idea of annotations, bodies and targets, such that you might have a collection of annotations pointing at a particular identified BODS statement and at particular points within that statement.
A simple example of annotation of one of our examples (using a JSONPath Syntax instead of XPATH) would be:
{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "id": "http://example.org/anno5",
  "type": "Annotation",
  "body": {
    "type": "TextualBody",
    "value": "Extracted from Companies House",
    "format": "text/html",
    "language": "en"
  },
  "target": {
    "source": "https://raw.githubusercontent.com/openownership/data-standard/master/examples/1-single-direct.json",
    "selector": "$.statementGroups[?(@.id=='9b57a603-ffdb-413c-ab8d-50a8325820c8')].beneficialOwnershipStatements[?(@.id=='c4bbe7a6-ec59-43dd-8ef8-b6ec954a11b0')].entity"
  }
}
This looks like it would get quite verbose for our needs, and the JSON paths are not very readable. It also does not offer keywords for provenance, though we could borrow these from PROV-O.
One option might be to create an `annotations` keyword, which can contain an object matching the structure of a BODS document, but turning each property into an object which can contain provenance and annotation-related keywords.
This could work something like the below.
{
  "statementGroups": [
    {
      "id": "9b57a603-ffdb-413c-ab8d-50a8325820c8",
      "beneficialOwnershipStatements": [
        {
          "id": "c4bbe7a6-ec59-43dd-8ef8-b6ec954a11b0",
          "statementDate": "2017-03-25",
          "entity": {
            "id": "3af298b8-a8db-48a6-b95c-f60bc8473dae",
            "statementDate": "2017-03-25",
            "type": "registeredEntity",
            "name": "CHRINON LTD",
            "foundingDate": "2010-11-18",
            "identifiers": [
              {
                "scheme": "GB-COH",
                "id": "07444723"
              }
            ],
            "addresses": [
              {
                "type": "registered",
                "address": "Aston House, Cornwall Avenue, London, N3 1LF",
                "country": "GB",
                "postCode": "N3 1LF"
              }
            ],
            "jurisdiction": "GB",
            "uri": "https://beta.companieshouse.gov.uk/company/07444723"
          },
          "annotations": [
            {
              "provenance": {
                "a_sourceType": "officialRegister",
                "a_description": "Downloaded from Company Register"
              },
              "entity": {
                "identifiers": [
                  {
                    "a_sourceType": "officialRegister",
                    "a_description": "Identifiers have been modified"
                  }
                ],
                "addresses": [
                  {
                    "country": {
                      "a_originalValue": "Great Britain",
                      "a_description": "Normalised using a lookup table"
                    }
                  }
                ]
              }
            }
          ]
        }
      ]
    }
  ]
}
Alternatively, annotations could be nested in the data at each level.
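To see how a consumer might read the mirrored `annotations` object (the first of the two options), here is a rough Python sketch. It assumes that keys starting with `a_` are annotation keywords, that every other key mirrors a property of the annotated statement, and that array items line up by position with the data they annotate. The `flatten_annotations` helper is hypothetical.

```python
def flatten_annotations(node, path=""):
    """Yield (path, keywords) pairs from a mirrored annotations object.

    Assumes keys starting with "a_" are annotation keywords and every other
    key mirrors a property of the annotated statement.
    """
    if isinstance(node, list):
        for index, item in enumerate(node):
            yield from flatten_annotations(item, f"{path}[{index}]")
    elif isinstance(node, dict):
        keywords = {k: v for k, v in node.items() if k.startswith("a_")}
        if keywords:
            yield path, keywords
        for key, value in node.items():
            if not key.startswith("a_"):
                child = f"{path}.{key}" if path else key
                yield from flatten_annotations(value, child)


# A trimmed copy of one item from the `annotations` array in the example.
annotation = {
    "provenance": {
        "a_sourceType": "officialRegister",
        "a_description": "Downloaded from Company Register",
    },
    "entity": {
        "addresses": [
            {
                "country": {
                    "a_originalValue": "Great Britain",
                    "a_description": "Normalised using a lookup table",
                }
            }
        ]
    },
}

flat = dict(flatten_annotations(annotation))
for path, keywords in flat.items():
    print(path, keywords)
```

The resulting paths, such as `entity.addresses[0].country`, are shorter than the JSONPath selectors needed for the W3C approach because they are relative to the statement that holds the annotation.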
From #118, it might be worth considering the temporal aspects of sources in the `Source`, as a flag to users that "what is published here may not be the latest version, go and look there". This would apply if we want to flag to people the difference between sources that are explicitly point-in-time (due diligence done on submission) and sources that are updated independently of a BO system (a PEP database that politicians are required to declare to on an annual basis).
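One way such a flag could work, sketched in Python. The `retrievedAt` and `updatedIndependently` field names, the `staleness_note` helper and the `selfDeclaration` type value are all hypothetical and are not part of the published `Source` object.

```python
from datetime import date


def staleness_note(source, today):
    """Return a warning for users, or None if the source is point-in-time.

    `retrievedAt` and `updatedIndependently` are hypothetical field names,
    not part of the published Source object.
    """
    if not source.get("updatedIndependently"):
        return None
    retrieved = date.fromisoformat(source["retrievedAt"])
    age_days = (today - retrieved).days
    return (
        f"Retrieved {age_days} days ago from a source that is updated "
        "independently; check the source for the latest version."
    )


# Point-in-time source: due diligence done on submission.
due_diligence = {"type": "selfDeclaration", "retrievedAt": "2017-03-25"}
# Independently updated source: for example, an annually declared PEP database.
pep_database = {
    "type": "officialRegister",
    "retrievedAt": "2017-03-25",
    "updatedIndependently": True,
}

print(staleness_note(due_diligence, date(2017, 4, 24)))
print(staleness_note(pep_database, date(2017, 4, 24)))
```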
From #179, transliteration / translation of names is something we need to settle on for 0.3. I think looping back to PROV-O would be useful here.
We have looked at the PROV provenance ontology but fear this may be too complex for our use-cases.
We are looking for good precedents for simple provenance information and will need to develop good guidance on writing and reading provenance chains.
As part of this, we are also exploring issues of verification. What sort of statements should the schema allow about the way in which other information has been verified or certified?