Closed jpmckinney closed 5 years ago
This is a good and important discussion.
To weigh up, there are also risks of not having process level data in the project schema, in that:
Publishers are then only able to provide contracting process information if they implement both project level schema and OCDS, which makes adoption more challenging;
If we ask users who are looking for 'simple' values (like contractValue) to use the OCDS data, we are placing the burden of working out which value is relevant on the user, instead of on the data producer;
In a distributed publishing approach, systems may not always want to 100% trust incoming OCDS data, and may have reasons to have some review process between reading in OCDS data, and updating the project level data.
Whilst in the case of 'perfect data' the project level data may be redundant, given experience of real-world publication practices, I don't think it will be.
However, I think we will need to do more in implementation guidance to make sure this doesn't cause confusion, and to mitigate the risks that it creates ambiguity for users and publishers. For this, I had anticipated we would be most likely to explore your option 2 above, in providing worked examples (and possibly code) to show how, in cases of good contracting data, it should be possible to populate the contractingProcess level data from OCDS.
The ContractingProcess definition is not so far off from being aligned with release-schema.json. For example, if some fields were scoped within a tender object or within a single-item awards or contracts array, then there would already be more alignment between the two. I don't think having these objects and arrays makes implementation and use too onerous on implementers or users. Essentially, my first, unnumbered proposal (and/or option 3) is to replace ContractingProcess with a cut-down compiled release. I don't think that proposal contributes to the risks you list.
I'm not sure I see the grounds for your assertions that this would not make implementation and use too onerous, nor that the suggested approach doesn't contribute to the risks noted.
Your proposal appears to place the burden on interpreting 'summary values' onto the user rather than publisher of the data - which, from my experience of implementation - is something we want to avoid.
It may help to think about ContractingProcess as a summary of detailed broken down releases - and to note that this exists because the CoST IDS (and its associated business processes) generally operate at this level of summary, rather than at the level of disaggregated data. If we only offer the option of using releases, and ask people to create 'simulated' releases in cases where they have not collected disaggregated data - we are asking people to create fictional disaggregated data, rather than just to express the summary data they have, which I think risks more confusion than having the extra level of data and data model.
At @duncandewhurst's request I've done some more detailed exploration of modelling options here.
@duncandewhurst has raised that one method of alignment between ContractingProcess and an OCDS release, to avoid similar-but-different data structures, would be to follow the structure of the release wherever a ContractingProcess summary field would be derived from a particular field in OCDS releases.
For example, instead of the current structure which is relatively flat:
"contractingProcesses":[{
"procurementMethod":"",
"procurementMethodDetails":""
}]
we would have:
"contractingProcesses":[{
"tender":{
"procurementMethod":"",
"procurementMethodDetails":""
}
}]
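As a rough sketch of what that restructuring involves, the flat fields can be moved under a tender object mechanically. The helper name nest_tender_fields and the TENDER_FIELDS tuple are hypothetical, and the sketch assumes only the two fields shown above map onto tender:

```python
from typing import Any

# Assumption: only these two flat fields map cleanly onto an OCDS-style `tender` object.
TENDER_FIELDS = ("procurementMethod", "procurementMethodDetails")


def nest_tender_fields(process: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a flat contracting process with tender fields nested."""
    # Keep every field that is not a tender field at the top level.
    nested = {k: v for k, v in process.items() if k not in TENDER_FIELDS}
    # Collect the tender fields that are actually present.
    tender = {k: process[k] for k in TENDER_FIELDS if k in process}
    if tender:
        nested["tender"] = tender
    return nested


flat = {"procurementMethod": "open", "procurementMethodDetails": "Example details"}
print(nest_tender_fields(flat))
```

Fields without an obvious home in the release structure are left at the top level, which is where the harder cases discussed next come in.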
Whilst this works for some fields (particularly from tender) it is more complex for others. For example:
contractValue is currently defined as "The price agreed for the work covered by this contracting process. Where linked OCDS data is available this may be the sum of awards.value, or, if contract values are available, the sum of contracts.value" such that it would not be clear whether this should go into contractingProcess/0/awards/value or contractingProcess/0/contracts/value when applying such nesting.
costEstimate does not exist in OCDS, but instead exists in CoST IDS - and is a figure we relate to tender.value, although we accept it may also be entered manually. If we put this in tender/costEstimate we may give the impression that OCDS has a tender/costEstimate field.
@jpmckinney suggests above an alternative approach of "embedding/linking to a compiled release from ContractingProcess" instead of including fields in ContractingProcess. In this situation, the following fields, which can be partially derived from a compiled release, would be removed from ContractingProcess:
title - derived from release.title, tender.title, award.title or contract.title
procurementMethod - derived from tender.procurementMethod
procurementMethodDetails - derived from tender.procurementMethodDetails
costEstimate - derived from tender.value in a release where tender.status = 'planning'. (*1)
contractValue - derived from awards.value or contracts.value
finalValue - derived from contract.implementation.transactions (*2)
numberOfTenderers - derived from tender.numberOfTenderers or bids (*3)
contractPeriod - derived from tender.contractPeriod, awards.contractPeriod or contracts.period
procuringEntity - derived from parties array
administrativeEntity - derived from parties array (*4)
suppliers - derived from parties array and award.suppliers
For the starred items, I can see specific complexity in relying on extracting data from compiledRelease either due to the structure of a compiledRelease or experience of current data supply, namely:
(1): the costEstimate should be the tender.value from the point of planning, and if tender.value is revised during the tender process, costEstimate should not be updated. However, in a compiledRelease, an earlier tender.value becomes inaccessible, and is only available if a versioned release is also created, or by looking at all releases.
(2) and (3): we don't see widespread use of the transactions or bids features. In the case of transactions there may be cases where systems also want to fetch this data from a different finance system.
(4): administrativeEntity is not currently included in OCDS, and we are not aware of implementers capturing this.
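To illustrate point (1), here is a minimal sketch of recovering costEstimate by looking at all releases rather than the compiledRelease. The function name is hypothetical, and it assumes each release carries a date string in a consistent ISO 8601 format so that sorting as text is chronological:

```python
from typing import Any, Optional


def cost_estimate_from_releases(releases: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Return tender.value from the earliest release whose tender.status is 'planning'."""
    # Assumption: dates share one ISO 8601 format, so string order is chronological order.
    for release in sorted(releases, key=lambda r: r.get("date", "")):
        tender = release.get("tender") or {}
        if tender.get("status") == "planning" and "value" in tender:
            return tender["value"]
    # No planning-stage value was published, so the estimate has to come from elsewhere.
    return None


releases = [
    {"date": "2018-03-01T00:00:00Z", "tender": {"status": "active", "value": {"amount": 1200000, "currency": "USD"}}},
    {"date": "2018-01-15T00:00:00Z", "tender": {"status": "planning", "value": {"amount": 1000000, "currency": "USD"}}},
]
print(cost_estimate_from_releases(releases))
```

A compiled release built from the same two releases would only carry the later, revised tender.value, which is the loss of information described in (1).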
In all the other cases, we need to be aware that there will be cases where data from procurement systems may have (a) missing values; or (b) values that are incomplete or not 100% trustworthy.
The current design of ContractingProcess provides redundancy for the cases where either the data in a compiledRelease is missing, or is not trusted as an accurate summary, such that those curating the ContractingProcess summary want to over-ride it from some other source. It also places the decision making over where in the releases to extract data from onto the data producer, not the user.
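As an illustration of that redundancy, a producer-side derivation of contractValue might look something like the sketch below. The function names and the override parameter are hypothetical, and the sketch assumes all amounts share one currency:

```python
from typing import Any, Optional


def sum_values(items: list[dict[str, Any]]) -> Optional[float]:
    """Sum value.amount across items, or return None if no amounts are present."""
    amounts = [
        item["value"]["amount"]
        for item in items
        if (item.get("value") or {}).get("amount") is not None
    ]
    return sum(amounts) if amounts else None


def contract_value(compiled_release: dict[str, Any], override: Optional[float] = None) -> Optional[float]:
    """Prefer a curated override, then the sum of contracts.value, then the sum of awards.value."""
    # Assumption: currencies are uniform; a real implementation would need to check them.
    if override is not None:
        return override
    from_contracts = sum_values(compiled_release.get("contracts", []))
    if from_contracts is not None:
        return from_contracts
    return sum_values(compiled_release.get("awards", []))


compiled = {"awards": [{"value": {"amount": 100}}, {"value": {"amount": 50}}], "contracts": []}
print(contract_value(compiled))
print(contract_value(compiled, override=140))
```

The choice between contracts and awards, and whether to over-ride either, is made once by the producer here rather than by every user of the data.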
The alternatives to this I think are to:
What I've seen so far makes me discount the first option. The second feels to me more complex both for data publishers, and for data users, than having some redundancy in ContractingProcess.
I think the current model strikes the right balance when it comes to implementation. The schema includes guidance on how ContractingProcess fields may be derived from OCDS releases, but keeps this language at the level of MAY, rather than SHOULD, to ensure the project level data specification can be used in a wide range of implementation scenarios.
I will however do some work now on improving the language around ContractingProcess to make sure its status as a summary is clear.
My suggestion in https://github.com/open-contracting/infrastructure/issues/9#issuecomment-424138658 was the same as @duncandewhurst's regarding structure, so thank you for considering this option.
With respect to the proposal, I propose renaming ContractingProcess and contractingProcesses to reflect the semantics of being a summary, e.g. ContractingProcessSummary and contractProcessSummaries. Otherwise, there will be no indication within the data that these are intended to be interpreted by users as summaries.
With regards to specific comments:
partyRole.csv codelist.
finalValue is available in an extension, e.g. where transactions data is unavailable or unreliable. Similar derivable fields can be added via extensions for e.g. costEstimate.
numberOfTenderers in a summary, then presumably it can be provided as tender.numberOfTenderers in OCDS format (no need for widespread use of bids).
With regards to general comments on approach:
I don't understand in what sense any data here is 'fictional' or 'simulated'. Many implementers provide historical OCDS data. Wouldn't that also be fictional and simulated?
I also don't understand how making cosmetic changes to ContractingProcess, to make it align more with the release schema (wrapping groups of fields with "tender": { … }, "awards": [{ … }], and "contracts": [{ … }]), can change the data from summary data to disaggregated data. A compiled release is already a summary; it's a merger of releases.
Lastly, gaps and inaccuracies should be addressed, but I'm not sure that adding another source (which may have its own gaps and inaccuracies) is the best solution. A common complaint among users is about discrepancies across different sources, and the proposal adds more opportunities for these. For example, when a summary contradicts OCDS data, a user can't tell whether it's because the publisher is correcting an error (which, presumably, it was incapable of fixing at the source) or it's because the publisher made an error in preparing the summary. The proposal does place the decision of how to extract information from OCDS data onto the publisher, but the user is still left with the challenges of comparing the summary and handling any discrepancies. The user might not be any further in making sense of the jurisdiction's data as a whole – unless they simply trust one source and ignore the others. In terms of trust, though, I don't estimate the likelihood that a publisher 'gets it right' with a new dataset to be significantly higher than with existing datasets. In short, I think users are ultimately better served by publishers focusing on the improvement of one dataset in one format, rather than making corrections to reformatted data. That said, my other comments are independent of that opinion.
On contractingProcesses -> contractingProcessSummaries - I think the documentation is already very clear about what contractingProcesses should contain, and it is not true to say that it is only a summary found under here: it also provides access to releases with much more detail of the process level down.
I think you may be missing the important point here that not all implementers of OC for Infrastructure will have access to reliable incoming OCDS Data from which to populate their project level data, or there will be cases where that data is patchy because it comes from procurement systems over which they have limited control such that, for example:
ocid does not produce an accurate picture of a process; or
Whilst theoretically it may be possible to fix or extend the OCDS data, get more fields published within that data, and improve its coverage, practical experience shows that this (a) takes a long time; (b) requires political work to secure support for changes; (c) may be outside the control of the team wanting to undertake infrastructure monitoring.
In these cases then, I see two options:
It is (1) I was referring to as 'simulated' data. Whilst I can see this has some merits as involving fewer new fields: (a) the releases and merging model of OCDS is not as tried-and-tested as I would like it to be in order to have confidence in this working well, and being really well understood; (b) it feels like it would be much harder to explain and architect for in implementation terms.
The design is based on a grounded exploration of real world data, and discussions about the real-world needs of infrastructure monitors.
I also don't think it is the purpose of a data model to resolve discrepancies between data sources. These need to be addressed at the level of business process and data quality management, not at the level of the data model. It is important to recognise that there can be good reasons for discrepancies and differences between the value two systems hold for a given concept, as well as bad reasons. Architectures that ignore this risk reducing the freedom of people to hold different views of a situation, or to carry out their business processes effectively.
Limiting discrepancies has to be a question of supporting implementation, creating validation tooling, offering reference implementations of systems that compute values, and monitor for discrepancies, and in providing guidance to users. I say this because:
I'll address the above comment in separate comments.
On contractingProcesses -> contractingProcessSummaries - I think the documentation is already very clear about what contractingProcesses should contain, and it is not true to say that it is only a summary found under here: it also provides access to releases with much more detail of the process level down.
In OCDS, a record has two fields for details (releases and versionedRelease – neither contains more information than the other) and a field for a summary (compiledRelease – which contains less information, as prior values are omitted). If the data structure for a record were instead that the compiled release had a field for releases (like contractingProcesses does), it wouldn't change the semantics of the compiled release being a summary. So, I think it is true that contractingProcesses are summaries. The quoted comment seems to confuse semantics and structure.
Furthermore, we know from experience that the documentation might be clear, but people nonetheless interpret the schema based on the terms it uses. I think it would be prudent to use terms that are less likely to be misinterpreted. I don't see a downside to the proposed change in term.
It seems Tim's comment is cut-off. Many comments seem tangential to the proposed changes – but I generally agree, and I fully understand the variety of scenarios this project is intending to support.
I, however, simply don't see much reason to have two very similar but incompatible schema for contracting data, and I don't see how aligning schema to achieve compatibility causes any harm.
I worked up an alternative project-schema.json to demonstrate. The changes are:
Rename ContractingProcess to Record
Make Record have a field for the summary (compiledRelease), and add a new CompiledRelease definition for all its properties except for releaseList
externalReference to id
tender object around fields
releaseList to releases
ReleaseListEntry to LinkedRelease
uri to url in ReleaseListEntry
releases from CompiledRelease to ContractingProcess
With these changes, more of this project's schema is compatible with OCDS. That's not to say that data following this schema would be valid in OCDS (e.g. there are missing required fields), but this schema, at least, is not using different terms for the same concept.
There are now far fewer differences. If we assume the use of ocds_process_title_extension, this project's schema only adds:
id: This might be a good addition to OCDS.
variations: See #5.
focus: If this were added to OCDS, we'd likely use a Classification object, to avoid the field being tied to infrastructure use cases only. To avoid adding complexity to this schema, and to avoid creating opportunities for future incompatibility, I suggest prefixing the field to scope it to infrastructure (discussion on naming in #11). It could then be treated like a local extension in the context of OCDS.
status: OCDS has no field for the status of the process as a whole, and is unlikely to add it, preferring to have narrower statuses. However, this doesn't represent an incompatibility.
costEstimate: In OCDS, this requires the use of tags in releases. So, it makes sense to add a new derived field in this schema, with a non-conflicting name.
suppliers: In OCDS, there are no fields about all awards. OK to add here.
contractValue: In OCDS, there are no fields about all contracts. OK to add here.
finalValue: In OCDS, there are no fields about all implementations. OK to add here.
administrativeEntity: This might be a good extension to OCDS.
documents: Since these documents can be a mix of planning, tender, award, contract, implementation, etc. documents, there is no correspondence in OCDS. OK to add here.
Anyway, the point here is to promote compatibility. When there is compatibility, there's:
If you look at the changes, they are small. I hope they can be accepted.
I haven't updated the Markdown files, added or updated titles and descriptions in the schema, or re-indented the schema. I figure that can follow once agreed in principle.
From discussion with @kindly we think aligning the structure of contractingProcess with compiledRelease makes sense for the reasons at the end of @jpmckinney's most recent comment.
There are some concerns regarding naming however:
Rename ContractingProcess to Record
The concern here is potential for confusion with an OCDS record, which is a different thing, because in OC for Infrastructure the ContractingProcess doesn't necessarily have an associated list of releases and can be produced from data entered directly into an infrastructure transparency portal, rather than being compiled from a list of releases.
Make Record have a field for the summary (compiledRelease)
Similarly, there is potential for confusion with an OCDS compiled release, which is a different thing, because in OC for Infrastructure the fields in compiledRelease aren't necessarily generated through compilation from releases.
Whilst aligning with the structure of a compiledRelease makes sense, it would be preferable to maintain a distinction in the names of the properties so that users don't conflate elements of the OC for Infrastructure schema with OCDS records and compiledReleases.
We discussed alternative names for these two properties to make the distinction between OCDS and OC for Infrastructure clear:
ContractingProcess to contractingProcess
compiledRelease to contractingProcessSummary
However there are issues with the use of 'summary' since the data in contractingProcessSummary could exist independently of any further detailed information (i.e. if no releases or variations exist).
Rather than mirroring the structure of an OCDS record and nesting contractingProcessSummary within contractingProcess, we propose aligning the structure of contractingProcess with an OCDS compiled release, with releaseList as a property and variations nested based on the outcome of discussion in #5, for example:
{
"contractingProcesses": [
{
"ocid": "",
"releaseList": [],
"variations": [],
"tender": {
"procurementMethod": ""
}
}
]
}
This approach delivers the benefits of compatibility without conflation of terms with OCDS.
If we're happy with this approach I'll work up the schema in a PR
This sounds reasonable, but I'd like to push one point a bit. In OCDS, in contexts where publishers are unable to publish a series of releases, as I remember, our guidance has been to update a single release – essentially treating it as a compiled release. (I can't find an example – do you remember?) In such contexts, records wouldn't have a list of releases, and compiled releases wouldn't be the result of compilation. I'm not sure that there is a conceptual difference here.
The Record is described as:
An OCDS record must provide a list of all the existing OCDS releases relating to a single contracting process and should provide a compiled release containing the current state of all fields in the OCDS schema. An OCDS record may also provide a versioned history of all changes to the data in the compiled release.
And the compiledRelease is described as:
This is the latest version of all the contracting data, it has the same schema as an open contracting release.
Essentially, the two concepts are "all available data (within this system) about one contracting process", and "the present state of one contracting process". The concepts, from this perspective, are independent of how the data is generated. In general, our concepts should relate to facts about the world – not facts about the data production process. The fact of whether a compiledRelease was compiled or not is unimportant. What is important is that it is "the present state of one contracting process." (I admit the names of the two concepts need improvement.)
As such, I don't see a conceptual difference between the two concepts in the project schema and the two concepts in the record package schema.
As I understand, the differences between your proposal and mine are:
Keep releaseList instead of releases. Can you explain?
ContractingProcess instead of Record. The names of JSON Schema definitions are invisible to data users, so this is acceptable – though I'm not sure there is a real conceptual difference and would prefer consistency.
Avoid wrapping fields in a compiledRelease field. I think it's better to put the "present state" and the "full history" side-by-side, rather than embed the full history inside the present state. I don't see how conceptually a full history belongs inside a present state, how a full history is a child of a present state, etc. If we can't agree on compiledRelease, then we can use another term. The word 'summary' (without using the name of the class (ContractingProcess) as a prefix) is fine, because the data is a summarization of the real-world contracting process (every data representation lacks detail from the real world).
@jpmckinney @duncandewhurst
My current preference is for it to look something like:
{
"contractingProcesses": [{
"contractingProcess": {
"ocid": "",
"variations": [],
"tender": {
"procurementMethod": ""
}
},
"ocdsReleases": []
}]
}
So in short:
compiledRelease name it contractingProcess (i.e. the current state of the contracting process)
ocdsReleases.
I do have a slight issue with contractingProcess as a name as it is a singular version of its parent contractingProcesses but it has more than one property within it. However, I agree with @jpmckinney that the ocdsReleases should ideally not be nested within the contractingProcess.
Nonetheless, I prefer this to the compiledRelease as the compiledRelease in this case explicitly does not have the same schema as the releases.
So if we could come up with a better name for contractingProcess/compiledRelease, that would be ideal in my opinion. All the names I can come up with are not great, i.e. currentProcessStatus, currentProcessState, ocdsInfrastructureProcess
In OCDS, in contexts where publishers are unable to publish a series of releases, as I remember, our guidance has been to update a single release – essentially treating it as a compiled release. (I can't find an example – do you remember?) In such contexts, records wouldn't have a list of releases, and compiled releases wouldn't be the result of compilation. I'm not sure that there is a conceptual difference here.
This is subtly different from what we have been recommending in this scenario, which is that publishers update a single release but update the release identifier when they do so, thus from the perspective of a user who, for example, scrapes the data daily, the publisher is actually publishing multiple releases even though only the latest release for each contracting process is available at any given time.
We have also been recommending that they publish these as releases (i.e. wrapped in release package) rather than as a compiledRelease (i.e. as part of a record) since a record must have a list of all releases, which such publishers cannot provide.
In terms of the descriptions, for record:
An OCDS record must provide a list of all the existing OCDS releases relating to a single contracting process...
This isn't true for the proposed use in OC for Infrastructure, since there may be no list of releases.
For compiledRelease:
...it has the same schema as an open contracting release.
This isn't true for the proposed use in OC for Infrastructure, since the schema differs.
Regarding the other points:
Keep releaseList instead of releases. Can you explain?
Sorry, that wasn't intentional. I prefer the name proposed by @kindly above (ocdsReleases)
Avoid wrapping fields in a compiledRelease field. I think it's better to put the "present state" and the "full history" side-by-side, rather than embed the full history inside the present state. I don't see how conceptually a full history belongs inside a present state, how a full history is a child of a present state, etc. If we can't agree on compiledRelease, then we can use another term.
This was to avoid use of summary and to get around needing a second name other than contractingProcess, however if we can agree another term I'm happy to keep these side by side.
I understand that there are technical differences (releases being a required field, etc.); my point has been that technical differences don't amount to conceptual differences. That said, the words for the OCDS terms imply technical details ('compiled' release), and the definitions describe technical details (releases being required), and so OCDS' terms make a poor choice for reuse (until and unless at a later date we clearly separate the concept's semantics from its implementation details).
So, let's go with ContractingProcess for the object definition.
For the list of releases, I'd prefer a term that doesn't imply implementation details, to go along with the type of separation described above (ocdsReleases implies the use of OCDS). Why not simply releases? There are no such things as 'OCInfra releases' – the only things called 'releases' are OCDS, so there is no opportunity for confusion. We currently have releaseList; the typical pattern for naming a list of things is to use the plural, not to suffix 'List'. I don't think we need to start a pattern of prefixing a standard's acronym to terms reused from that standard.
As for contractingProcess, having a field match the name of the class to which it belongs is confusing, e.g. a Cat class with a cat property would be surprising. The only reason given against summary has been:
However there are issues with the use of 'summary' since the data in contractingProcessSummary could exist independently of any further detailed information (i.e. if no releases or variations exist).
As explained, I don't see how this is an issue. The summary would exist, while further detailed information would exist in the real-world, in contracting systems' databases, in other data formats, etc. We can call it a synonym like 'digest' if preferred.
Okay great, I'm happy to settle on using releases and summary as you propose.
I've updated the 9-demonstrator branch with these changes, along with updates to the titles, descriptions and markdown files (PR https://github.com/open-contracting/infrastructure/pull/29)
I've also reverted the id back to externalReference, to keep it distinct from the id field in an OCDS compiledRelease, since the two fields have significantly different definitions (although the merging rules state that the id field should be omitted from a compiledRelease in OCDS, it is still present in the schema).
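For anyone following along, the agreed arrangement, with the present state under summary and the linked releases alongside it under releases, could be produced from the earlier embedded proposal roughly as follows. This is only a sketch: the function name is hypothetical, and the exact placement of fields such as ocid is an assumption based on the examples in this thread, not on the schema in the PR:

```python
from typing import Any


def split_summary_and_releases(process: dict[str, Any]) -> dict[str, Any]:
    """Move an embedded releaseList out so that summary and releases sit side by side."""
    # Assumption: everything except releaseList belongs to the present-state summary.
    summary = {k: v for k, v in process.items() if k != "releaseList"}
    return {"summary": summary, "releases": process.get("releaseList", [])}


embedded = {
    "ocid": "ocds-abc-1",  # illustrative identifier only
    "releaseList": [{"id": "r1"}],
    "tender": {"procurementMethod": "open"},
}
print(split_summary_and_releases(embedded))
```

A process with no linked releases simply gets an empty releases list, so the summary can exist independently of any further detail.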
The CoST-IDS mapping refers only to ContractingProcess when mapping process-level information. However, I would have anticipated, at least, a parallel mapping to OCDS fields for cases where there is an OCDS publication; this is presently only reflected, on that page, as a note that automatic population is possible (I assume this will be detailed in later guidance). The current mapping suggests the primacy of the ContractingProcess fields, many of which I might have expected to be optional where OCDS data is available.
Elsewhere, the relationship is made more clear, e.g. on https://open-contracting.github.io/infrastructure/projects/
I think this is a very important point, which should be highlighted and/or repeated. By having process-level data in the project schema, I consider there to be significant risks of:
I haven't checked, but my feeling is that ContractingProcess should contain the minimum fields to satisfy IDS, to mitigate risks. However, I might go further and consider whether the OCDS fields present in ContractingProcess couldn't be replaced by embedding/linking to a compiled release from ContractingProcess. Other options include:
ContractingProcess objects from OCDS data, to increase the likelihood that publishers focus on OCDS for process-level data, then use this script to populate the project data
ContractingProcess, which should not be much more difficult than authoring a ContractingProcess object, to encourage adoption of OCDS for process-level data