open-contracting / infrastructure

Documentation of the Open Contracting for Infrastructure Data Standards (OC4IDS) Toolkit
https://standard.open-contracting.org/infrastructure/

Contract-level information in project schema #9

Closed jpmckinney closed 5 years ago

jpmckinney commented 6 years ago

The CoST-IDS mapping refers only to ContractingProcess when mapping process-level information. However, I would have anticipated, at least, a parallel mapping to OCDS fields for cases where there is an OCDS publication; this is presently only reflected, on that page, as a note that automatic population is possible (I assume this will be detailed in later guidance). The current mapping suggests the primacy of the ContractingProcess fields, many of which I might have expected to be optional where OCDS data is available.

Elsewhere, the relationship is made more clear, e.g. on https://open-contracting.github.io/infrastructure/projects/

Where OCDS data is available, the contracting details section should act as an index of (cached) OCDS releases, with explanations of any variations detected when comparing releases.

I think this is a very important point, which should be highlighted and/or repeated. By having process-level data in the project schema, I consider there to be significant risks of:

I haven't checked, but my feeling is that ContractingProcess should contain the minimum fields to satisfy IDS, to mitigate risks. However, I might go further and consider whether the OCDS fields present in ContractingProcess couldn't be replaced by embedding/linking to a compiled release from ContractingProcess. Other options include:

  1. The documentation might strongly recommend the implementation of OCDS for contracting data, with a non-OCDS approach being framed closer to a last resort
  2. Code might be provided for generating ContractingProcess objects from OCDS data, to increase the likelihood that publishers focus on OCDS for process-level data, then use this script to populate the project data
  3. Guidance might be authored to show how to author a compiled release that contains all the information in ContractingProcess, which should not be much more difficult than authoring a ContractingProcess object, to encourage adoption of OCDS for process-level data

timgdavies commented 6 years ago

This is a good and important discussion.

To weigh up, there are also risks of not having process level data in the project schema, in that:

Whilst in the case of 'perfect data' the project level data may be redundant, given experience of real-world publication practices, I don't think it will be.

However, I think we will need to do more in implementation guidance to make sure this doesn't cause confusion, and to mitigate the risks that it creates ambiguity for users and publishers. For this, I had anticipated we would be most likely to explore your option 2 above, in providing worked examples (and possibly code) to show how, in cases of good contracting data, it should be possible to populate the contractingProcess level data from OCDS.
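A minimal sketch of what option 2 might look like, assuming a compiled release is available as a Python dict. The helper name `summarize_compiled_release` is hypothetical, and only `ocid`, `procurementMethod` and `procurementMethodDetails` are taken from this thread; a real mapping would need to cover the full set of ContractingProcess fields.

```python
# Hypothetical helper, not part of any published tooling.
# Copies a few tender-level fields from an OCDS compiled release
# into a flat, ContractingProcess-style summary.
TENDER_FIELDS = ("procurementMethod", "procurementMethodDetails")


def summarize_compiled_release(compiled_release):
    """Return a flat summary dict built from a compiled release."""
    tender = compiled_release.get("tender") or {}
    summary = {"ocid": compiled_release.get("ocid")}
    for field in TENDER_FIELDS:
        # Only copy fields the source actually provides, so that gaps in
        # the OCDS data remain visible as gaps in the summary.
        if field in tender:
            summary[field] = tender[field]
    return summary


if __name__ == "__main__":
    # Illustrative input only; the ocid is a placeholder.
    example = {"ocid": "ocds-placeholder-1", "tender": {"procurementMethod": "open"}}
    print(summarize_compiled_release(example))
```

Leaving absent fields out, rather than filling them with empty strings, is a design choice here: it keeps missing source data distinguishable from deliberately empty values.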

jpmckinney commented 6 years ago

The ContractingProcess definition is not so far off from being aligned with release-schema.json. For example, if some fields were scoped within a tender object or within a single-item awards or contracts array, then there would already be more alignment between the two. I don't think having these objects and arrays makes implementation and use too onerous on implementers or users. Essentially, my first, unnumbered proposal (and/or option 3) is to replace ContractingProcess with a cut-down compiled release. I don't think that proposal contributes to the risks you list.

timgdavies commented 6 years ago

I'm not sure I see the grounds for your assertions that this would not make implementation and use too onerous, nor that the suggested approach doesn't contribute to the risks noted.

Your proposal appears to place the burden of interpreting 'summary values' onto the user rather than the publisher of the data - which, from my experience of implementation, is something we want to avoid.

It may help to think about ContractingProcess as a summary of detailed broken down releases - and to note that this exists because the CoST IDS (and its associated business processes) generally operate at this level of summary, rather than at the level of disaggregated data. If we only offer the option of using releases, and ask people to create 'simulated' releases in cases where they have not collected disaggregated data - we are asking people to create fictional disaggregated data, rather than just to express the summary data they have, which I think risks more confusion than having the extra level of data and data model.

timgdavies commented 6 years ago

At @duncandewhurst's request I've done some more detailed exploration of modelling options here.

Structure

@duncandewhurst has raised one method of aligning ContractingProcess with an OCDS release, to avoid similar-but-different data structures: where a ContractingProcess summary field would be derived from a particular field in OCDS releases, it would follow the structure of the release.

For example, instead of the current structure which is relatively flat:

"contractingProcesses":[{
    "procurementMethod":"",
    "procurementMethodDetails":""
}]

we would have:

"contractingProcesses":[{
    "tender":{
        "procurementMethod":"",
        "procurementMethodDetails":""
    }
}]
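The difference between the two structures above can be sketched as a small transformation. This assumes only the two tender fields shown in the examples; the function name `nest_tender_fields` is hypothetical and the mapping for other fields is not settled in this thread.

```python
# Hypothetical sketch: move the tender-derived fields of a flat
# contracting process entry under a "tender" object, mirroring the
# structure of an OCDS release.
TENDER_FIELDS = ("procurementMethod", "procurementMethodDetails")


def nest_tender_fields(flat_process):
    """Return a copy of flat_process with tender fields nested under 'tender'."""
    # Keep every field that is not tender-derived at the top level.
    nested = {k: v for k, v in flat_process.items() if k not in TENDER_FIELDS}
    # Collect the tender-derived fields that are actually present.
    tender = {k: flat_process[k] for k in TENDER_FIELDS if k in flat_process}
    if tender:
        nested["tender"] = tender
    return nested


if __name__ == "__main__":
    flat = {"procurementMethod": "", "procurementMethodDetails": ""}
    print(nest_tender_fields(flat))
```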

Whilst this works for some fields (particularly from tender) it is more complex for others. For example:

Embedding / linking

@jpmckinney suggests above an alternative approach of "embedding/linking to a compiled release from ContractingProcess" instead of including fields in ContractingProcess. In this situation, the following fields, which can be partially derived from a compiled release, would be removed from ContractingProcess:

For the starred items, I can see specific complexity in relying on extracting data from compiledRelease either due to the structure of a compiledRelease or experience of current data supply, namely:

(1): the costEstimate should be the tender.value from the point of planning, and if tender.value is revised during the tender process, costEstimate should not be updated. However, in a compiledRelease, an earlier tender.value becomes inaccessible, and is only available if a versioned release is also created, or by looking at all releases.

(2) and (3): we don't see widespread use of the transactions or bids features. In the case of transactions, there may be cases where systems also want to fetch this data from a different finance system.

(4) administrativeEntity is not currently included in OCDS, and we are not aware of implementers capturing this.
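To illustrate point (1), here is a rough sketch of recovering an early tender.value by looking at all releases rather than the compiled release. It assumes each release carries an OCDS `date` in a consistent ISO 8601 format (so string sorting matches chronological order), and it treats "the earliest release that has a tender.value" as a stand-in for "the point of planning", which is an assumption rather than a rule from the standard. The function name is hypothetical.

```python
def first_tender_value(releases):
    """Return tender.value from the earliest release that carries one, else None.

    Assumes every release has a 'date' string in the same ISO 8601 format.
    """
    for release in sorted(releases, key=lambda r: r["date"]):
        value = (release.get("tender") or {}).get("value")
        if value is not None:
            return value
    return None


if __name__ == "__main__":
    # Illustrative releases only; amounts and dates are made up.
    releases = [
        {"date": "2019-02-01T00:00:00Z", "tender": {"value": {"amount": 120}}},
        {"date": "2019-01-01T00:00:00Z", "tender": {"value": {"amount": 100}}},
    ]
    print(first_tender_value(releases))
```

A compiled release built from these two releases would only retain the later value, which is the inaccessibility problem described in (1).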

In all the other cases, we need to be aware that there will be cases where data from procurement systems may have (a) missing values; or (b) values that are incomplete or not 100% trustworthy.

The current design of ContractingProcess provides redundancy for the cases where either the data in a compiledRelease is missing, or is not trusted as an accurate summary, such that those curating the ContractingProcess summary want to over-ride it from some other source. It also places the decision making over where in the releases to extract data from onto the data producer, not the user.

The alternatives to this I think are to:

What I've seen so far makes me discount the first option. The second feels to me more complex, both for data publishers and for data users, than having some redundancy in ContractingProcess.

Conclusions

I think the current model strikes the right balance when it comes to implementation. The schema includes guidance on how ContractingProcess fields may be derived from OCDS releases, but keeps this language at the level of MAY, rather than SHOULD, to ensure the project level data specification can be used in a wide range of implementation scenarios.

I will, however, do some work now on improving the language around ContractingProcess to make sure its status as a summary is clear.

jpmckinney commented 6 years ago

My suggestion in https://github.com/open-contracting/infrastructure/issues/9#issuecomment-424138658 was the same as @duncandewhurst's regarding structure, so thank you for considering this option.

With respect to the proposal, I propose renaming ContractingProcess and contractingProcesses to reflect the semantics of being a summary, e.g. ContractingProcessSummary and contractProcessSummaries. Otherwise, there will be no indication within the data that these are intended to be interpreted by users as summaries.

With regards to specific comments:

With regards to general comments on approach:

I don't understand in what sense any data here is 'fictional' or 'simulated'. Many implementers provide historical OCDS data. Wouldn't that also be fictional and simulated?

I also don't understand how making cosmetic changes to ContractingProcess, to make it align more with the release schema (wrapping groups of fields with "tender": { … }, "awards": [{ … }], and "contracts": [{ … }]), can change the data from summary data to disaggregated data. A compiled release is already a summary; it's a merger of releases.

Lastly, gaps and inaccuracies should be addressed, but I'm not sure that adding another source (which may have its own gaps and inaccuracies) is the best solution. A common complaint among users is about discrepancies across different sources, and the proposal adds more opportunities for these. For example, when a summary contradicts OCDS data, a user can't tell whether it's because the publisher is correcting an error (which, presumably, it was incapable of fixing at the source) or it's because the publisher made an error in preparing the summary. The proposal does place the decision of how to extract information from OCDS data onto the publisher, but the user is still left with the challenges of comparing the summary and handling any discrepancies. The user might not be any further in making sense of the jurisdiction's data as a whole – unless they simply trust one source and ignore the others. In terms of trust, though, I don't estimate the likelihood that a publisher 'gets it right' with a new dataset to be significantly higher than with existing datasets. In short, I think users are ultimately better served by publishers focusing on the improvement of one dataset in one format, rather than making corrections to reformatted data. That said, my other comments are independent of that opinion.
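The comparison problem described above can be made concrete with a short sketch. It assumes the summary uses the release-aligned structure (fields nested under `tender`) and that a compiled release is available; the function name `tender_discrepancies` is hypothetical.

```python
def tender_discrepancies(summary, compiled_release):
    """Map each shared tender field to (summary value, OCDS value) where they differ."""
    summary_tender = summary.get("tender") or {}
    release_tender = compiled_release.get("tender") or {}
    # Only fields present in both sources can contradict each other.
    shared = summary_tender.keys() & release_tender.keys()
    return {
        field: (summary_tender[field], release_tender[field])
        for field in shared
        if summary_tender[field] != release_tender[field]
    }


if __name__ == "__main__":
    summary = {"tender": {"procurementMethod": "open"}}
    compiled = {"tender": {"procurementMethod": "limited"}}
    print(tender_discrepancies(summary, compiled))
```

Note that a check like this can only report that the two sources disagree. It cannot say whether the summary is a correction or an error, which is the ambiguity raised in the paragraph above.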

timgdavies commented 6 years ago

On contractingProcesses -> contractingProcessSummaries - I think the documentation is already very clear about what contractingProcesses should contain, and it is not true to say that it is only a summary found under here: it also provides access to releases with much more detail of the process level down.

I think you may be missing the important point here that not all implementers of OC for Infrastructure will have access to reliable incoming OCDS Data from which to populate their project level data, or there will be cases where that data is patchy because it comes from procurement systems over which they have limited control such that, for example:

Whilst theoretically it may be possible to fix or extend the OCDS data, get more fields published within that data, and improve its coverage, practical experience shows that this (a) takes a long time; (b) requires political work to secure support for changes; (c) may be outside the control of the team wanting to undertake infrastructure monitoring.

In these cases then, I see two options:

It is (1) I was referring to as 'simulated' data. Whilst I can see this has some merits as involving fewer new fields: (a) the releases and merging model of OCDS is not as tried-and-tested as I would like it to be in order to have confidence in this working well, and being really well understood; (b) it feels like it would be much harder to explain and architect for in implementation terms.

The design is based on a grounded exploration of real world data, and discussions about the real-world needs of infrastructure monitors.

I also don't think it is the purpose of a data model to resolve discrepancies between data sources. These need to be addressed at the level of business process and data quality management, not at the level of the data model. It is important to recognise that there can be good reasons for discrepancies and differences between the value two systems hold for a given concept, as well as bad reasons. Architectures that ignore this risk reducing the freedom of people to hold different views of a situation, or to carry out their business processes effectively.

Limiting discrepancies has to be a question of supporting implementation, creating validation tooling, offering reference implementations of systems that compute values and monitor for discrepancies, and providing guidance to users. I say this because:

jpmckinney commented 6 years ago

I'll address the above comment in separate comments.

On contractingProcesses -> contractingProcessSummaries - I think the documentation is already very clear about what contractingProcesses should contain, and it is not true to say that it is only a summary found under here: it also provides access to releases with much more detail of the process level down.

In OCDS, a record has two fields for details (releases and versionedRelease – neither contains more information than the other) and a field for a summary (compiledRelease – which contains less information, as prior values are omitted). If the data structure for a record were instead that the compiled release had a field for releases (like contractingProcesses does), it wouldn't change the semantics of the compiled release being a summary. So, I think it is true that contractingProcesses are summaries. The quoted comment seems to confuse semantics and structure.

Furthermore, we know from experience that the documentation might be clear, but people nonetheless interpret the schema based on the terms it uses. I think it would be prudent to use terms that are less likely to be misinterpreted. I don't see a downside to the proposed change in term.

jpmckinney commented 6 years ago

It seems Tim's comment is cut-off. Many comments seem tangential to the proposed changes – but I generally agree, and I fully understand the variety of scenarios this project is intending to support.

I, however, simply don't see much reason to have two very similar but incompatible schema for contracting data, and I don't see how aligning schema to achieve compatibility causes any harm.

I worked up an alternative project-schema.json to demonstrate. The changes are:

With these changes, more of this project's schema is compatible with OCDS. That's not to say that data following this schema would be valid in OCDS (e.g. there are missing required fields), but this schema, at least, is not using different terms for the same concept.

There are now far fewer differences. If we assume the use of ocds_process_title_extension, this project's schema only adds:

Anyway, the point here is to promote compatibility. When there is compatibility, there's:

If you look at the changes, they are small. I hope they can be accepted.

I haven't updated the Markdown files, added or updated titles and descriptions in the schema, or re-indented the schema. I figure that can follow once agreed in principle.

duncandewhurst commented 5 years ago

From discussion with @kindly we think aligning the structure of contractingProcess with compiledRelease makes sense for the reasons at the end of @jpmckinney's most recent comment.

There are some concerns regarding naming however:

Rename ContractingProcess to Record

The concern here is potential for confusion with an OCDS record, which is a different thing, because in OC for Infrastructure the ContractingProcess doesn't necessarily have an associated list of releases and can be produced from data entered directly into an infrastructure transparency portal, rather than being compiled from a list of releases.

Make Record have a field for the summary (compiledRelease)

Similarly, there is potential for confusion with an OCDS compiled release, which is a different thing, because in OC for Infrastructure the fields in compiledRelease aren't necessarily generated through compilation from releases.

Whilst aligning with the structure of a compiledRelease makes sense it would be preferable to maintain a distinction in the names of the properties so that users don't conflate elements of the OC for Infrastructure schema with OCDS records and compiledReleases.

We discussed alternative names for these two properties to make the distinction between OCDS and OC for Infrastructure clear:

However there are issues with the use of 'summary' since the data in contractingProcessSummary could exist independently of any further detailed information (i.e. if no releases or variations exist).

Rather than mirroring the structure of an OCDS record and nesting contractingProcessSummary within contractingProcess, we propose aligning the structure of contractingProcess with an OCDS compiled release, with releaseList as a property and variations nested based on the outcome of discussion in #5, for example:

{
   "contractingProcesses": [
      {
         "ocid": "",
         "releaseList": [],
         "variations": [],
         "tender": {
            "procurementMethod": ""
         }
      }
   ]
}

This approach delivers the benefits of compatibility without conflation of terms with OCDS.

If we're happy with this approach I'll work up the schema in a PR

jpmckinney commented 5 years ago

This sounds reasonable, but I'd like to push one point a bit. In OCDS, in contexts where publishers are unable to publish a series of releases, as I remember, our guidance has been to update a single release – essentially treating it as a compiled release. (I can't find an example – do you remember?) In such contexts, records wouldn't have a list of releases, and compiled releases wouldn't be the result of compilation. I'm not sure that there is a conceptual difference here.

The Record is described as:

An OCDS record must provide a list of all the existing OCDS releases relating to a single contracting process and should provide a compiled release containing the current state of all fields in the OCDS schema. An OCDS record may also provide a versioned history of all changes to the data in the compiled release.

And the compiledRelease is described as:

This is the latest version of all the contracting data, it has the same schema as an open contracting release.

Essentially, the two concepts are "all available data (within this system) about one contracting process", and "the present state of one contracting process". The concepts, from this perspective, are independent of how the data is generated. In general, our concepts should relate to facts about the world – not facts about the data production process. The fact of whether a compiledRelease was compiled or not is unimportant. What is important is that it is "the present state of one contracting process." (I admit the names of the two concepts need improvement.)

As such, I don't see a conceptual difference between the two concepts in the project schema and the two concepts in the record package schema.

As I understand, the differences between your proposal and mine are:

kindly commented 5 years ago

@jpmckinney @duncandewhurst

My current preference is for it to look something like:

{
   "contractingProcesses": [{
      "contractingProcess": {
         "ocid": "",
         "variations": [],
         "tender": {
            "procurementMethod": ""
         }
      },
      "ocdsReleases": []
   }]
}

So in short:

I do have a slight issue with contractingProcess as a name, as it is a singular version of its parent contractingProcesses but it has more than one property within it. However, I agree with @jpmckinney that the ocdsReleases should ideally not be nested within the contractingProcess.

Nonetheless, I prefer this to the compiledRelease as the compiledRelease in this case explicitly does not have the same schema as the releases.

So if we could come up with a better name for contractingProcess/compiledRelease, that would be ideal in my opinion. All the names I can come up with are not great, e.g. currentProcessStatus, currentProcessState, ocdsInfrastructureProcess

duncandewhurst commented 5 years ago

In OCDS, in contexts where publishers are unable to publish a series of releases, as I remember, our guidance has been to update a single release – essentially treating it as a compiled release. (I can't find an example – do you remember?) In such contexts, records wouldn't have a list of releases, and compiled releases wouldn't be the result of compilation. I'm not sure that there is a conceptual difference here.

This is subtly different from what we have been recommending in this scenario, which is that publishers update a single release but change the release identifier when they do so. From the perspective of a user who, for example, scrapes the data daily, the publisher is then actually publishing multiple releases, even though only the latest release for each contracting process is available at any given time.

We have also been recommending that they publish these as releases (i.e. wrapped in a release package) rather than as a compiledRelease (i.e. as part of a record), since a record must have a list of all releases, which such publishers cannot provide.
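The daily-scraping scenario can be sketched as follows, assuming each scraped release has the OCDS `ocid` and `id` fields and that the publisher changes `id` on every update. The function name `record_snapshot` and the in-memory `history` dict are illustrative assumptions, not part of any existing tool.

```python
def record_snapshot(history, release):
    """Store a scraped release under its ocid, keyed by release id.

    Returns True if this release id has not been seen before for that ocid.
    """
    by_id = history.setdefault(release["ocid"], {})
    is_new = release["id"] not in by_id
    # Keep the first copy seen for each release id; later identical
    # scrapes of the same release do not overwrite it.
    by_id.setdefault(release["id"], release)
    return is_new


if __name__ == "__main__":
    history = {}
    # Day 1 and day 2 scrapes of the same contracting process; the
    # publisher overwrote the release but changed its identifier.
    record_snapshot(history, {"ocid": "ocds-placeholder-1", "id": "1"})
    record_snapshot(history, {"ocid": "ocds-placeholder-1", "id": "2"})
    print(len(history["ocds-placeholder-1"]))
```

Under these assumptions the scraper ends up holding a series of releases even though the publisher only ever exposes the latest one.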

In terms of the descriptions, for record:

An OCDS record must provide a list of all the existing OCDS releases relating to a single contracting process...

This isn't true for the proposed use in OC for Infrastructure, since there may be no list of releases.

For compiledRelease:

...it has the same schema as an open contracting release.

This isn't true for the proposed use in OC for Infrastructure, since the schema differs.

Regarding the other points:

Keep releaseList instead of releases. Can you explain?

Sorry, that wasn't intentional. I prefer the name proposed by @kindly above (ocdsReleases)

Avoid wrapping fields in a compiledRelease field. I think it's better to put the "present state" and the "full history" side-by-side, rather than embed the full history inside the present state. I don't see how conceptually a full history belongs inside a present state, how a full history is a child of a present state, etc. If we can't agree on compiledRelease, then we can use another term.

This was to avoid use of summary and to get around needing a second name other than contractingProcess, however if we can agree another term I'm happy to keep these side by side.

jpmckinney commented 5 years ago

I understand that there are technical differences (releases being a required field, etc.); my point has been that technical differences don't amount to conceptual differences. That said, the words for the OCDS terms imply technical details ('compiled' release), and the definitions describe technical details (releases being required), and so OCDS' terms make a poor choice for reuse (until and unless at a later date we clearly separate the concept's semantics from its implementation details).

So, let's go with ContractingProcess for the object definition.

For the list of releases, I'd prefer a term that doesn't imply implementation details, to go along with the type of separation described above (ocdsReleases implies the use of OCDS). Why not simply releases? There are no such things as 'OCInfra releases' – the only things called 'releases' are OCDS, so there is no opportunity for confusion. We currently have releaseList; the typical pattern for naming a list of things is to use the plural, not to suffix 'List'. I don't think we need to start a pattern of prefixing a standard's acronym to terms reused from that standard.

As for contractingProcess, having a field match the name of the class to which it belongs is confusing, e.g. a Cat class with a cat property would be surprising. The only reason given against summary has been:

However there are issues with the use of 'summary' since the data in contractingProcessSummary could exist independently of any further detailed information (i.e. if no releases or variations exist).

As explained, I don't see how this is an issue. The summary would exist, while further detailed information would exist in the real-world, in contracting systems' databases, in other data formats, etc. We can call it a synonym like 'digest' if preferred.

duncandewhurst commented 5 years ago

Okay great, I'm happy to settle on using releases and summary as you propose.

I've updated the 9-demonstrator branch with these changes, along with updates to the titles, descriptions and markdown files (PR https://github.com/open-contracting/infrastructure/pull/29)

I've also reverted the id back to externalReference, to keep it distinct from the id field in an OCDS compiledRelease, since the two fields have significantly different definitions (although the merging rules state that the id field should be omitted from a compiledRelease in OCDS, it is still present in the schema).
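For readers following the naming discussion, the two agreed renames can be sketched as a transformation of the earlier side-by-side structure (`contractingProcess` plus `ocdsReleases`). This is only an illustration of the renames to `summary` and `releases`; the actual schema is in the PR linked above, and the function name `apply_settled_names` is hypothetical.

```python
# Renames agreed in this thread: the "present state" object becomes
# "summary" and the list of OCDS releases becomes "releases".
RENAMES = {"contractingProcess": "summary", "ocdsReleases": "releases"}


def apply_settled_names(process):
    """Return a copy of one contractingProcesses entry using the settled names."""
    return {RENAMES.get(key, key): value for key, value in process.items()}


if __name__ == "__main__":
    earlier = {
        "contractingProcess": {
            "ocid": "",
            "variations": [],
            "tender": {"procurementMethod": ""},
        },
        "ocdsReleases": [],
    }
    print(apply_settled_names(earlier))
```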