open-contracting / standard

Documentation of the Open Contracting Data Standard (OCDS)
http://standard.open-contracting.org/

Add describedby field for the extended release schema #426

Open irwink opened 7 years ago

irwink commented 7 years ago

Edit: This issue effectively starts at https://github.com/open-contracting/standard/issues/426#issuecomment-718963839

$schema is meant for the "metaschema", not for the "schema". The linked comment proposes using a describedby field to link to the schema.


I suggest that a "$schema" field be added to all contracting data files. The "$schema" field's value would be either a single URI of the schema that the data claims to conform to, or a list of schemas (e.g. if the data conforms to the OCDS and an extension schema). This would be very useful both for quality assurance and for parsing and consuming the contracting data. Programs would know which schema the data conforms to and how to parse it properly. This is especially useful if the data repository contains a mix of data files that conform to the OCDS, an extension schema or some other schema.

An example would be (using the Paraguay sample data)

{
  "uri": "https://www.contrataciones.gov.py/datos/record-package/273637.json",
  "$schema": "http://standard.open-contracting.org/schema/1__0__1/release-schema.json",
  "publisher": {
    "uri": "https://contrataciones.gov.py/datos",
    "legalName": "Dirección Nacional de Contrataciones Públicas, Paraguay",
    "name": "DNCP - Paraguay"
  },

timgdavies commented 7 years ago

In OCDS 1.1 (see #301) we were planning to handle this with two properties:

Although I note that, according to JSON Schema, the $schema keyword can be used for both version and schema declaration.

The reason I believe for diverging from JSON Schema here was:

But, other views on this welcome.

Assigning to @kindly and @bjwebb to have a quick glance at whether we should alter the OCDS 1.1 approach before we're committed to it too strongly.

irwink commented 7 years ago

The suggestions in issue #301 would handle the extensions problem, however, an application would have to "know" that a JSON file claims to conform to the Open Contracting standard and would also have to know where the schema is located (to validate against it). Furthermore, if the JSON file repository contains a mixture of Open Contracting files and other non Open Contracting files, there is no predictable way to distinguish them. The use of a "$schema" field (or some other widely adopted equivalent) would provide an explicit schema reference (similar to a DOCTYPE declaration in a web page).
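
As a rough illustration of the point above, a consumer could sort a mixed repository by reading the declared schema from each file. This is only a sketch: the top-level field name "$schema" follows the proposal in this issue and is not part of OCDS, and the URL prefix is taken from the sample data earlier in the thread.

```python
import json

# Assumption: the schema reference is a top-level field named "$schema",
# as proposed in this issue. It may be a single URI or a list of URIs.
SCHEMA_FIELD = "$schema"

# Assumption: prefix copied from the sample data in the issue description.
OCDS_PREFIX = "http://standard.open-contracting.org/schema/"


def declared_schemas(text, field=SCHEMA_FIELD):
    """Return the schema URI(s) that a JSON document declares, as a list."""
    data = json.loads(text)
    if not isinstance(data, dict):
        return []
    value = data.get(field)
    if isinstance(value, str):
        return [value]
    if isinstance(value, list):
        return [item for item in value if isinstance(item, str)]
    return []


def is_open_contracting(text, prefix=OCDS_PREFIX):
    """Guess whether a document claims to conform to an OCDS schema."""
    return any(uri.startswith(prefix) for uri in declared_schemas(text))
```

A file without the field simply yields an empty list, so it can be routed to whatever fallback handling the application already has.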

mireille-raad commented 7 years ago

In the same spirit, it would be useful to have a field similar to $schema but for extensions. As implementations get more complex, and as multiple extensions are used, it would be useful to have a reference to all of them somewhere. Maybe $extension would be a closed codelist of the official OCDS extensions.

jpmckinney commented 7 years ago

Regarding extensions, couldn't $schema be a URL of a release schema that has been patched with the relevant extensions? The value of $schema in this case would not be useful for identifying the version of OCDS, but the purpose of $schema in JSON Schema is for validation - not for version identification.
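
For readers unfamiliar with the patching step: extensions are applied to the release schema as JSON Merge Patches (RFC 7386). A minimal sketch of that operation follows; the schema fragments in the usage note are hypothetical and much smaller than real OCDS schemas.

```python
def merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7386) and return the patched value.

    The inputs are not modified.
    """
    # A non-object patch replaces the target outright.
    if not isinstance(patch, dict):
        return patch
    # Patching a non-object with an object starts from an empty object.
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            # null in a patch means "remove this member".
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


def apply_extensions(release_schema, extension_patches):
    """Apply extension patches in order and return the extended schema."""
    schema = release_schema
    for patch in extension_patches:
        schema = merge_patch(schema, patch)
    return schema
```

For example, `apply_extensions({"properties": {"tender": {"type": "object"}}}, [{"properties": {"bids": {"type": "object"}}}])` returns a schema with both `tender` and `bids` under `properties`. The result of such a call is what a patched schema URL would serve.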

jpmckinney commented 4 years ago

Copying comment from https://github.com/open-contracting/infrastructure/pull/89

Regarding versioning, this might be better handled by using the $schema property, which is part of JSON Schema. That property is standardized, and thus has a lot of existing tooling that understands it, and can use it to perform JSON Schema validation.

kindly commented 4 years ago

I think the use of the $schema flag is a good idea, and it is really good for validators not to need to json-merge-patch the extensions themselves. However, I am also worried about publishers' ability to do this compilation and to host a version of a new schema.

So in order to do this well I think we will need to host some kind of service that creates the extended schema for the publishers.

So, a tool in which you can select a set of extensions from the extension explorer, and which then compiles them and gives a permanent URL for the generated schema, which is stored forever.

The permanent url could be of the form:

http://standard-schemas.open-contracting.org/1__2__0/release-schema.json?bids=v1.1.5&budget=master

This will be cached on the service for a period. Doing it this way means the service will not have to actually store any new urls permanently (which would be a risk for example if there is data loss) as the schemas can be regenerated if needed.
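
To make the "regenerate if needed" idea concrete, here is a sketch of how such a URL could be built and decoded. The host and path layout are copied from the example URL above; the function names are hypothetical, and no such service exists yet.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumption: host taken from the example URL in this comment.
BASE = "http://standard-schemas.open-contracting.org"


def schema_url(ocds_version, extensions, filename="release-schema.json"):
    """Build a URL from an OCDS version and an {extension: version} mapping."""
    # Sorting makes the URL deterministic, so the same selection of
    # extensions always maps to the same cache entry.
    query = urlencode(sorted(extensions.items()))
    return f"{BASE}/{ocds_version}/{filename}?{query}"


def parse_schema_url(url):
    """Recover (ocds_version, filename, extensions) from such a URL."""
    parts = urlsplit(url)
    ocds_version, filename = parts.path.strip("/").split("/", 1)
    return ocds_version, filename, dict(parse_qsl(parts.query))
```

Because everything needed to rebuild the schema is in the URL itself, the service only needs a cache rather than permanent storage.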

The other benefit of having this service is that we know that the extended schema is actually compliant with OCDS (as everything that runs through the service would be). Otherwise, if a publisher linked to their own schema, they could make the schema non-compliant with core OCDS, and we would then need to find a way to test that.

Without this service I think just having the extension list on the release level would be acceptable as well but not ideal.

kindly commented 4 years ago

Having codelist compilation outside the DRT would be really beneficial too.

So we would also need something like: http://standard-schemas.open-contracting.org/1__2__0/codelists.zip?bids=v1.1.5&budget=master

jpmckinney commented 4 years ago

Yes, the ProfileBuilder can do that work; it's what's used to patch schema and codelists for OCDS profiles (example output).

Building such a service makes sense to me. I'm hesitant about adding more infrastructure to the standard, but we can make it easily deployable (e.g. with a "Deploy to Heroku" button – not sure if any other PaaS offer something similar), so that anyone can host the service, so there isn't a single point of failure.

Another option would be to still require publishers to host the schema and codelist files, but for that schema file to be easily validated, e.g. it references the OCDS version and extensions it uses. The URL of the schema file can then be provided to a validation service, which reports whether the schema file matches what the above service would have generated (maybe excluding metadata properties like title and description so that it just checks the validation properties are as expected).
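
The comparison described above could look roughly like the following sketch. It assumes the only metadata to ignore is title and description, and it takes care not to drop fields that are themselves named "title" or "description", which OCDS has.

```python
# Assumption: these are the only annotation keywords to ignore.
METADATA_KEYS = {"title", "description"}

# Under these keywords, the keys are field or definition names, not keywords.
NAME_MAPS = {"properties", "definitions", "patternProperties"}


def strip_metadata(schema):
    """Return a copy of a schema without title/description annotations."""
    if isinstance(schema, list):
        return [strip_metadata(item) for item in schema]
    if not isinstance(schema, dict):
        return schema
    result = {}
    for key, value in schema.items():
        if key in METADATA_KEYS:
            continue
        if key in NAME_MAPS and isinstance(value, dict):
            # Keep every name here, even one called "title".
            result[key] = {name: strip_metadata(sub) for name, sub in value.items()}
        else:
            result[key] = strip_metadata(value)
    return result


def same_validation(hosted_schema, generated_schema):
    """Report whether two schemas agree once annotations are ignored."""
    return strip_metadata(hosted_schema) == strip_metadata(generated_schema)
```

A validation service would call `same_validation` with the publisher's hosted file and the schema it generates from the declared OCDS version and extensions.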

kindly commented 4 years ago

Perhaps we say that the publishers should host the schema and codelist files when publishing to production, but this service could be there to:

This means the service could be self-hosted, and the more permanent $schema URLs would not rely on this service being up.

The other option is for this serv

jpmckinney commented 4 years ago

@kindly Your last sentence seems to be cut off?

kindly commented 4 years ago

@jpmckinney oops.

I was going to say that we could have a way for the schema/codelist files to be uploaded to a service like S3 and stored permanently, which could be owned by OCP. This would mean that the service itself would not need very good uptime/redundancy, but the results would have it. The cost of this is likely very small, but it would mean a potentially unknown permanent cost and may require some management of who can upload to it. Nonetheless, this could be the easiest route for publishers, without OCP having to worry about the uptime/redundancy of a service.

jpmckinney commented 4 years ago

Sounds good to me! Once a PR is made for this issue, I'll create a follow-up issue in https://github.com/open-contracting/extension_registry.py, and another issue somewhere for creating this new service (maybe it's just another functionality of Toucan). This is in addition to all the other issues that will be created for a change in packaging.

yolile commented 4 years ago

Having the patched schema with all the extensions hosted somewhere will be very useful when using the flatten tool with the --use-titles feature. Also, if a publisher wants to document all the fields that they are using, including extensions, it will be easier for them to use the mapping-sheet command from ocdskit or toucan to create a data dictionary of their publication.

yolile commented 4 years ago

Although, isn't this kind of in conflict with #1084?

jpmckinney commented 4 years ago

Although, isn't this kind of in conflict with #1084?

What is the conflict with #1084? The $schema field will appear on each release, not in the package.

yolile commented 4 years ago

Great! No conflicts then. Based on https://github.com/open-contracting/standard/issues/426#issue-207227238, I thought that the $schema field would be at the package level. Maybe we should update the issue title to "Add $schema field to release schema and contracting data".

jpmckinney commented 4 years ago

Ah, we do also want a $schema field on the schema files (see #566). The issue description gives an example where $schema is on the package, but in this issue we've discussed putting it on the release only.

That said, I've re-read the JSON Schema specifications (04, latest), and $schema is explicitly and narrowly for "meta-schema" (that is, schema for validating schema) and it must be at the top-level. So, $schema is the correct field for #566, which doesn't interact with this issue.

Related to this issue, the 04 and latest versions of JSON Schema both recommend using Content-Type and Link headers to reference the schema (not the meta-schema) that a JSON file follows.

However, in the use cases we've witnessed, data might be downloaded and stored for later analysis, and the response headers are unlikely to be stored. It seems simpler for users if publishers reference the schema in the data itself. However, to avoid confusion/overlap with $schema, which has specific semantics, we can maybe use a plain schema field.

Of course, if a publisher is capable, they should set those headers when returning JSON data.

The latest JSON Schema draft has useful considerations around how servers should return, and how clients should request, schema files, to limit repeated network traffic for the same file. This will be especially relevant, since a package can contain thousands of releases, each with an identical schema field, and we wouldn't want that to cause thousands of requests.
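
On the client side, the simplest guard against repeated requests is to remember each schema by URL. The sketch below shows only that memoization, not the HTTP-level caching the draft discusses; the plain "schema" field name follows the suggestion above, and the class and function names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download and decode a JSON document."""
    with urlopen(url) as response:
        return json.load(response)


class SchemaCache:
    """Fetch each schema URL at most once per process."""

    def __init__(self, fetch=fetch_json):
        self._fetch = fetch
        self._schemas = {}

    def get(self, url):
        if url not in self._schemas:
            self._schemas[url] = self._fetch(url)
        return self._schemas[url]


def schemas_for(releases, cache, field="schema"):
    """Return the schema for each release that declares one."""
    return [cache.get(release[field]) for release in releases if field in release]
```

With this in place, a package of thousands of releases that all carry the same schema URL triggers a single download. The `fetch` argument is injectable so that the cache can be exercised without network access.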

jpmckinney commented 4 years ago

Actually, building on #928, it might be best to do:

{
  "links": [
    {
      "rel": "describedby",
      "href": "https://..."
    }
  ]
}
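
A consumer could read that link with something like the sketch below. It assumes the links array has the shape shown above; the function name is hypothetical.

```python
def describedby_urls(release):
    """Return the href of every "describedby" link on a release."""
    links = release.get("links", [])
    if not isinstance(links, list):
        return []
    return [
        link["href"]
        for link in links
        if isinstance(link, dict)
        and link.get("rel") == "describedby"
        and isinstance(link.get("href"), str)
    ]
```

Releases without a links field, or without a describedby entry, yield an empty list, so existing data keeps working.
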

duncandewhurst commented 3 years ago

I agree that it sounds sensible to use links. Is any further discussion or consultation required before preparing a PR?

jpmckinney commented 3 years ago

#928 would add the links field. For this issue, we'd also have to author tools to help publishers generate a patched schema (otherwise, describedby is very unlikely to be used). We'd also need to update tools to use this value instead of patching the release schema with extensions. We don't yet know whether we have the capacity to do that, so this issue might be postponed to a future version.

jpmckinney commented 1 year ago

Moving to 1.3.0/2.0.0 as we don't have the capacity to assist this transition with tooling, etc.