Open irwink opened 7 years ago
In OCDS 1.1 (see #301) we were planning to handle this with two properties:
Although I note that, according to JSON Schema, the $schema keyword can be used for both version and schema declaration.
The reasons, I believe, for diverging from JSON Schema here were:
Many validators dereference the remote $schema by default, which can be frustrating for local development and validation against a local schema;
$schema only allows a single value, not an array of values.
But, other views on this welcome.
Assigning to @kindly and @bjwebb to have a quick glance at whether we should alter the OCDS 1.1 approach before we're committed to it too strongly.
The suggestions in issue #301 would handle the extensions problem, however, an application would have to "know" that a JSON file claims to conform to the Open Contracting standard and would also have to know where the schema is located (to validate against it). Furthermore, if the JSON file repository contains a mixture of Open Contracting files and other non Open Contracting files, there is no predictable way to distinguish them. The use of a "$schema" field (or some other widely adopted equivalent) would provide an explicit schema reference (similar to a DOCTYPE declaration in a web page).
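As a minimal sketch of how a consumer might use such a declaration to sort a mixed set of files: the `"$schema"` key follows the suggestion in this thread, while `OCDS_PREFIX` and `is_ocds` are hypothetical names chosen for illustration and are not part of any OCDS tooling.

```python
import json

# Assumption: OCDS schema URLs share this prefix (based on the example URL in this issue).
OCDS_PREFIX = "http://standard.open-contracting.org/schema/"


def is_ocds(document: dict) -> bool:
    """Return True if the document declares a schema under the OCDS prefix."""
    declared = document.get("$schema")
    return isinstance(declared, str) and declared.startswith(OCDS_PREFIX)


# Two in-memory documents standing in for files in a mixed repository.
documents = [
    json.loads('{"$schema": "http://standard.open-contracting.org/schema/1__0__1/release-schema.json"}'),
    json.loads('{"name": "not a contracting file"}'),
]
ocds_documents = [d for d in documents if is_ocds(d)]
```

Without an explicit declaration, the same check would have to rely on guessing from field names.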
In the same spirit, it would be useful to have a field similar to $schema, but for extensions. As implementations get more complex, and as multiple extensions are used, it would be useful to have a reference to all of them somewhere. Maybe the $extension would be a closed codelist of the official OCDS extensions.
Regarding extensions, couldn't $schema be a URL of a release schema that has been patched with the relevant extensions? The value of $schema in this case would not be useful for identifying the version of OCDS, but the purpose of $schema in JSON Schema is for validation, not for version identification.
Copying comment from https://github.com/open-contracting/infrastructure/pull/89
Regarding versioning, this might be better handled by using the $schema property, which is part of JSON Schema. That property is standardized, and thus has a lot of existing tooling that understands it, and can use it to perform JSON Schema validation.
I think the use of the $schema flag is a good idea, and it is really good for validators not to need to json-merge-patch the extensions themselves. However, I am also worried about the publishers' ability to do this compilation and to host a version of a new schema.
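For readers unfamiliar with the json-merge-patch step, here is a minimal sketch of JSON Merge Patch as defined in RFC 7386. The `core` and `extension` objects are hypothetical and heavily simplified; this is not the implementation used by any OCDS tool.

```python
import copy


def merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7386) and return the patched value."""
    if not isinstance(patch, dict):
        # A non-object patch replaces the target entirely.
        return copy.deepcopy(patch)
    # Work on a shallow copy so the caller's target is not mutated.
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            # null in a merge patch means "remove this member".
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


# Hypothetical, simplified schema and extension, for illustration only.
core = {"properties": {"tender": {"type": "object", "title": "Tender"}}}
extension = {"properties": {"bids": {"type": "object"}}}
patched = merge_patch(core, extension)
```

A validator that receives an already patched schema can skip this step entirely, which is the benefit described above.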
So in order to do this well I think we will need to host some kind of service that creates the extended schema for the publishers.
That is, a tool in which you select a set of extensions from the extension explorer, and which then compiles them and gives a permanent URL for the generated schema, which is stored forever.
The permanent url could be of the form:
http://standard-schemas.open-contracting.org/1__2__0/release-schema.json?bids=v1.1.5&budget=master
This will be cached on the service for a period. Doing it this way means the service will not have to actually store any new urls permanently (which would be a risk for example if there is data loss) as the schemas can be regenerated if needed.
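A rough sketch of why such URLs allow regeneration: the OCDS version and every extension version are recoverable from the URL itself. The base URL and path layout are taken from the example above and are assumptions (no such service exists yet); `schema_url` and `extensions_from_url` are hypothetical helper names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumption: host and path layout copied from the example URL in this comment.
BASE = "http://standard-schemas.open-contracting.org"


def schema_url(ocds_version: str, extensions: dict) -> str:
    """Build a deterministic URL for a schema patched with the given extensions."""
    version_path = ocds_version.replace(".", "__")
    # Sort so the same set of extensions always yields the same URL.
    query = urlencode(sorted(extensions.items()))
    return f"{BASE}/{version_path}/release-schema.json?{query}"


def extensions_from_url(url: str) -> dict:
    """Recover the extension-to-version mapping needed to regenerate the schema."""
    return dict(parse_qsl(urlsplit(url).query))


url = schema_url("1.2.0", {"budget": "master", "bids": "v1.1.5"})
```

Note that a moving reference such as `master` would make regeneration non-reproducible over time, unlike a tagged version such as `v1.1.5`.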
The other benefit of having this service is that we know that the extended schema is actually compliant with OCDS (as everything that runs through the service would be). Otherwise, if a publisher linked to their own schema, they could make the schema non-compliant with core OCDS, and we would then need to find a way to test that.
Without this service I think just having the extension list on the release level would be acceptable as well but not ideal.
Having codelist compilation outside the DRT would be really beneficial too.
So we would also need something like:
http://standard-schemas.open-contracting.org/1__2__0/codelists.zip?bids=v1.1.5&budget=master
Yes, the ProfileBuilder can do that work; it's what's used to patch schema and codelists for OCDS profiles (example output).
Building such a service makes sense to me. I'm hesitant about adding more infrastructure to the standard, but we can make it easily deployable (e.g. with a "Deploy to Heroku" button – not sure if any other PaaS offer something similar), so that anyone can host the service, so there isn't a single point of failure.
Another option would be to still require publishers to host the schema and codelist files, but for that schema file to be easily validated, e.g. it references the OCDS version and extensions it uses. The URL of the schema file can then be provided to a validation service, which reports whether the schema file matches what the above service would have generated (maybe excluding metadata properties like title and description, so that it just checks the validation properties are as expected).
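One way such a comparison could work, as a sketch only: strip the metadata keywords from both schemas and compare what remains. The function names are hypothetical. Because OCDS objects can themselves have properties named `title` and `description`, the sketch avoids stripping keys inside name maps such as `properties`.

```python
METADATA_KEYS = {"title", "description"}
# Keywords whose values map arbitrary names to subschemas; names there are kept.
NAME_MAPS = {"properties", "definitions", "patternProperties"}


def strip_metadata(node, names=False):
    """Return a copy of a schema without title/description keywords."""
    if isinstance(node, dict):
        return {
            key: strip_metadata(value, names=(not names and key in NAME_MAPS))
            for key, value in node.items()
            # Inside a name map, every key is a property name, so keep it.
            if names or key not in METADATA_KEYS
        }
    if isinstance(node, list):
        return [strip_metadata(item) for item in node]
    return node


def same_validation(published: dict, expected: dict) -> bool:
    """Compare two schemas while ignoring metadata keywords."""
    return strip_metadata(published) == strip_metadata(expected)
```

This is a structural comparison; it does not attempt to detect schemas that are written differently but validate identically.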
Perhaps we say that the publishers should host the schema and codelist files when publishing to production, but this service could be there to:
$schema URL that should work whilst iterating on the data. This is so that they do not have to compile and host a new version of the schema/codelists for every extension change in order for validation to work correctly. This means the service could be self-hosted, and the more permanent $schema URLs do not rely on this service to be running.
The other option is for this serv
@kindly Your last sentence seems to be cut off?
@jpmckinney oops.
I was going to say that we could have a way for the schema/codelists files to be uploaded to a service like s3 and stored permanently which could be owned by OCP. This would mean that the service itself would not need very good uptime/redundancy but the results should have it. The cost of this is likely very small, but would mean a potentially unknown permanent cost and may require some management on who could upload to it. Nonetheless, this could be the easiest route for publishers without OCP having to worry about uptime/redundancy of a service.
Sounds good to me! Once a PR is made for this issue, I'll create a follow-up issue in https://github.com/open-contracting/extension_registry.py, and another issue somewhere for creating this new service (maybe it's just another functionality of Toucan). This is in addition to all the other issues that will be created for a change in packaging.
Having the patched schema with all the extensions hosted somewhere will be very useful when using the flatten tool with the --use-titles feature. Also, if a publisher wants to document all the fields that they are using, including extensions, it will be easier for them to use the mapping-sheet command from ocdskit or toucan to create a data dictionary of their publication.
Although, isn't this kind of in conflict with #1084?
What is the conflict with #1084? The $schema field will appear on each release, not in the package.
Great! No conflicts then. Based on https://github.com/open-contracting/standard/issues/426#issue-207227238, I thought that the $schema field would be at the package level. Maybe we should update the issue to "Add $schema field to release schema and contracting data".
Ah, we do also want a $schema field on the schema files (see #566). The issue description gives an example where $schema is on the package, but in this issue we've discussed putting it only on the release.
That said, I've re-read the JSON Schema specifications (04, latest), and $schema is explicitly and narrowly for "meta-schema" (that is, schema for validating schema) and it must be at the top level. So, $schema is the correct field for #566, which doesn't interact with this issue.
Related to this issue, the 04 and latest versions of JSON Schema both recommend using Content-Type and Link headers to reference the schema (not the meta-schema) that a JSON file follows.
However, in the use cases we've witnessed, data might be downloaded and stored for later analysis, and the response headers are unlikely to be stored. It seems simpler for users if publishers reference the schema in the data itself. However, to avoid confusion/overlap with $schema, which has specific semantics, we can maybe use a plain schema field.
Of course, if a publisher is capable, they should set those headers when returning JSON data.
The latest JSON Schema draft has useful considerations around how servers should return, and how clients should request, schema files, to limit repeated network traffic for the same file. This will be especially relevant, since a package can contain thousands of releases, each with an identical schema field, and we wouldn't want that to cause thousands of requests.
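On the client side, one simple mitigation is to remember each schema by URL within a run, so that identical values trigger a single fetch. This is only a sketch of in-process memoization, not the HTTP-level caching the JSON Schema draft discusses; the `schema` field name follows the proposal above, and `SchemaCache`, `fetch_json` and `schemas_for` are hypothetical names.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download and parse a JSON document."""
    with urlopen(url) as response:
        return json.load(response)


class SchemaCache:
    """Fetch each distinct schema URL at most once."""

    def __init__(self, fetch=fetch_json):
        self._fetch = fetch
        self._schemas = {}

    def get(self, url):
        if url not in self._schemas:
            self._schemas[url] = self._fetch(url)
        return self._schemas[url]


def schemas_for(releases, cache):
    """Yield (release, schema) pairs, reusing already fetched schemas."""
    for release in releases:
        yield release, cache.get(release["schema"])
```

Passing the fetch function in makes it straightforward to substitute a local file loader during development.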
Actually, building on #928, it might be best to do:
{
  "links": [
    {
      "rel": "describedby",
      "href": "https://..."
    }
  ]
}
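A consuming tool could then look up the schema reference roughly as follows. This is a sketch that assumes the `links` structure shown above; `described_by` is a hypothetical helper, and the URL in the example is a placeholder.

```python
def described_by(release: dict):
    """Return the href of the first "describedby" link, or None if absent."""
    for link in release.get("links", []):
        if link.get("rel") == "describedby":
            return link.get("href")
    return None


# Placeholder release for illustration; the href is not a real schema URL.
release = {
    "links": [
        {"rel": "describedby", "href": "https://example.org/release-schema.json"}
    ]
}
schema_href = described_by(release)
```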
I agree that it sounds sensible to use links. Is any further discussion or consultation required before preparing a PR?
links field. For this issue, we'd also have to author tools to help publishers generate a patched schema (describedby is very unlikely to be used, otherwise). We'd also need to update tools to use this value instead of patching the release schema with extensions. We don't yet know whether we have the capacity to do that, so this issue might be postponed to a future version.
Moving to 1.3.0/2.0.0 as we don't have the capacity to assist this transition with tooling, etc.
Edit: This issue effectively starts at https://github.com/open-contracting/standard/issues/426#issuecomment-718963839
$schema is meant for the "metaschema", not for the "schema". The linked comment proposes using a describedby field to link to the schema.
I suggest that a "$schema" field be added to all contracting data files. The "$schema" field's value would be either a single URI of the schema that the data claims to conform to, or a list of schemas (e.g. if the data conforms to the OCDS and an extension schema). This would be very useful from a quality assurance perspective, as well as for parsing and consuming the contracting data. Programs would know which schema the data conforms to and how to properly parse it. This is especially useful if the data repository contains a mix of data files that conform to the OCDS, an extension schema or some other schema.
An example would be (using the Paraguay sample data)
{
  "uri": "https://www.contrataciones.gov.py/datos/record-package/273637.json",
  "$schema": "http://standard.open-contracting.org/schema/1__0__1/release-schema.json",
  "publisher": {
    "uri": "https://contrataciones.gov.py/datos",
    "legalName": "Dirección Nacional de Contrataciones Públicas, Paraguay",
    "name": "DNCP - Paraguay"
  },