iaincollins / structured-data-testing-tool

A library and command line tool to help inspect and test for Structured Data.
https://www.npmjs.com/package/structured-data-testing-tool
ISC License

Support for validating schema properties #14

Open iaincollins opened 4 years ago

iaincollins commented 4 years ago

Summary of proposed feature

Schema properties should be checked for validity.

Purpose of proposed feature

Currently properties are only checked to see if they exist, and not if the value they contain is valid.

Detail of the proposed feature

The value of properties should be checked.

This may include primitive types (strings, numbers) as well as specific types (dates, URLs) and complex objects (including nested types).
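As an illustration of the kinds of checks described above, here is a minimal sketch of value validators for a few primitive and easily checkable types. The names `validators` and `isValidValue` are hypothetical and not part of this library's API, and the date check only accepts a plain `YYYY-MM-DD` string, which is an assumption about what "valid" would mean here.

```javascript
// Hypothetical validators keyed by Schema.org data type name.
// None of these names exist in structured-data-testing-tool.
const validators = new Map([
  ['Text', (value) => typeof value === 'string' && value.trim().length > 0],
  ['Number', (value) => typeof value === 'number' && Number.isFinite(value)],
  ['URL', (value) => {
    if (typeof value !== 'string') return false
    try {
      // The global URL constructor throws on strings it cannot parse as absolute URLs.
      new URL(value)
      return true
    } catch (e) {
      return false
    }
  }],
  // Assumption: only a simple YYYY-MM-DD calendar date counts as a valid Date.
  ['Date', (value) =>
    typeof value === 'string' &&
    /^\d{4}-\d{2}-\d{2}$/.test(value) &&
    !Number.isNaN(Date.parse(value))],
])

// A property may accept several types; the value is valid if any one of them matches.
// Unknown type names are treated as "cannot validate" and never match.
function isValidValue(value, expectedTypes) {
  return expectedTypes.some((type) => {
    const check = validators.get(type)
    return check ? check(value) : false
  })
}
```

Treating a value as valid when any expected type matches is one way to approach properties that can be one of many types; nested objects would need a recursive check that this sketch does not attempt.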

Potential problems

As outlined in the milestone for version 5.0, doing this for all schemas is expected to involve extending the schema.org scraper, writing a parser to handle the scraping, and using meta-programming to create tests that apply validation rules to properties.

Initial versions may include simple handling for primitive types and easily checkable types, but supporting complex types and properties that can be one of many types will be more difficult and support for that will likely come later. There may be edge cases it is not practical to support.

Describe any alternatives you've considered

It would be nice to have a list of valid templates that are parsable (e.g. in JSON Schema format) but I have not been able to find a suitable library of these and it does not appear there is a list of them published by Schema.org.

Additional context

Is there value in creating JSON Schema profiles, as something other people could reuse?

This would require extra work to integrate schema validation into this tool, but that is something I am familiar with from other projects.

raffaelj commented 4 years ago

and it does not appear there is a list of them published by Schema.org.

Have a look at https://schema.org/docs/developers.html and scroll down to "Vocabulary Definition Files" if you didn't find that before.

"all-layers" seems to cover all https://schema.org/version/6.0/all-layers.jsonld

and "schema" covers a lot http://schema.org/version/latest/schema.jsonld

I didn't look through the whole json strings, but I did a quick search for WebSite and WebPage properties. They still need to be converted into a parsable string with an iteration over the domainIncludes and rangeIncludes properties.
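The iteration described above could look roughly like the following sketch. It assumes each entry in the file's `@graph` array is a node with an `@id`, and that property nodes carry `schema:domainIncludes` and `schema:rangeIncludes` values that are either a single `{ "@id": ... }` reference or an array of them. The exact key names and prefixes may differ between versions of the vocabulary files, so the keys (passed as parameters) and the inline sample data are assumptions.

```javascript
// Normalise a JSON-LD value that may be missing, a single reference, or an array of references.
function toIds(value) {
  if (value === undefined || value === null) return []
  const list = Array.isArray(value) ? value : [value]
  return list.map((ref) => ref['@id'])
}

// Build { typeId: { propertyId: [expected value types] } } from a JSON-LD graph.
function buildPropertyMap(
  graph,
  domainKey = 'schema:domainIncludes', // assumed key name
  rangeKey = 'schema:rangeIncludes' // assumed key name
) {
  const map = {}
  for (const node of graph) {
    const domains = toIds(node[domainKey])
    if (domains.length === 0) continue // no domain, so not a property definition
    const ranges = toIds(node[rangeKey])
    for (const typeId of domains) {
      if (!map[typeId]) map[typeId] = {}
      map[typeId][node['@id']] = ranges
    }
  }
  return map
}

// Tiny hand-written sample in the assumed shape (not copied from the schema.org files).
const sampleGraph = [
  { '@id': 'schema:WebSite', '@type': 'rdfs:Class' },
  {
    '@id': 'schema:issn',
    '@type': 'rdf:Property',
    'schema:domainIncludes': [{ '@id': 'schema:WebSite' }, { '@id': 'schema:Blog' }],
    'schema:rangeIncludes': { '@id': 'schema:Text' },
  },
]

const propertyMap = buildPropertyMap(sampleGraph)
console.log(propertyMap['schema:WebSite']) // { 'schema:issn': [ 'schema:Text' ] }
```

This only records properties declared directly on a type; properties inherited from parent types would need a second pass over the class hierarchy.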

iaincollins commented 4 years ago

Hey there, I've actually been working on this over the last couple of days!

The data for mapping properties to types is sourced from the following files (the latter two are for schemas and properties that are not yet final / still in draft stage):

I've added checks for whether a Schema.org property name is valid (passes), invalid (invalid) or valid but still in draft (warning). It does not yet support nested properties or check values.
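A minimal sketch of that three-way check, assuming the final and draft vocabularies have each been loaded into a lookup from type name to a set of property names. The function and variable names are hypothetical, the sample data is illustrative only, and the result labels simply mirror the pass / invalid / warning wording above; this is not the tool's actual code.

```javascript
// Hypothetical lookups: type name -> Set of property names.
// The draft entry below is a made-up placeholder, not a real Schema.org property.
const finalProperties = new Map([
  ['Article', new Set(['headline', 'author', 'datePublished'])],
])
const draftProperties = new Map([
  ['Article', new Set(['exampleDraftProperty'])],
])

// Classify a property name for a given type.
function checkPropertyName(type, property) {
  if (finalProperties.get(type)?.has(property)) return 'pass'
  if (draftProperties.get(type)?.has(property)) return 'warning'
  return 'invalid'
}
```

Checking the final vocabulary first means a property that has graduated from draft is reported as a pass even if a stale draft list still mentions it.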

I will likely tackle nested properties then value checks, as the behaviour for testing nested values will likely impact type checking if I do it the other way round.

I've started expanding the tests too. Adding this actually found real errors in a couple of the example schemas (such as invalid properties on test schemas).

raffaelj commented 4 years ago

Thanks for the CSV sources. I wasn't aware of that GitHub repository.

I would suggest transforming the CSV files into a SQLite database with relations, instead of parsing the CSV files all over again and splitting the results on ', '.

iaincollins commented 4 years ago

Hey there! After a few weeks' break I've followed up and done some more work on this for 5.x.

It's in master but is not released to npm yet; I'd like to continue to work on it for a while and do some refactoring to clean things up first.

I'll probably add validation for nested properties, think about whether I want to tackle validating the content of properties in 5.0 (or save it for an update in 5.1) and think about some of the edge cases that can arise (e.g. referring to other schema objects on the same page using @id, which I've written up in #21).

I suspect I'll try to do nested validation for properties, just to indicate whether they are valid or invalid for a given schema (this is something I worked on in similar projects), then do more code clean-up and cut a release before I go much further, i.e. before I add things like actually validating the value of properties, which I think will be in 5.1.

I'm actually inclined to avoid using SQLite, as it creates a dependency that can be awkward (it's a super useful library, but somewhat heavyweight and can have breaking changes). I've run into issues with it before; I normally think it's great, but I'm not as keen to introduce it as a dependency for a library as I would be in an application.

Right now the performance hit of CSV is negligible, but in future I'll probably end up transforming the CSV files to JSON when they are imported (and writing an import script), as JSON is extremely fast to work with and has no dependencies, while also being easy to diff.
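A sketch of what such an import step could look like: it takes rows already parsed from CSV (by whatever CSV parser the import script ends up using) and produces a plain JSON-serialisable index, so the split on ', ' happens once at import time rather than on every lookup. The column names `label` and `properties` and the sample rows are assumptions for illustration, not a description of the real source files.

```javascript
// Convert parsed CSV rows into { typeLabel: [propertyId, ...] }.
// Assumes each row has a `label` column and a `properties` column
// holding a ', '-separated list (both column names are hypothetical).
function rowsToIndex(rows) {
  const index = {}
  for (const row of rows) {
    const raw = (row.properties || '').trim()
    // Sort each list so regenerated output is stable between imports.
    index[row.label] = raw === '' ? [] : raw.split(', ').sort()
  }
  return index
}

// Illustrative rows in the assumed shape.
const rows = [
  { label: 'WebSite', properties: 'http://schema.org/issn, http://schema.org/about' },
  { label: 'Thing', properties: '' },
]

// Pretty-printed JSON with sorted lists keeps diffs between imports small.
const json = JSON.stringify(rowsToIndex(rows), null, 2)
```

The resulting string can be written to a file by the import script and loaded at runtime without any parsing dependency.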

Will keep posting updates on progress!