seedcase-project / seedcase-sprout

Upload your research data to formally structure it for better, more reliable, and easier research.
https://sprout.seedcase-project.org/
MIT License

Investigate and brainstorm ways of checking against the data package schema #864

Open lwjohnst86 opened 1 week ago

lwjohnst86 commented 1 week ago

For instance, should we run checks against everything in the schema at once, or only against specific parts, etc.?

lwjohnst86 commented 1 week ago

The frictionless R package has a nice design, along with some checks, e.g.: https://github.com/frictionlessdata/frictionless-r/blob/main/R/check_package.R

Could be used for inspiration.

martonvago commented 6 days ago

For metadata, the default behaviour of jsonschema is to run all checks at once. We could customise this either by modifying the schema (e.g. extracting a sub-schema from it) or using the whole schema and filtering out errors we want to ignore.

As for whether there are scenarios where a partial check would be useful/necessary, I think one candidate is when we are building the metadata and it is not yet complete (e.g. before adding resources to a package). Here we presumably want to disable `required` checks for at least some fields.
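A small sketch of both ideas with Python's `jsonschema`, using a hypothetical, simplified stand-in for the Data Package schema: collect every error with `iter_errors`, then filter out the `required` failures for a partial check while the metadata is still being built.

```python
from jsonschema import Draft7Validator

# Hypothetical, simplified stand-in for the Data Package schema.
schema = {
    "type": "object",
    "required": ["name", "resources"],
    "properties": {
        "name": {"type": "string"},
        "resources": {"type": "array"},
    },
}

incomplete = {"name": 123}  # wrong type, and `resources` is still missing

# Default behaviour: iterate over every failure in the whole schema.
all_errors = list(Draft7Validator(schema).iter_errors(incomplete))

# Partial check while building the metadata: ignore `required` failures
# but keep everything else (here, the type error on `name`).
partial_errors = [e for e in all_errors if e.validator != "required"]
```

The same filtering idea extends to any other keyword we want to relax, without touching the schema itself; extracting a sub-schema would be the alternative when the relaxation is more structural.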

martonvago commented 5 days ago

Just for me to keep track of questions:

  1. Do we want separate checks for package and resource properties?
  2. How do we want to handle metadata checks that we do separately from the Data Package standard? We check extra required fields and we check that there's an ID in the path field for resources.
  3. Do we want to collect all errors before throwing an error? Or do we run the well-formed and complete checks "sequentially", returning early on the first error?
  4. What kind of errors should we throw? Maybe keep NotPropertiesError but (1) add an errors attribute to hold error objects (useful for components built on top of core) and (2) add a human-readable summary message?
  5. How correctly do we want to check for blank values in the complete check?
martonvago commented 4 days ago
  1. Maybe there's an argument for not having separate checks for package and resource properties?
    • When we write metadata to datapackage.json, regardless of what is being updated, we always write the whole properties structure, so running one check function makes sense.
    • Then, during one update action, the properties would be checked once, right before being saved. At this point they should be complete and well-formed, so all kinds of checks are expected to pass.
    • But shouldn't we check that whatever metadata we load from datapackage.json is correct/complete? Maybe we actually want to allow users to correct bad/incomplete metadata. In which case we don't want to fail if metadata is incomplete or otherwise incorrect. We only want to make sure that we're getting JSON from datapackage.json. As for extra/custom fields coming from datapackage.json, the standard says those are allowed, so we don't want to fail for those either.
    • Similarly, shouldn't we check that metadata supplied by the user (e.g. when updating) is correct/complete? While this input cannot be expected to be complete (e.g. when updating only one field), it is expected to be otherwise correct. So we could check that it has values of the right type and format. Would it add anything to do these checks at the beginning of the flow, given that they are included in the checks at the end of the flow anyway? The flows are not exactly long or complicated...
    • Running one check is simpler than splitting it up, in terms of implementation.
    • If we wrote separate checks for package and resource properties, we wouldn't combine these checks for a full check, because the full check is the default behaviour. So we would end up with at least 3 check suites.
martonvago commented 4 days ago
  2. Maybe have a function that loads the DP schema and adds our own requirements to it programmatically (either to the JSON schema itself or as custom validators on the validator).
    • Could add them to the schema, either programmatically or just by editing the schema file by hand.
    • Pros: simple; one kind of language used for describing requirements (JSON schema); errors thrown are uniform (ValidationError), which is nice for components built on top of core (e.g. the CLI)
    • Cons: this would create a schema that is a stricter version of the Data Package standard; updating the schema would involve the extra step of adding our extra requirements, although this just means running a simple function
    • Could add them as separate functions, like we do now.
    • Pros: clear separation between our requirements and DP standard; easier to make our checks conditional
    • Cons: more complex code; our requirements would not raise ValidationError, so we would need to have a further layer of abstraction on top of the error classes; as the DP standard has its own required fields, we still wouldn't have a clear split between required-checks and well-formed-checks
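The first option (tightening the schema programmatically) could look roughly like this. The schema dict, the helper name `make_strict_schema`, and the extra field are simplified, hypothetical stand-ins, not the real Data Package schema:

```python
import copy

from jsonschema import Draft7Validator

# Stand-in for the schema loaded from the Data Package standard.
dp_schema = {
    "type": "object",
    "required": ["name"],
    "properties": {
        "name": {"type": "string"},
        "title": {"type": "string"},
    },
}


def make_strict_schema(schema: dict, extra_required: list[str]) -> dict:
    """Return a copy of the schema with our extra required fields added."""
    strict = copy.deepcopy(schema)
    strict["required"] = sorted(set(strict.get("required", [])) | set(extra_required))
    return strict


strict_schema = make_strict_schema(dp_schema, ["title"])

# `title` is now required, so validating a package without it reports an error,
# while the original standard schema stays untouched.
errors = list(Draft7Validator(strict_schema).iter_errors({"name": "pkg"}))
```

Because the result is still an ordinary JSON schema, updating to a newer version of the standard would only mean re-running this function over the new schema file.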
martonvago commented 4 days ago
  3. If we integrate all checks before running jsonschema's validate, then we can collect all errors easily and all errors will be the same type. 🎉
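A sketch of what that buys us, assuming the extra requirements have already been merged into one combined schema (the schema here is a made-up stand-in): a single `iter_errors` pass collects every problem, and each one is a plain `ValidationError`.

```python
from jsonschema import Draft7Validator, ValidationError

# Hypothetical combined schema: DP requirements plus our extras,
# merged before validation so one pass reports everything.
combined_schema = {
    "type": "object",
    "required": ["name", "resources"],
    "properties": {"name": {"type": "string"}},
}

# Two problems (missing `resources`, wrong type for `name`),
# both collected in one go, both uniform ValidationErrors.
errors = list(Draft7Validator(combined_schema).iter_errors({"name": 1}))
```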
martonvago commented 4 days ago
  4. Yes, and NotPropertiesError can expect ValidationErrors and generate a summary of them for the error message.
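One possible shape for that, sketched against a toy schema (the real NotPropertiesError may differ): the raw error objects are kept on an attribute for components built on top of core, and the exception message is a human-readable summary.

```python
from jsonschema import Draft7Validator, ValidationError


class NotPropertiesError(Exception):
    """Hypothetical sketch: wraps the collected jsonschema ValidationErrors."""

    def __init__(self, errors: list[ValidationError]):
        # (1) Keep the raw error objects for components built on top of core.
        self.errors = errors
        # (2) Build a human-readable summary for the message.
        summary = "\n".join(f"- {error.message}" for error in errors)
        super().__init__(f"Found {len(errors)} problem(s) with the properties:\n{summary}")


schema = {"type": "object", "required": ["name"]}
found = list(Draft7Validator(schema).iter_errors({}))
error = NotPropertiesError(found)
```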
martonvago commented 4 days ago
  5. If the checks are integrated into the JSON schema (or into jsonschema's custom validator), it's straightforward to check blank values correctly (i.e. no extra code needed to establish the type of a required field and the corresponding blank value).
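For example, "not blank" can be expressed directly in the schema, so the right notion of "blank" follows from each field's type with no extra code on our side (the schema fragment is made up):

```python
from jsonschema import Draft7Validator

# "Not blank" per type, stated in the schema itself:
# strings must be non-empty, arrays must have at least one item.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "resources": {"type": "array", "minItems": 1},
    },
}

blank = {"name": "", "resources": []}
errors = list(Draft7Validator(schema).iter_errors(blank))
```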
lwjohnst86 commented 4 days ago

I'll try to respond to as many of these questions as I can:

1.

I agree with you that having a single check function that runs against the schema via jsonschema makes the most sense, rather than artificially splitting it.

I think there should be a check on load as well, since it is entirely possible that at some point someone makes a change directly in the file. That gets me thinking: since we can't always guarantee the file will be correct, could we report incorrectly written fields as warnings during the computation/checks, instead of throwing an error, maybe with a suggestion on how to fix them? Something like: "this is a warning for now, but could cause problems later, please fix now with ...".
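The warn-instead-of-raise idea could be sketched like this; the function name, message wording, and schema are all hypothetical:

```python
import warnings

from jsonschema import Draft7Validator

# Hypothetical simplified schema.
schema = {"type": "object", "required": ["name"]}


def check_loaded_properties(properties: dict) -> dict:
    """Warn about problems found on load instead of raising."""
    for error in Draft7Validator(schema).iter_errors(properties):
        warnings.warn(
            f"{error.message} -- this is a warning for now, but could cause "
            "problems later, so please fix it.",
            UserWarning,
        )
    # Hand the properties back unchanged so the user can correct them.
    return properties
```

Specific cases could later be promoted from warnings to hard errors without changing the calling code, only the reporting inside this function.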

Input from the user should definitely be checked, so we can give warnings/errors before the problem propagates. But again, maybe these should be only warnings for now, and we can decide later how to handle specific cases as errors.

2.

Yes, I definitely think we need our own checks. Internally there would be two sets, but externally they would be bundled together. I feel like (but could be wrong) adding our own checks back into the schema would get complicated, compared to doing simple checks for things as we go. But I can't envision very well how this all looks right now.

Could we not use ValidationError for our own custom checks?

3.

I think I like "collect them all and then inform the user" rather than "user runs, gets an error/warning, fixes it, runs again, gets another error, etc.".

4.

Yes!

5.

I don't have enough understanding of this particular thing to have a strong opinion. We can always implement something and as we test it/go along, see how it works.