usnistgov / oar-pdr

The NIST Open Access to Research (OAR) Public Data Repository (PDR) system software

NERDm records not matching the schema cause an error only at the render step #305

Open · bmaranville opened this issue 1 year ago

bmaranville commented 1 year ago

We made some (nominally) NERDm records and uploaded them to the dev API, and they were accepted. It was only when trying to render them with the angular server /lps endpoint that we discovered a problem: the page returned an error and did not render the record. The error trace said something about a value being undefined.

After some slow investigation and a look at the schema, I found that (a) the problem was caused by us not complying with the schema, and (b) specifically, our records contained the top-level key "keywords": string[] instead of the required "keyword": string[].
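To make that concrete, an abbreviated sketch of the offending record versus a conforming one (all values here are placeholders; only the property name differs):

```python
# Abbreviated NERDm record fragments; values are placeholders for illustration.
submitted = {
    "keywords": ["term1", "term2"],  # what we sent: plural key, not in the schema
}
conforming = {
    "keyword": ["term1", "term2"],   # what the schema requires
}
```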

We have already made the change to use the key keyword on our end, but I wonder where the appropriate place to validate the records is. Render time is not ideal. Would it make sense to validate as part of ingestion, returning an error if the record does not match the schema? That feels like a better guarantee of consistency than asking submitters to validate before a PUT/PATCH on a record.
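For illustration, a minimal sketch of what ingestion-time validation could look like, using the Python jsonschema package; the function name, schema path, and error handling are assumptions for the sketch, not the PDR's actual code:

```python
import json

import jsonschema

def ingest(record: dict, schema_path: str = "nerdm-schema.json") -> None:
    """Reject a record at ingestion time if it does not match the schema."""
    with open(schema_path) as f:
        schema = json.load(f)
    validator = jsonschema.Draft7Validator(schema)
    problems = [err.message for err in validator.iter_errors(record)]
    if problems:
        # In the web service, this would map to an HTTP 400 response
        # returned from the PUT/PATCH instead of a silent accept.
        raise ValueError("record does not match schema: " + "; ".join(problems))
    # ...otherwise proceed to store the record...
```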

RayPlante commented 1 year ago

The service validates records at two points: when a NERDm record is submitted, and when the service receives the "publish" action (actually, it may be the "finalize" action, which is implied by "publish"). The former validation is a relaxed version that allows required properties (like "keyword") to be missing; this lets one upload an incomplete record and then fill in the missing information with later updates. The latter validation is fully strict.
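A rough sketch of that two-stage check as described, built on the Python jsonschema package (an illustration, not the service's actual implementation): the relaxed pass drops only "required" violations, while the strict pass reports everything.

```python
import jsonschema

def validation_errors(record: dict, schema: dict, strict: bool) -> list[str]:
    errors = list(jsonschema.Draft7Validator(schema).iter_errors(record))
    if not strict:
        # Relaxed pass (submission): tolerate missing required properties so
        # an incomplete record can be uploaded and completed in later updates.
        errors = [e for e in errors if e.validator != "required"]
    return [e.message for e in errors]

# At submission:        validation_errors(record, schema, strict=False)
# At finalize/publish:  validation_errors(record, schema, strict=True)
```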

You have hit upon one gotcha: the schemas, on purpose, allow additional properties to be included that are not part of any of the defined schemas; this is a key extension feature. On the other hand, the service (i.e., independently of the schema) will, I'm pretty sure, silently filter out unrecognized properties. As a result, if you misspell a property, the mistake goes undetected (as you demonstrated).
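That combination is easy to demonstrate with a toy schema in the style of the NERDm ones (an illustration, not the actual NERDm schema): because extra properties are allowed, a misspelled name validates as a legal extension property, and the only remaining signal is a missing-required-property error.

```python
import jsonschema

# Toy schema: "keyword" is required, and extensions are allowed (as in NERDm).
schema = {
    "type": "object",
    "properties": {"keyword": {"type": "array", "items": {"type": "string"}}},
    "required": ["keyword"],
    "additionalProperties": True,
}

record = {"keywords": ["term1", "term2"]}  # note the extra "s"

for err in jsonschema.Draft7Validator(schema).iter_errors(record):
    print(err.message)  # only: 'keyword' is a required property
```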

I should note something about the landing page service: while it should not have complained about the missing "keyword" data the way it did, we had not originally intended to run the landing page service with the publishing service, as the latter was meant to be fully automatic, without support for human interaction. While we now plan to make the landing page service available for these records, we haven't actually set that up yet.

That said, I think we can improve the service to catch near-misses in property-name spellings such as this one (perhaps by rejecting records containing properties we would otherwise filter out).
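One way such a near-miss check might look, using difflib from the Python standard library to flag unrecognized properties that closely resemble a defined one (the property list and similarity threshold here are illustrative assumptions):

```python
import difflib

# Illustrative subset of defined NERDm property names.
DEFINED = ["keyword", "title", "description", "contactPoint", "landingPage"]

def near_misses(record: dict) -> dict[str, str]:
    """Map each unrecognized property to the defined property it resembles."""
    hits = {}
    for name in record:
        if name not in DEFINED:
            close = difflib.get_close_matches(name, DEFINED, n=1, cutoff=0.8)
            if close:
                hits[name] = close[0]
    return hits

# near_misses({"keywords": ["term1"]})  ->  {"keywords": "keyword"}
```

The service could then reject the record, or at least warn, whenever this map is non-empty, rather than silently dropping the property.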

bmaranville commented 1 year ago

Ah - I think I understand. I'm not able to test our workflow against the second validation (on "finalize") without setting up an entire mock PDR database, right?

Should we have expected to see errors returned from the test endpoint on "finalize"? We did publish some of our malformed records there and didn't get any errors back.

RayPlante commented 1 year ago

Actually, "finalize" should work. I was just looking at the code, and now I have a question about whether the intended validation actually gets run as it should. Nevertheless, that is the point of finalize: it makes all the remaining tweaks to the record so that it looks fully like what will get published, and then checks everything. If there is a problem, the client has one last opportunity to fix it (which normally wouldn't happen in a fully automated workflow).

I should correct two misstatements I made above: (1) the validation done when NERDm metadata is submitted is not relaxed, and (2) unrecognized properties are not filtered out. I was confusing this service with another one we are developing that places less trust in the client. Regardless, since this service does allow custom properties, it will miss misspellings like "keywords".