hyperjump-io / json-schema

JSON Schema Validation, Annotation, and Bundling. Supports Draft 04, 06, 07, 2019-09, 2020-12, OpenAPI 3.0, and OpenAPI 3.1
https://json-schema.hyperjump.io/
MIT License

Error: Encountered unknown dialect 'https://json-schema.org/validation' #37

Closed GabenGar closed 10 months ago

GabenGar commented 1 year ago

Info

NodeJS - 16.20.0 (I know, more on that later)
json-schema - 1.5.1
NextJS - 13.4.19

Repro

This branch basically: https://github.com/GabenGar/todos/tree/unknown-dialect

git clone https://github.com/GabenGar/todos --branch unknown-dialect --single-branch
cd todos
npm run install-all
npm run build

You then get several Error: Encountered unknown dialect 'https://json-schema.org/validation' errors (one per worker, I assume) during the page rendering stage.

Details

I know I don't meet the minimum Node.js requirement, but I've read all the discussions and it looks like that requirement exists only because fetch became global in that version. I assume the error is caused by the package trying to fetch something and failing for whatever reason. However, I assumed that by following these steps:

I'd avoid any network calls at all.

But the crash at the build step (when no validator functions are called or created) means it crashes in the init() function. I assumed that by importing from "@hyperjump/json-schema/draft-2020-12" I'd get all the parts already. What do I have to do to prevent the package from making any network/fs calls and instead crash outright when it can't find something?

jdesrosiers commented 1 year ago

You're getting that error because you're trying to load a schema and haven't declared what dialect of JSON Schema the schema uses. The default is https://json-schema.org/validation and you've only loaded support for 2020-12. Therefore, you get the error that the dialect is unknown. There are no network calls happening in this situation.

To fix this problem, your schemas need to declare the dialect they use with $schema or pass the dialect in the addSchema function. The former is generally considered best-practice.
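The lookup described above can be modeled in a few lines. This is a simplified sketch, not the library's actual implementation: `loadedDialects`, `determineDialect`, and the constant names are hypothetical. The only facts taken from the thread are that the fallback dialect is https://json-schema.org/validation and that only 2020-12 support was loaded.

```javascript
// Hypothetical model of dialect selection; not the library's real code.
const DRAFT_2020_12 = "https://json-schema.org/draft/2020-12/schema";
const FALLBACK_DIALECT = "https://json-schema.org/validation";

// Importing a dialect-specific entry point loads support for that dialect only.
const loadedDialects = new Set([DRAFT_2020_12]);

function determineDialect(schema, defaultDialectId = FALLBACK_DIALECT) {
  // `$schema` in the schema wins; otherwise use the default supplied by the caller.
  const declared = typeof schema === "object" && schema !== null && "$schema" in schema;
  const dialectId = declared ? schema.$schema : defaultDialectId;
  if (!loadedDialects.has(dialectId)) {
    throw new Error(`Encountered unknown dialect '${dialectId}'`);
  }
  return dialectId;
}

const withKeyword = { $schema: DRAFT_2020_12, type: "string" };
const withoutKeyword = { type: "string" };

determineDialect(withKeyword); // dialect declared in the schema
determineDialect(withoutKeyword, DRAFT_2020_12); // dialect supplied by the caller
// determineDialect(withoutKeyword) would throw the error reported in this issue.
```

Either fix works in this model: put `$schema` in the schema, or supply the default dialect where the schema is registered.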

GabenGar commented 1 year ago

pass the dialect in the addSchema function

What is a retrievalUri in the argument and what do I have to put there if I only want to pass defaultDialectId? Also the error should be more explicit in saying it requires $schema key and found none or couldn't figure out the metaschema. I assumed the functions from "@hyperjump/json-schema/draft-2020-12" would automatically assume draft-2020-12 dialect when not provided, but I guess not.

The former is generally considered best-practice.

Don't know about that, it's mainly a noise in the schema collection derived from the same metaschema.

jdesrosiers commented 1 year ago

What is a retrievalUri in the argument and what do I have to put there if I only want to pass defaultDialectId?

You can read about the Retrieval URI concept here. The short version is that you would use the retrievalUri argument if your schema doesn't include $id. In other words, it's an alternate way of associating a schema to a URI. If you don't want to set the retrievalUri and do want to set the defaultDialectId, you can pass undefined for the retrievalUri.

the error should be more explicit in saying it requires $schema key and found none or couldn't figure out the metaschema.

That's good feedback that the message isn't clear. The problem is that there are multiple reasons the dialect would be unknown and multiple ways to fix it. It's hard to fit all that in an Error message, but I'll see what I can do.

I assumed the functions from "@hyperjump/json-schema/draft-2020-12" would automatically assume draft-2020-12 dialect when not provided

I can see why that would be confusing. The way it works is that all functions work with any configured dialect. This allows for supporting multiple dialects and things like referencing a draft-07 schema from a 2020-12 schema. The dialect-specific imports load support for that dialect, but the functions it exposes are just the generic functions that work with any dialect. You can load as many dialects as you need and those functions will work with all of them. I considered separating the api from loading dialects, but I didn't want users to have to use two imports just to get started. You're the second person who's been confused by that, so maybe I made the wrong choice.

Don't know about that, it's mainly a noise in the schema collection derived from the same metaschema.

There are good reasons for it, but I won't get into that here except to say that we're making changes for the next version of JSON Schema that will render those "reasons" moot and it will make sense to not include the dialect in the schema anymore.

GabenGar commented 1 year ago

You can read about the Retrieval URI concept here. The short version is that you would use the retrievalUri argument if your schema doesn't include $id.

How alternate is it allowed to be? Given these Retrieval URIs and their schemas:

jdesrosiers commented 1 year ago

$id is a complicated mess. The schema identifier is determined by resolving the $id against the retrieval URI. Since the $id is absolute, the resolved URI is the $id and the retrieval URI has no effect. You can reference the schema using either that identifier or the retrieval URI. If there's a conflict between a schema identifier and a retrieval URI, the schema identifier wins and the retrieval URI gets shadowed. So, in your example, the https://example.com/schema/account schema will point to the Account schema and the https://example.com/schema/profile schema will point to the Profile schema.

If you had only loaded the first schema, both https://example.com/schema/account and https://example.com/schema/profile would point to the Profile schema, but no matter which URI was used, https://example.com/schema/profile would be the base URI for resolving any references in the Profile schema.
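The rule that the identifier comes from resolving $id against the retrieval URI can be checked with the standard URL API. A minimal sketch: `schemaIdentifier` is a hypothetical helper, and the schemas are illustrative stand-ins, not the ones from the question.

```javascript
// Hypothetical helper: resolve a schema's `$id` against its retrieval URI.
function schemaIdentifier(retrievalUri, schema) {
  // Without `$id`, the retrieval URI itself identifies the schema.
  if (schema.$id === undefined) return new URL(retrievalUri).href;
  return new URL(schema.$id, retrievalUri).href;
}

const retrievalUri = "https://example.com/schema/account";

// An absolute `$id` ignores the retrieval URI entirely.
const absolute = schemaIdentifier(retrievalUri, { $id: "https://example.com/schema/profile" });

// A relative `$id` is resolved against the retrieval URI.
const relative = schemaIdentifier(retrievalUri, { $id: "profile" });

// No `$id`: the retrieval URI is the identifier.
const none = schemaIdentifier(retrievalUri, {});

console.log(absolute, relative, none);
```

The first two calls produce the same identifier, which is why an absolute $id makes the retrieval URI irrelevant for identification.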

All that contributes to why I discourage using $id even though the spec encourages it. Just using retrieval URIs is much simpler, more natural, and easier to maintain. However, the retrievalUri argument is just simulating something like what would happen with a web request (http(s)://) or file system access (file://). This library allows you to not just simulate retrieval URIs, but to actually use them. In your case, I'd suggest using your schemas as files. They're files anyway, and translating them to use some arbitrary identifier is more work and error-prone. Your application code would work the same way it does now, except you would pass a URI like file:///path/to/schema/account.schema.json to your createValidator function and you would no longer need an init function at all to load schemas. The schemas are loaded directly from the filesystem. I'm working on an enhancement to allow you to use paths relative to the calling file so you don't have to write out the full file: URI, but for now it's usually pretty easy to generate those paths.
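One way to generate a full file: URI from a relative path is to resolve it against a known base with the standard URL API. This is a sketch under assumptions: `schemaUri` and the base location are made up for illustration, and in an ES module the base would typically be `import.meta.url`.

```javascript
// Build an absolute file: URI from a path relative to a known base URI.
// A fixed, made-up base is used here so the sketch runs anywhere.
const baseUri = "file:///app/src/validators.js"; // hypothetical location

function schemaUri(relativePath, base = baseUri) {
  return new URL(relativePath, base).href;
}

const accountUri = schemaUri("./schemas/account.schema.json");
const sharedUri = schemaUri("../shared/address.schema.json");

console.log(accountUri); // file:///app/src/schemas/account.schema.json
console.log(sharedUri); // file:///app/shared/address.schema.json
```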

jdesrosiers commented 1 year ago

Actually, I just realized that that code is in a directory called "frontend", so this is probably running in a browser. The same concept applies, but instead of using file: URIs, you can use http(s): URIs (https://localhost:3000/schemas/account). Of course that requires you host those schemas at some URI in your website. Those schemas would otherwise be bundled in your JavaScript, so you're not exposing anything you wouldn't otherwise be exposing.

If, for some reason, you're still not comfortable with that solution, I'd still recommend using retrievalUris in addSchema to identify schemas rather than $id. It's simpler and it's all you need.

jdesrosiers commented 1 year ago

I forgot to mention, if you serve your schemas, you need to serve them with Content-Type: application/schema+json; schema="https://json-schema.org/draft/2020-12/schema". The schema parameter sets the default dialect so you don't need to use $schema.

If you use this approach on the filesystem, there's no alternative to using $schema in every schema.
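To make the role of the schema parameter concrete, here is a sketch that extracts it from a Content-Type header value. `dialectFromContentType` is a hypothetical helper, not the library's implementation, and it ignores the more exotic quoting rules HTTP allows.

```javascript
// Hypothetical helper: read the `schema` media type parameter from a
// Content-Type header value. Simplified; it does not handle every quoting
// rule allowed in HTTP headers.
function dialectFromContentType(contentType) {
  const [mediaType, ...params] = contentType.split(";").map((part) => part.trim());
  if (mediaType.toLowerCase() !== "application/schema+json") return undefined;
  for (const param of params) {
    const separator = param.indexOf("=");
    if (separator === -1) continue;
    const name = param.slice(0, separator).trim().toLowerCase();
    if (name !== "schema") continue;
    // Strip optional surrounding double quotes from the value.
    return param.slice(separator + 1).trim().replace(/^"(.*)"$/, "$1");
  }
  return undefined;
}

const header = 'application/schema+json; schema="https://json-schema.org/draft/2020-12/schema"';
const dialect = dialectFromContentType(header);
console.log(dialect);
```

When the parameter is absent, this helper returns undefined, which corresponds to the case where the schema itself has to declare its dialect.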

GabenGar commented 1 year ago

If you had only loaded the first schema, both https://example.com/schema/account and https://example.com/schema/profile would point to the Profile schema, but no matter which URI was used, https://example.com/schema/profile would be the base URI for resolving any references in the Profile schema.

This "sometimes kinda $id but not quite" behaviour doesn't sound too swell, as it introduces order-dependent resolution results.

All that contributes to why I discourage using $id even though the spec encourages it.

How are you supposed to reference other schemas within schemas without "$id" value set?

Just using retrieval URIs is much simpler, more natural, and easier to maintain.

It isn't actually, as it assumes schemas can be downloaded with fetch/read from file system at runtime. And my situation is neither.

They're files anyway, and translating them to use some arbitrary identifier is more work and error-prone.

I consider this a boon, as it stops various implementations from "helpfully" assuming things and makes them crash at schema resolution time instead of some time after a fetch/fs call with a cryptic error message.

The same concept applies, but instead of using file: URIs, you can use http(s): URIs (https://localhost:3000/schemas/account). Of course that requires you host those schemas at some URI in your website.

Introducing unknown amount of waterfalling http calls just to compile a validation function is a pretty bad idea for the same reason running browser ESM without bundling is bad. Considering one of the videos on json schema youtube channel said they had ~100 levels of nesting (although I only had ~5 in my private hello world repo), no static server will tolerate additional 5-100 fetch calls on each page transition. It will either error out and break the whole chain anyway or shape traffic to the point it will result in a janky UX.

so this is probably running in a browser

It does but it's not relevant to the subject at hand. I just import the schema files as js modules which then get inlined into the bundle at build time, so for the purpose of the code I feed the schemas as js objects to the addSchema() functions. No intention of runtime or even build time fetching down the line.

jdesrosiers commented 1 year ago

How are you supposed to reference other schemas within schemas without "$id" value set?

You reference them by their retrieval URI. Think of an HTML document in a browser. The URI you use to retrieve the HTML is the base URI for the document. Any relative-reference URIs are resolved against that base URI and retrieved (usually with HTTP, but other URI schemes are usually supported as well). Referencing in a schema works exactly the same way, except you can manually determine how a URI resolves to a schema using the retrievalUri argument of the addSchema function, skipping normal URI scheme-based resolution such as making an HTTP request.
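A toy registry shows the idea: references are resolved against the containing schema's base URI and then looked up locally, so nothing is fetched. The names `registry`, `register`, and `resolveRef` are hypothetical and only model the behavior described here; they are not the library's API.

```javascript
// Toy registry keyed by retrieval URI; names are hypothetical, not library API.
const registry = new Map();

function register(schema, retrievalUri) {
  registry.set(new URL(retrievalUri).href, schema);
}

// Resolve a reference against the base URI of the schema that contains it,
// then look it up locally. Nothing is fetched.
function resolveRef(ref, baseUri) {
  const target = new URL(ref, baseUri).href;
  if (!registry.has(target)) throw new Error(`No schema registered for ${target}`);
  return registry.get(target);
}

register(
  { type: "object", properties: { address: { $ref: "./address" } } },
  "https://example.com/schemas/person"
);
register({ type: "string" }, "https://example.com/schemas/address");

// "./address" relative to ".../schemas/person" is ".../schemas/address".
const resolved = resolveRef("./address", "https://example.com/schemas/person");
console.log(resolved);
```

An unregistered target fails immediately at resolution time in this model instead of falling through to a network or filesystem lookup.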

Using the retrievalUri argument instead of $id, you're still assigning identifiers to all of your schemas and referencing schemas the same way, you're just assigning that identifier in a different way. I prefer the retrieval URI approach to the $id approach because it's simple, is similar to how all other web technologies work, and the same pattern works in cases where you actually do want to retrieve schemas from the filesystem or the web.

In case it wasn't clear, although it's technically allowed by the spec and this library, there's no good reason to use both the retrievalUri argument and $id for the same schema. You should only use one at a time.

Introducing unknown amount of waterfalling http calls just to compile a validation function is a pretty bad idea

You're not wrong, but I think using appropriate HTTP cache headers for your schemas addresses most of this concern. Also, as I understand it, HTTP/2/3 multiplexing and connection reuse features make sending many requests not the same kind of performance concern that it used to be. I also find the claim of getting anywhere near 100 levels of nested references dubious and at best an extreme outlier, but if that's a situation where you find yourself, I agree this probably isn't the right approach.

In any case, I admit that my suggestion is more applicable to the filesystem than the web, which was what I thought was the case when I first mentioned it. At the least, you'd have to change the way you've organized things because there are different trade-offs in play.

I see that you're not comfortable with the approach of using normal URI scheme-based resolution. That's totally fine. This library supports multiple approaches and you're free to choose which works best for your situation.

jdesrosiers commented 11 months ago

I just released an update to improve the experience when not declaring a dialect. You'll now get an error with the following message,

Unable to determine a dialect for the schema. The dialect can be declared in a number of ways, but the recommended way is to use the '$schema' keyword in your schema.

GabenGar commented 11 months ago

You reference them by their retrieval URI. Think of an HTML document in a browser. The URI you use to retrieve the HTML is the base URI for the document. Any relative-reference URIs are resolved against that base URI and retrieved (usually with HTTP, but other URI schemes are usually supported as well). Referencing in a schema works exactly the same way, except you can manually determine how a URI resolves to a schema using the retrievalUri argument of the addSchema function, skipping normal URI scheme-based resolution such as making an HTTP request.

What is the source of truth for the retrieval URI? Also I am thinking of JSON schemas as a fancy input validation DSL, not as something related to web documents in a browser. The retrieval URI approach forces the schemas themselves to be aware of retrieval specifics at declaration time, when the "real" URL will be known at build time at best.

Using the retrievalUri argument instead of $id, you're still assigning identifiers to all of your schemas and referencing schemas the same way, you're just assigning that identifier in a different way. I prefer the retrieval URI approach to the $id approach because it's simple, is similar to how all other web technologies work, and the same pattern works in cases where you actually do want to retrieve schemas from the filesystem or the web.

Clearly it's not "simple" because it is a source of confusion in this very issue right now. Also I wouldn't call anything related to HTTP/file systems simple. Especially file systems, since Windows and Linux don't agree even on basic things like case sensitivity and path separators, so it's easy to end up in a situation where the retrievalUri can resolve differently depending on host OS. It's much more simple to assume "$id" is just a JSON string which can be checked for strict equality in all languages, instead of relying on built-in url parser/http/fs capabilities. Also fragment being important for the schema resolution while being completely ignored on the web doesn't help with "it's just an URL, bro" argument.

You're not wrong, but I think using appropriate HTTP cache headers for your schemas addresses most of this concern. Also, as I understand it, HTTP/2/3 multiplexing and connection reuse features make sending many requests not the same kind of performance concern that it used to be. I also find the claim of getting anywhere near 100 levels of nested references dubious and at best an extreme outlier, but if that's a situation where you find yourself, I agree this probably isn't the right approach.

The majority of the internet runs on HTTP/1.1, and HTTP/2 has its own security issues (such as completely nullifying network jitter capabilities of TCP). Regardless of protocol, it's still megabytes of data which doesn't even need to be sent because it's known at build time.

I also find the claim of getting anywhere near 100 levels of nested references dubious and at best an extreme outlier, but if that's a situation where you find yourself, I agree this probably isn't the right approach.

That's not my claim, it's either in one of those interview videos on youtube channel or somewhere in json schema related discussions. I am merely at 5 levels, which would still upset nginx at default config.

jdesrosiers commented 11 months ago

I am thinking of JSON schemas as a fancy input validation DSL, not as something related to web documents in a browser.

I think of JSON Schema as a validation DSL as well, but it's also built on web technologies. It can be both.

Clearly it's not "simple" because it is a source of confusion in this very issue right now.

There were two separate questions on StackOverflow just in the last few days of people confused about why their file-relative references aren't working. I see this kind of question in some channel or another about once every two weeks on average. There are definitely a lot of people that expect this behavior and find the self identification concept foreign.

it's easy to end up in a situation where the retrievalUri can resolve differently depending on host OS

This isn't a problem because JSON Schema works with URIs, which are universal. This library translates the URI into a file system path appropriate to the environment it's running in.

It's much more simple to assume "$id" is just a JSON string which can be checked for strict equality in all languages

I see that self-identification makes more sense to you. That's fine. Feel free to use what's most comfortable for you.

fragment being important for the schema resolution while being completely ignored on the web doesn't help with "it's just an URL, bro" argument.

JSON Schema uses fragments exactly the same way they are used on the web. In HTML, you can set an "id" on an element and when used in a URI fragment, the browser moves to that position in the document. JSON Schema works the same way. The application/schema+json media type defines a behavior for the URI fragment, including JSON Pointer support. The JSON Schema validator retrieves the whole schema document, but sets the view (so to speak) of the document (schema) to the sub-schema the fragment points to.
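A sketch of that behavior for JSON Pointer fragments: the whole document is available, and the fragment selects the sub-schema. `evaluatePointer` and `selectSubschema` are hypothetical helpers. Plain-name anchor fragments are not handled here.

```javascript
// Hypothetical sketch: evaluate a JSON Pointer against a schema document.
function evaluatePointer(doc, pointer) {
  if (pointer === "") return doc; // empty pointer selects the whole document
  return pointer
    .split("/")
    .slice(1) // drop the empty token before the leading "/"
    .map((token) => token.replace(/~1/g, "/").replace(/~0/g, "~"))
    .reduce((node, token) => node[token], doc);
}

// Take the fragment of a URI and use it to select a sub-schema.
function selectSubschema(uri, doc) {
  const pointer = decodeURIComponent(new URL(uri).hash.slice(1));
  return evaluatePointer(doc, pointer);
}

const schemaDocument = {
  type: "object",
  $defs: { name: { type: "string" } }
};

const sub = selectSubschema("https://example.com/schemas/person#/$defs/name", schemaDocument);
console.log(sub);
```

A URI without a fragment selects the whole document, which matches the "view" analogy: same document, different position.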


I suggest we close this issue at this point. The original issue of the confusing error message has been addressed and we're off on a tangent. I'm sorry you didn't find my suggestion helpful. Please do continue to use self-identification if that's what you're most comfortable with. That functionality is present and fully functional. I will, however, continue to provide URI scheme-based retrieval for those who find that more useful. If you want to be 100% sure you don't accidentally bump into scheme-based retrieval, I suggest using a URI scheme other than file: or http(s): such as urn:, tag:, or even something custom. Those of us working on the JSON Schema spec have been discussing the idea of promoting the use of schema: URIs rather than URIs that imply a location that might not exist.