unique id for schemas generated with Type.Object()

xddq commented 1 month ago

Hey @sinclairzx81, hope you are doing good!

I built a small package for validating data with typebox in a single line, with a throwing function which captures all errors in a single error and adds the string formats I typically need. I also added caching of the compiled schemas trying to get the best performance. However, I did not get an idea how find/create a suitable cache key for the given schema objects in a sane way to gain performance benefits. Therefore I am only caching compiled schemas whenever the schema has an $id field. It is "okayish" but not my preferred solution as the user will have to take care of this. Just for reference, the code is here. Anyway, I got these questions:

Is there a unique id for any given schema created with Type.Object()? If this is the case, how can I read it?
What do you think about creating and assigning a unique id to the object created with Type.Object()? Perhaps something like an internal field, possibly name could be __internalId or something similar. This could be an uuidv4, or perhaps the sha of the json.stringify representation? My understanding would be that this only has a one-time cost when creating the schema objects, which typically is done once and 'at startup' of the program. Is this something that you would consider adding..?

ehaynes99 commented 1 month ago

That's error prone because you don't have any real guarantee that the $id values are unique. Schemas should generally be defined in a static context, and Map will take an object as a key, so you can just use the schema as the key and the compiled TypeCheck as the value. JS has a WeakMap type that is particularly good for caching, because if the key has no other references, the entry will be garbage collected.

(I would actually think that this would be a good thing to add to the TypeCompiler itself... there's no risk of leaking)

import type { TSchema } from '@sinclair/typebox'
import { type TypeCheck, TypeCompiler } from '@sinclair/typebox/compiler'

const cache = new WeakMap<TSchema, TypeCheck<any>>()

export const getTypeCheck = <T extends TSchema>(schema: T): TypeCheck<T> => {
  let typeCheck = cache.get(schema)
  if (!typeCheck) {
    typeCheck = TypeCompiler.Compile(schema)
    cache.set(schema, typeCheck)
  }
  return typeCheck as TypeCheck<T>
}

So given a schema like:

type Customer = Static<Customer>
export const Customer = Type.Object({
  email: Type.String(),
  name: Type.String(),
  // ...
})

any other place in the codebase can use:

if (getTypeCheck(Customer).Check(data)) {
  console.log(data.name)
}

xddq commented 4 weeks ago

That's error prone because you don't have any real guarantee that the $id values are unique.

Yup, but it still seems to be the best option (for now..?)

Interesting, I did not use WeakMap yet. But I wonder, why would I want a garbage collected cache? I don't expect the schemas to be created dynamically or even change after application startup.

Also, is it really that fast to use an object as key? I thought about using JSON.stringify(obj) to create the cache key, but then again I assumed/guessed that this would take around the same time as Typecheck.Compile so I did not even bother with that approach.

(I would actually think that this would be a good thing to add to the TypeCompiler itself... there's no risk of leaking)

Agreed, it seems to be a somewhat low hanging fruit, but I have no idea if it really is. Therefore I created this issue 😅 However, I would just use a Map (no weakmap) and use the JSON.stringified object (or probably rather just generate a uuidv4 once) as the cache key.

Perhaps we just wait for a response of @sinclairzx81

sinclairzx81 commented 4 weeks ago

@xddq Hi, sorry for the delay on this, have been a bit busy with other projects recently. Ill try provide some answers below.

Is there a unique id for any given schema created with Type.Object()? If this is the case, how can I read it?

Currently, the only type that generates a unique identifier is Type.Recursive() as the $id is required to self reference via $ref. The id can be overridden (which is recommended) by passing an explicit $id on the recursive SchemaOptions. The $id itself is just a auto incrementing string value.

What do you think about creating and assigning a unique id to the object created with Type.Object()? Perhaps something like an internal field, possibly name could be __internalId or something similar. This could be an uuidv4, or perhaps the sha of the json.stringify representation? My understanding would be that this only has a one-time cost when creating the schema objects, which typically is done once and 'at startup' of the program. Is this something that you would consider adding..?

I actually tried to implement auto $id generation in a previous TB revision (where I attempted to get TypeBox auto dereferencing schematics throughout the library using an internal type cache (i.e. Map)), unfortunately the implementation didn't work out well so I never went ahead with it. In that implementation though, I wasn't very keen on generating identifiers that were not explicitly specified by the user; this is because users who didn't want them (i.e. those who publish their schematics) couldn't trivially opt out.

From the above, I reasoned it was generally better for users to opt-in by explicitly assigning the $id rather than TypeBox automatically handling that assignment internally. I would still like to explore auto deref functionality in TypeBox, it's just that the implementation of it shouldn't rely on $id generation.

I think @ehaynes99's suggestion of using either Map or WeakMap would be the way to go if you need to cache compiled schematics TypeCheck<T> or schematics. Once a type (or compiled check) is created, the JS reference to that type doesn't change; and it's possible to use a JS reference as a key on either Map and WeakMap ....so something like the following should work.

import { TypeCompiler, TypeCheck } from '@sinclair/typebox/compiler'
import { Type, TSchema } from '@sinclair/typebox'

const T = Type.String()

const C = new Map<TSchema, TypeCheck<any>>()

C.set(T, TypeCompiler.Compile(T))

const check = C.get(T)

console.log(check) // TypeCheck

Can the above work for your implementation? Cheers! S

ehaynes99 commented 4 weeks ago

But I wonder, why would I want a garbage collected cache?

If ALL of your keys are statically defined, then there is no difference between using Map and WeakMap, because the statically declared values will always exist. WeakMap simply protects you from a memory leak in cases where nothing else is holding on to the key, so in general is more suitable for a cache where the keys are objects. Unrelated to this specific case, but consider if you wanted to cache some data for the lifetime of an http Request. If you use the Request as the key in a standard Map, it would hold on to every single request and quickly run out of memory, but in a WeakMap, those values will be ejected from the cache without you having to explicitly delete them.

Also, is it really that fast to use an object as key? I thought about using JSON.stringify(obj) to create the cache key

It is EXTREMELY fast to use an object as a key. Remember that JavaScript compares objects only by identity, so a === b only if they're the same instance. Internally, the JS engine can just store its own internal identifier for that value, so the computational complexity of creating the key is ZERO for both insertion and lookup. JSON.stringify, on the other hand, is one of the most computationally complex actions that most applications do.

xddq commented 4 weeks ago

@ehaynes99 ahh, I see. Thats awesome, thanks!

@sinclairzx81 No worries, I am not in a hurry. Thanks for the quick response!

I wasn't very keen on generating identifiers that were not explicitly specified by the user; this is because users who didn't want them (i.e. those who publish their schematics) couldn't trivially opt out.

I see. But isnt this what strict should be doing? If you were to pick a fixed unlikely attribute name and then include stripping it with strict that would (at least with me knowing not to much about the insides) seem like a decent/valid way to get caching into the typecompile.compile().

Anyways my "problem/issue" was not really within typebox and was solved. So depending on whether you would want to add caching to the typecompiler or not, feel free to just close this. I would be down to try adding it within a PR, but probably only in several weeks and not now.

sinclairzx81 commented 3 weeks ago

@xddq Hi!

I see. But isnt this what strict should be doing? If you were to pick a fixed unlikely attribute name and then include stripping it with strict that would (at least with me knowing not to much about the insides) seem like a decent/valid way to get caching into the typecompile.compile().

The Strict function is mostly a carry over from a very old version of TypeBox which is used to omit compositing symbols from the types (it was added when Ajv mandated no unknown keywords, and was a quick way to omit symbols in the case Ajv or other validator complained about them (without having to configure the validator)).

You're right that symbols are additional properties on the schema, but they are fixed (non-random) and not auto generated (so you always get back the same symbol for the same type, which would be unlike auto generated identifiers (although granted, hashes would yield the same id, but unfortunately there is too much compute cost hashing each type)). This said, I do see value in auto generated identifiers for some workflows (including the codegen project), but their introduction would have too many flow on effects elsewhere, so they are on the back burner...at least for now.

Will close of this issue for now as auto generating identifiers are considered to be out of scope (for now), but there may be potential to include "something" via a TypeSystem.AutoGenerateId = true setting in future. This wouldn't be planned for in the short term tho.

All the best! S

sinclairzx81 / typebox

unique id for schemas generated with Type.Object() #892