Get and set object values by path

LeaVerou commented 8 months ago

Background

This came up in #18 as an implementation detail, but I think it's incredibly useful in its own right.

Sample use cases and prior art:

Mavo.subset()
Lodash’s set() and unset() functions (and their various permutations)
Underscore’s get()
Bliss’s $.value()

I pushed two util functions a few days ago (getByPath() and setByPath()), but I think we should clean up the code, make it more flexible, and expose them at the top level.

Some design decisions, requirements, questions below.

Signature

Function names

I’m leaning towards the simple get() and set() that are already established in prior work. Is there anything else we may want to save the names get() and set() for?

Arguments

Strawman:

get(obj, path [, options])
set(obj, path, value [, options])

It might be worth adding overloads later that allow specifying value and path as part of the options object, but I don't see a compelling motivation to include that in the MVP.

Should there be a way to set value via a function that takes the path and current value as parameters? Or is it encroaching too much into transform() territory at this point?

Path structure

Data type

Paths should be provided as arrays: we don’t want to deal with string parsing and trying to distinguish paths from property names. Strings/numbers should be accepted as well, but they’re just a path of length 1.
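To make this concrete, here is a minimal sketch of what get() could look like for purely literal paths. It is only an illustration under the strawman signature, not the getByPath() that was already pushed; the toPath() helper is a hypothetical name, and returning undefined for a missing path is an assumption.

```javascript
// Hypothetical sketch: literal paths only, no wildcards or predicates.
function toPath(path) {
  // A bare string or number is treated as a path of length 1.
  return Array.isArray(path) ? path : [path];
}

function get(obj, path) {
  let current = obj;
  for (const key of toPath(path)) {
    // Stop early instead of throwing when the path runs out.
    if (current === null || current === undefined) {
      return undefined;
    }
    current = current[key];
  }
  return current;
}

const data = { "a.b": { list: [10, 20] } };
console.log(get(data, ["a.b", "list", 1])); // 20
console.log(get(data, "a.b") === data["a.b"]); // true
console.log(get(data, ["missing", "deeper"])); // undefined
```

Note that "a.b" is looked up as a single property name: since no string parsing happens, there is nothing to escape.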

We may want to also support objects to provide additional metadata (see below).

Predicates

It seems obvious that entirely literal paths will not suffice (at the very least we need wildcards). Should we just use JSON Path? Hell no! First, it's overkill for these use cases, and second, once you go beyond literal property names + wildcards, the syntax becomes cryptic AF. And despite its complexity, there are some pretty common use cases, like case insensitivity, that it doesn’t seem to support.

So since we can’t just use JSON Path, what do we use? What predicates do we want to support? Examples:

Wildcards (any property at this level)
Case-insensitive property names?
Alternatives? (e.g. "foo or bar")
Ranges of numbers? (e.g. "top 3 items")
Property queries (e.g. "get items with id=foo") — essentially the path version of CSS :has(), so we'd probably want to frame it that way, i.e. "children that match this path", so I’ll call them child queries from now on
Property names that start/end with a given string?
Property name regex?

We generally want to keep the MVP simple until use cases emerge, but it helps to take these things into account at the design stage so that the API has room to expand.

As mentioned above, wildcards are certainly needed. Case-insensitive matching might be worth including in the MVP, since at least the Mavo use cases need it. The rest we can probably ship without and add as needed.

Syntax for predicates

So that raises the question: how do we express these predicates?

Special syntax. This works decently for some of them:

Wildcards: *
Alternatives: foo|bar
Number ranges: 0-3 or 0 .. 3
Property queries: id=foo

However, there is no obvious fit for any of the others. Also, inventing a new microsyntax has several drawbacks:

The larger the syntax space for special syntax, the more challenging to distinguish it from literal property names. How do you do the escaping? Backslashes? How do you distinguish a literal "\*" property then? More backslashes? It's backslashes all the way down!
Devs would need to use string concatenation when the criteria are variable, which is awkward. E.g. Mavo paths support property queries like id=foo and I now think that's a terrible idea, so we dropped that kind of support from get() (it's now only supported in mv-path, which, being an HTML attribute, only takes strings and so can't take anything more structured).
It forces you to come up with syntax for things where there is no obvious syntax to use, resulting in a cryptic language.

So instead, I think we should go with an approach of strings for literals + wildcards as the only exception, since these are very common and have a very obvious syntax. Anything else would require making that part of the path an object literal.

This means that even if we ship wildcards as the only predicate, we need to support object literals at least to escape that and specify that something is a literal property name. If we have that escape hatch, we could in the future explore adding syntax as a shortcut for certain things where a readable syntactic option is obvious (e.g. "foo|bar" for alternatives).

Predicate schema

Strawman for all of the above predicates (even though we don't plan to implement them all):

Path: string | (string | PathSegment)[]
PathSegment: Object with keys (all optional):
- name: Literal property name (string) but maybe could also be a RegExp?
- ignoreCase (boolean)
- range: Numerical range (number[2] or {from, to} or even {gt, gte, lt, lte}?)
- or: Alternatives ((string | PathSegment)[])
- has: Return only children for which this would be non-empty (Path)
- startsWith
- endsWith
- regexp

Notes:

ignoreCase is special. All other criteria are independent, but ignoreCase affects how other criteria work, i.e. is a modifier rather than a predicate:
- name: from strict equality to equality after .toLowerCase()
- regexp: Adds the i flag if not present
- startsWith/endsWith: applies .toLowerCase() before matching
- or and has: inherits to any path segments that don't have their own ignoreCase
- Are there any other modifiers that we may conceivably want to support in the future (so we can take them into account in the design)?
Multiple independent criteria can be specified, and the result is the intersection. This way, since we already have or, complex logical criteria can be created by just nesting these. 😁
Should we also handle arrays as sugar for {or: array}?
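As a rough illustration of how the notes above could fit together, here is a sketch of matching a single property key against a single path segment. The field names follow the strawman schema, but the function name matchesSegment(), the inherited parameter, and the exact semantics are assumptions; range and has are left out because they depend on decisions not made yet.

```javascript
// Hypothetical matcher for one path segment against one property key.
function matchesSegment(key, segment, inherited = {}) {
  if (typeof segment === "string" || typeof segment === "number") {
    // Plain values: "*" is the wildcard, anything else is a literal name.
    segment = segment === "*" ? {} : { name: segment };
  }

  // ignoreCase is a modifier: it changes how the other criteria compare.
  const ignoreCase = segment.ignoreCase ?? inherited.ignoreCase ?? false;
  const norm = (s) => (ignoreCase ? String(s).toLowerCase() : String(s));
  const k = norm(key);

  // Every criterion that is present must pass (intersection).
  if (segment.name !== undefined && norm(segment.name) !== k) return false;
  if (segment.startsWith !== undefined && !k.startsWith(norm(segment.startsWith))) return false;
  if (segment.endsWith !== undefined && !k.endsWith(norm(segment.endsWith))) return false;

  if (segment.regexp !== undefined) {
    // Add the i flag if ignoreCase is on and the flag is not already there.
    const { source, flags } = segment.regexp;
    const re = new RegExp(source, ignoreCase && !flags.includes("i") ? flags + "i" : flags);
    if (!re.test(String(key))) return false;
  }

  if (segment.or !== undefined) {
    // Alternatives inherit ignoreCase unless they set their own.
    const anyMatch = segment.or.some((alt) => matchesSegment(key, alt, { ignoreCase }));
    if (!anyMatch) return false;
  }

  return true; // an empty segment ({}) matches everything
}

console.log(matchesSegment("Title", { name: "title", ignoreCase: true })); // true
console.log(matchesSegment("id", { name: "*" })); // false
console.log(matchesSegment("BAR", { or: ["foo", "bar"], ignoreCase: true })); // true
```

Because criteria intersect and or nests, something like {startsWith: "data", or: [{endsWith: "Id"}, {endsWith: "Key"}]} would work without any extra syntax.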

How do predicates work with `set()`?

Setting is only an issue for the last part of the path — until then it's still a getting task.

So if the last part of the path is a…

Wildcard: Set every property that exists?
Alternatives: Set every property among the alternatives or only those that already exist on the object?
Numerical ranges: set every number in the range?
Child queries: 🤷🏽‍♀️ Replace these objects with the value?
Regexps or Starts/ends with: 🤷🏽‍♀️🤷🏽‍♀️🤷🏽‍♀️

Return value

One or more values? For static paths, there can only be a single return value. However, when predicates are involved, there could be multiple, and it's impossible to tell whether one or more values are expected.
Array or object subset?: when returning multiple values, how much of the original object structure do we want to preserve? There are use cases for getting a completely flat array, and use cases for subsetting the object, i.e. using the path as an allowlist of properties while preserving the original object structure.

Following the design principle that function return values should not vary wildly based on the options passed, perhaps we actually need more than just a single get() function:

get(): Array of all values
first(): First value only
subset(): Subset of object

Or perhaps get() for one value and getAll() for multiple?
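Here is a sketch of how the "array of all values" and "first value only" variants could relate, assuming "*" is the only predicate. The names getAll() and first() come from the options above, but which names ship is still open; subset() is not covered here.

```javascript
// Hypothetical sketch: "*" is the only predicate; everything else is literal.
function getAll(obj, path) {
  let current = [obj];
  for (const segment of path) {
    const next = [];
    for (const value of current) {
      // Primitives and null have no children to descend into.
      if (value === null || typeof value !== "object") continue;
      if (segment === "*") {
        next.push(...Object.values(value));
      } else if (Object.prototype.hasOwnProperty.call(value, segment)) {
        next.push(value[segment]);
      }
    }
    current = next;
  }
  return current; // always a flat array, possibly empty
}

function first(obj, path) {
  // Same traversal, but the return type is always a single value.
  return getAll(obj, path)[0];
}

const tree = { left: { name: "leaf1" }, right: { name: "leaf2" } };
console.log(getAll(tree, ["*", "name"])); // [ 'leaf1', 'leaf2' ]
console.log(getAll(tree, ["*", "missing"])); // []
console.log(first(tree, ["left", "name"])); // leaf1
```

With this split, the return type depends only on which function is called, never on whether the path happens to contain a wildcard.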

Options for the whole path

These will be passed to the functions as part of the options dictionary.

Case insensitive matching (when we want it for the whole path)
set() only: What object to create when part of the path doesn’t exist? {} by default. Might be useful to take a function to customize based on the path.
We definitely don't want to throw if the path doesn't exist, since avoiding that is one of the primary reasons to use such a helper. Is there value in having an opt-in to stricter handling?
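A sketch of the "what object to create" option, for literal paths only. The option name create and the fact that it receives the path walked so far are assumptions for illustration; so is the choice to replace anything that is not an object when it sits in the middle of the path.

```javascript
// Hypothetical sketch: literal paths only; `create` is an assumed option name.
function set(obj, path, value, options = {}) {
  const keys = Array.isArray(path) ? path : [path];
  if (keys.length === 0) return obj;

  // Called with the path walked so far; returns the container to insert.
  const create = options.create ?? (() => ({}));

  let current = obj;
  for (let i = 0; i < keys.length - 1; i++) {
    const key = keys[i];
    const child = current[key];
    // Anything that is not an object (including undefined) gets replaced.
    if (child === null || typeof child !== "object") {
      current[key] = create(keys.slice(0, i + 1));
    }
    current = current[key];
  }

  current[keys[keys.length - 1]] = value;
  return obj;
}

const target = {};
set(target, ["a", "b", "c"], 1);
console.log(JSON.stringify(target)); // {"a":{"b":{"c":1}}}

// Customize based on the path: make the top-level container an array.
set(target, ["list", 0, "id"], "x", { create: (p) => (p.length === 1 ? [] : {}) });
console.log(Array.isArray(target.list), target.list[0].id); // true x
```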

adamjanicki2 commented 7 months ago

Here are some of my thoughts on everything above:

Function names

I also think these are the simplest applications of getting and setting: just getting and setting arbitrary nodes. So I'm on board with get and set as names; I'm not sure there are any other applications they would fit better as names for.

Or perhaps get() for one value and getAll() for multiple?

I like this so that we're not returning a single node in the generic case and an array in the case where wildcards/other more complex operations were involved.

How do predicates work with set()?

I think by default set should only set paths if they exist, including the case where a wildcard/other expression comes into play, in which case it should set all matching and existing paths. Then we can allow an optional param, something along the lines of o.setNonexistent, which would enable an author to tell us to set a path even if it does not exist.

But now that I'm thinking about it, setting a non-existent path is challenging because we do not know for sure how to add nodes to their tree structure. For example, what if their node is a custom class? We couldn't simply create objects/properties to create this path. So we'll have to think about this case more. Maybe this is a sign that we should hold off on this use case until we verify that we need to support it?
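For reference, the "only set what exists" default could be sketched roughly like this. setNonexistent is the option name floated above; the function name setExisting() and the returned count are hypothetical. To sidestep the custom class problem, this sketch only ever adds the final property and never creates intermediate containers.

```javascript
// Hypothetical sketch of the "set only existing paths" default.
const hasOwn = (o, k) => Object.prototype.hasOwnProperty.call(o, k);
const isObject = (v) => v !== null && typeof v === "object";

function setExisting(obj, path, value, { setNonexistent = false } = {}) {
  if (path.length === 0) return 0;

  // Everything before the last segment is a "get" problem.
  let parents = [obj];
  for (const segment of path.slice(0, -1)) {
    parents = parents.flatMap((parent) => {
      if (!isObject(parent)) return [];
      if (segment === "*") return Object.values(parent);
      return hasOwn(parent, segment) ? [parent[segment]] : [];
    });
  }

  const last = path[path.length - 1];
  let count = 0;
  for (const parent of parents) {
    if (!isObject(parent)) continue;
    // A wildcard can only ever touch keys that already exist.
    const keys = last === "*"
      ? Object.keys(parent)
      : (hasOwn(parent, last) || setNonexistent) ? [last] : [];
    for (const key of keys) {
      parent[key] = value;
      count++;
    }
  }
  return count; // number of properties written
}

const tree = { left: { name: "leaf1" }, right: { name: "leaf2" } };
console.log(setExisting(tree, ["*", "name"], "x")); // 2
console.log(setExisting(tree, ["left", "color"], "red")); // 0
console.log(setExisting(tree, ["left", "color"], "red", { setNonexistent: true })); // 1
```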

Predicate schema

I like the idea of having this since it provides flexibility to add new operations and features in the future, and would allow us to start with simple and common use cases first and add new ones as they arise.

LeaVerou commented 7 months ago

Let’s start simple. Paths are arrays with values:

string | number: wildcard or property name
{} (empty object): same as wildcard
{name: string | number}: literal name (i.e. {name: "*"} is not a wildcard).

Thoughts?
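In code, the three cases above could be normalized like this. This is just a sketch to check that the rules are unambiguous; normalizeSegment() and matchingKeys() are hypothetical names.

```javascript
// Hypothetical normalizer for the simplified path schema.
function normalizeSegment(segment) {
  if (segment !== null && typeof segment === "object") {
    // {} is a wildcard; {name} is always a literal, even when name is "*".
    return "name" in segment ? { name: segment.name } : { wildcard: true };
  }
  // Strings and numbers: "*" is the wildcard, anything else a property name.
  return segment === "*" ? { wildcard: true } : { name: segment };
}

function matchingKeys(obj, segment) {
  const { wildcard, name } = normalizeSegment(segment);
  if (wildcard) return Object.keys(obj);
  return Object.prototype.hasOwnProperty.call(obj, name) ? [String(name)] : [];
}

const obj = { "*": 1, a: 2 };
console.log(matchingKeys(obj, "*")); // [ '*', 'a' ]
console.log(matchingKeys(obj, {})); // [ '*', 'a' ]
console.log(matchingKeys(obj, { name: "*" })); // [ '*' ]
console.log(matchingKeys(obj, "b")); // []
```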

LeaVerou commented 7 months ago

Wrt setting, the idea is we'd use {} as a default, but users can customize it

adamjanicki2 commented 7 months ago

Let’s start simple. Paths are arrays with values:

string | number: wildcard or property name

{} (empty object): same as wildcard

{name: string | number}: literal name (i.e. {name: "*"} is not a wildcard).

Thoughts?

I like it: it's simple and easy to understand.

adamjanicki2 commented 7 months ago

@LeaVerou A few more clarifying points on get before implementing:

1. Should this function be calling context.getChildProperties in the case of a wildcard?
2. Should this function be checking that a key in the path is actually a valid key of that node (i.e. checking that context.isNode(node[key])) before exploring further?
3. In the case where node is something like {left: {name: "leaf1"}, right: {name: "leaf2"}}, and path is ["*", "nonexistentKey"], should it return [] since after the wildcard nothing matched nonexistentKey (meaning the path was not valid)?
4. In the case where there are no wildcards in the path, and the path does not exist, should it return undefined or null?

Just wanted to get your opinion on these things. For 1-3, my answer would be yes, it should do those things, and for 4, I would lean toward returning undefined.

LeaVerou commented 7 months ago

After thinking about this some more, I wonder if we could get rid of all this complexity and just have an array of properties that point to one or more children. The nodes that have a single children property (or whatever it's called) are basically special cases of how children work in ASTs, since there you have nodes that point to single children OR arrays of children. The only wart is how to figure out whether node[childProperty] points to a single child node or a data structure containing many children, but that's what isNode() is for!
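Roughly what I have in mind, as a sketch. The standalone getChildren() signature, the childProperties array, and the toy isNode() below are all assumptions for illustration, not the existing API.

```javascript
// Hypothetical sketch: collect children via a list of child properties.
function getChildren(node, childProperties, isNode) {
  const children = [];
  for (const property of childProperties) {
    const value = node[property];
    if (isNode(value)) {
      // The property points at a single child node.
      children.push(value);
    } else if (value !== null && typeof value === "object") {
      // Otherwise treat it as a container (array or object) of children.
      children.push(...Object.values(value).filter(isNode));
    }
  }
  return children;
}

// Toy definition of a node: any object with a string `type`.
const isNode = (value) =>
  value !== null && typeof value === "object" && typeof value.type === "string";

const call = {
  type: "CallExpression",
  callee: { type: "Identifier", name: "f" },
  arguments: [
    { type: "Literal", value: 1 },
    { type: "Literal", value: 2 },
  ],
};

const kids = getChildren(call, ["callee", "arguments"], isNode);
console.log(kids.map((child) => child.type)); // [ 'Identifier', 'Literal', 'Literal' ]
```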

adamjanicki2 commented 7 months ago

After thinking about this some more, I wonder if we could get rid of all this complexity and just have an array of properties that point to one or more children. The nodes that have a single children property (or whatever it's called) are basically special cases of how children work in ASTs, since there you have nodes that point to single children OR arrays of children. The only wart is how to figure out whether node[childProperty] points to a single child node or a data structure containing many children, but that's what isNode() is for!

I like this idea much better than having a wildcard operator and all the complex syntax needed to distinguish it from "*" as a standard key.

adamjanicki2 commented 7 months ago

So if I'm understanding your idea correctly, get would look like function get(node, path) where path is Array<string | number>, and among those properties could be something like children, which itself is not a node but contains nodes (since it's either an object or an array), in which case we'd return all of them.

For set(node, path, value), it would be similar, except one question I have is what to do if the path ends with a type that's not a node but contains them, for example, path = ["children"]. In this case, should it set all nodes inside children to value?

LeaVerou commented 7 months ago

It means get() and set() are not on the critical path any more.

LeaVerou commented 6 months ago

@adamjanicki2 what happened with this? Being able to set how to get from parent to children in a more generic way is pretty essential.

adamjanicki2 commented 6 months ago

@adamjanicki2 what happened with this? Being able to set how to get from parent to children in a more generic way is pretty essential.

What does this mean? Are you referring to general set/get functions or something else?

mavoweb / treecle