halaxa / json-machine

Efficient, easy-to-use, and fast PHP JSON stream parser
Apache License 2.0

Document-dependent path evaluation #102

Open jakajancar opened 1 year ago

jakajancar commented 1 year ago

Let's say we have a property with an array of dynamically-typed, user-provided parameters.

$options = ['pointer' => '/user-provided-parameters/-'];

Items::fromString('{"user-provided-parameters": [1,2,3]}', $options);
// Expected: [1,2,3]
// Actual: same

Items::fromString('{"user-provided-parameters": [1,false,null]}', $options);
// Expected: [1,false,null]
// Actual: same

Items::fromString('{"user-provided-parameters": [1, [2,3]]}', $options);
// Expected: [1, [2,3]]
// Actual: [1, 2, 3]

Items::fromString('{"user-provided-parameters": []}', $options);
// Expected: []
// Actual: Exception: Paths '/user-provided-parameters/-' were not found in json stream.

This makes JSON Machine not very useful for working with documents with a more dynamic schema. Moreover, even arrays have a special case at length == 0.

I checked the JSON Pointer spec to see whether this is an implementation bug or by design. It seems JSON Pointer is not intended for JSONPath-like selection at all, but for navigating to a single node. Even the - is interpreted differently: in the spec it refers to the (nonexistent) member after the last array element, whereas here it is a wildcard that matches any array index. A spec-conforming pointer would not have the above problem and would always navigate to the expected subtree. It would be better if the readme said "a syntax inspired by JSON Pointer".

Re. a solution, it would be great if there was an option to not automatically descend deeper than the specified path and make the subtree selection not dependent on the values in it.
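For comparison, here is a minimal sketch of single-node navigation in the RFC 6901 sense. It works on data already decoded with json_decode(), so it is not streaming, and resolvePointer() is a hypothetical helper written for this illustration, not part of JSON Machine.

```php
<?php

/**
 * Resolve an RFC 6901 JSON Pointer against json_decode($json, true) output.
 * Illustrative only: objects and arrays both become PHP arrays here.
 */
function resolvePointer(mixed $document, string $pointer): mixed
{
    if ($pointer === '') {
        return $document; // the empty pointer addresses the whole document
    }
    if ($pointer[0] !== '/') {
        throw new InvalidArgumentException("Pointer must be empty or start with '/'");
    }

    $current = $document;
    foreach (explode('/', substr($pointer, 1)) as $token) {
        // Unescape "~1" first, then "~0", as the RFC requires.
        $token = str_replace(['~1', '~0'], ['/', '~'], $token);
        if (!is_array($current) || !array_key_exists($token, $current)) {
            // Also covers "-" on a list: it names a member that does not exist.
            throw new OutOfBoundsException("Nothing found at token '$token'");
        }
        $current = $current[$token];
    }

    return $current;
}

$doc = json_decode('{"user-provided-parameters": [1, [2, 3]]}', true);
$subtree = resolvePointer($doc, '/user-provided-parameters'); // [1, [2, 3]]
```

The result is the subtree itself, whatever its contents, which is the value-independent behaviour asked for above.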

halaxa commented 1 year ago

TL;DR Just omit the dash at the end.

Hi. Your example works as expected. It seems that in your case the JSON Pointer (pointer option) is just not used correctly. The pointer option means "iterate over items in this element". If you only need to iterate over the items in the user-provided-parameters key, just use /user-provided-parameters as the pointer. The dash at the end means "any index", so it matches /user-provided-parameters/0, /user-provided-parameters/1, and so on, and then tries to iterate over whatever is at that index. If you need more explanation, let me know or have a second look at the JSON Machine documentation.

jakajancar commented 1 year ago

Thanks for the quick response! You're right.

I tried to reduce the case and did it incorrectly. Let me try again:

Let's say we have a number[][] matrix where we want to iterate through cells, same as:

function cells($matrix) {
    foreach ($matrix as $row) {
        foreach ($row as $cell) {
            yield $cell;
        }
    }
}
$options = ['pointer' => '/table/-'];

Items::fromString('{"table": [[1,2], [3,4]]}', $options);
// Expected: [1,2,3,4]
// Actual: same

Items::fromString('{"table": [[1,2], 3]}', $options);
// Expected: error
// Actual: [1,2,3]

Is this possible?

jakajancar commented 1 year ago

And the reason I was using /table/-/- was because then you get nice results in getCurrentJsonPointer():

1  -  /table/0/0
2  -  /table/0/1
3  -  /table/1/0
4  -  /table/1/1

jakajancar commented 1 year ago

What are your thoughts on an option "flatten" => false (default true), where, of your examples:

| JSON Pointer value | Will iterate through |
|---|---|
| (empty string - default) | ["this", "array"] or {"a": "this", "b": "object"} will be iterated (main level) |
| /result/items | {"result": {"items": ["this", "array", "will", "be", "iterated"]}} |
| /0/items | [{"items": ["this", "array", "will", "be", "iterated"]}] (supports array indices) |
| /results/-/status | {"results": [{"status": "iterated"}, {"status": "also iterated"}]} (a hyphen as an array index wildcard) |
| / (gotcha! - a slash followed by an empty string, see the spec) | {"":["this","array","will","be","iterated"]} |
| /quotes\" | {"quotes\"": ["this", "array", "will", "be", "iterated"]} |

all of them would return a single item, except /results/-/status (with an explicit wildcard), which would return the same as today?

halaxa commented 1 year ago

I'm not sure what the question is now. Can you be more specific?

Anyway, let me just elaborate a little on the flatten topic. JSON Machine supports finding data in a JSON document down to a single scalar value if needed. It does that automatically. If it finds a scalar value at a pointer instead of an object or an array, it just yields it in a single iteration. So it might seem that it somehow flattens the structure when used in combination with - and when the structure is not rigid. But in reality, no such thing happens.

Try this and you'll see no deep flattening is happening:

$options = ['pointer' => '/table/-'];

Items::fromString('{"table": [[[1,2]], [3,4]]}', $options);
// Expected: [[1,2],3,4]

Also, this example is not expected to produce an error:

$options = ['pointer' => '/table/-'];
Items::fromString('{"table": [[1,2], 3]}', $options);

because at /table/0 there is [1,2] which is sequentially iterated, and at /table/1 there is 3 which is a scalar value and as such it's simply yielded as a single value.
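The rule described above can be approximated on already-decoded data. The following is only a mental model under stated assumptions: matchElements() and modelItems() are hypothetical names, the code is not streaming, it ignores pointer escaping and drops keys, and it is not how JSON Machine is implemented.

```php
<?php

/**
 * Collect every element matched by $tokens, where "-" matches any integer
 * index. Operates on json_decode($json, true) output.
 */
function matchElements(mixed $node, array $tokens): array
{
    if ($tokens === []) {
        return [$node]; // pointer fully consumed: this element matches
    }
    if (!is_array($node)) {
        return []; // cannot descend into a scalar
    }

    $token = array_shift($tokens);
    $matches = [];
    foreach ($node as $key => $child) {
        $wildcardHit = $token === '-' && is_int($key);
        if ($wildcardHit || (string) $key === $token) {
            $matches = array_merge($matches, matchElements($child, $tokens));
        }
    }

    return $matches;
}

/** Model of "iterate over items in this element" as explained in the thread. */
function modelItems(string $json, string $pointer): array
{
    $tokens = $pointer === '' ? [] : explode('/', substr($pointer, 1));
    $items = [];
    foreach (matchElements(json_decode($json, true), $tokens) as $element) {
        if (is_array($element)) {
            // An array or object is iterated exactly one level deep.
            foreach ($element as $item) {
                $items[] = $item;
            }
        } else {
            $items[] = $element; // a scalar is yielded as a single item
        }
    }

    return $items;
}

$nested = modelItems('{"table": [[[1,2]], [3,4]]}', '/table/-'); // [[1,2], 3, 4]
$mixed = modelItems('{"table": [[1,2], 3]}', '/table/-');         // [1, 2, 3]
```

Under this model the output depends on whether each matched element is a scalar or a container, which is the document-dependent behaviour the issue is about.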

jakajancar commented 1 year ago

I would expect a behavior where:

Currently, even a non-wildcard component explodes the items (but has nowhere to indicate this in the path), if the element pointed to is an object/array. It is this behavior that I would like to have a way to disable.


Below is (yet another) example, which demonstrates both my concerns (indexes in getCurrentJsonPointer() and unpredictable levels).

Say you have a two-level array mixed[][], where all of these are valid:

{"2d": [[1,2], [3]]}
    $value['2d'][0][0] (/2d/0/0) = 1
    $value['2d'][0][1] (/2d/0/1) = 2
    $value['2d'][1][0] (/2d/1/0) = 3
{"2d": [[1,2], [3,true]]}
    $value['2d'][0][0] (/2d/0/0) = 1
    $value['2d'][0][1] (/2d/0/1) = 2
    $value['2d'][1][0] (/2d/1/0) = 3
    $value['2d'][1][1] (/2d/1/1) = true
{"2d": [[1,2], [3,[4,5]]]}
    $value['2d'][0][0] (/2d/0/0) = 1
    $value['2d'][0][1] (/2d/0/1) = 2
    $value['2d'][1][0] (/2d/1/0) = 3
    $value['2d'][1][1] (/2d/1/1) = [4,5]

The following is not valid, because it's not really mixed[][]:

{"2d": [[1,2], false]}
    $value['2d'][0][0] (/2d/0/0) = 1
    $value['2d'][0][1] (/2d/0/1) = 2
    $value['2d'][1][0] = error

I would like to

  1. properly get the elements in the valid examples,
  2. know their indexes, and
  3. (ideally) somewhat gracefully handle the invalid example (error or ignore the non-matching value).
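The three requirements above can be written down for an already-decoded document as follows. This is a sketch of the desired semantics only: cellsWithPointers() is a hypothetical helper, it is not streaming, and it does not use JSON Machine.

```php
<?php

/**
 * Yield "/<key>/<row>/<col>" => cell for a mixed[][] value, without ever
 * descending into the cells themselves. Throws if a row is not an array.
 */
function cellsWithPointers(array $document, string $key = '2d'): Generator
{
    $rows = $document[$key] ?? null;
    if (!is_array($rows)) {
        throw new UnexpectedValueException("/$key is not an array");
    }

    foreach ($rows as $r => $row) {
        if (!is_array($row)) {
            // The invalid example: a scalar where a row was expected.
            throw new UnexpectedValueException("/$key/$r is not an array");
        }
        foreach ($row as $c => $cell) {
            // The cell is yielded as-is, even if it is itself an array.
            yield "/$key/$r/$c" => $cell;
        }
    }
}

$valid = json_decode('{"2d": [[1,2], [3,[4,5]]]}', true);
$cells = iterator_to_array(cellsWithPointers($valid));
// ['/2d/0/0' => 1, '/2d/0/1' => 2, '/2d/1/0' => 3, '/2d/1/1' => [4, 5]]
```

Because this is a generator, the exception for the invalid example is raised only when iteration reaches the offending row, after the earlier cells have been yielded.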

This cannot be currently achieved:

halaxa commented 1 year ago

Sorry for being brief ;)

jakajancar commented 1 year ago

No worries, I appreciate your responses, responsiveness, and patience with me iterating on trying to get the best example.

  • If you use /2d/-/-

    • ❌ Third valid example ([[1,2], [3,[4,5]]]) gets flattened (and you get 5 items)

    • That's a feature, not a bug as explained earlier.

Yes, I understand. But disabling this feature is essentially my feature request! :D

  • If you use /2d/-:

    • ❌ You do not get both indices, only the first.

    • Ok, this seems weird. Can you give the exact output? Could it be the same problem as #100 ("Why only red is output")?

I'm not saying that the items do not get iterated over, just that in the getCurrentJsonPointer() return value you don't have both indices (which makes sense, since there is no "placeholder" for them).

  • ❌ The invalid example gets silently ignored (you get same items as first valid example)

    • Not-found items get ignored. That's normal behavior. It's as if you wanted the find command to fail on every existing file in the searched dir that does not match searched string.

By "silently ignored" I don't mean not returned by the iterator (that's what happens with /2d/-/- and that's OK), but returned exactly as if it had been in a different structure.


Perhaps I owe an explanation for this admittedly weird use-case:

I'm querying OpenAI's text completions API with the new function calling/structured output mechanism, which returns JSON. JSON Machine is used to return results in a streaming fashion to the user live (see videos here if curious). That table should be string[][], and 95% of the time it is, but occasionally the model hallucinates and omits a level of nesting, adds a level of nesting, or returns the wrong number of rows or cells. So when iterating over /2d/-/- I check that both indexes increase monotonically with no gaps, that the values are indeed strings, and so on... very defensively.
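A defensive check of that kind might look like the sketch below. It assumes the caller has already collected pointer => value pairs (for example from getCurrentJsonPointer() during iteration), that pointers have the /2d/<row>/<col> shape shown earlier, and that no row is empty; findProblems() is a hypothetical name.

```php
<?php

/**
 * Verify that "/2d/<row>/<col>" pointers arrive in row-major order with no
 * gaps and that every value is a string. Returns a list of problem messages.
 */
function findProblems(iterable $cells): array
{
    $problems = [];
    $prevRow = -1;
    $prevCol = -1;

    foreach ($cells as $pointer => $value) {
        if (!preg_match('#^/2d/(\d+)/(\d+)$#', (string) $pointer, $m)) {
            $problems[] = "$pointer: unexpected pointer shape";
            continue;
        }
        [$row, $col] = [(int) $m[1], (int) $m[2]];

        // Allowed steps: next column in the same row, or first column of the next row.
        $sameRowNextCol = $row === $prevRow && $col === $prevCol + 1;
        $nextRowFirstCol = $row === $prevRow + 1 && $col === 0;
        if (!$sameRowNextCol && !$nextRowFirstCol) {
            $problems[] = "$pointer: index out of sequence";
        }
        if (!is_string($value)) {
            $problems[] = "$pointer: expected string, got " . get_debug_type($value);
        }

        [$prevRow, $prevCol] = [$row, $col];
    }

    return $problems;
}

$ok = findProblems(['/2d/0/0' => 'a', '/2d/0/1' => 'b', '/2d/1/0' => 'c']); // []
```

Returning messages instead of throwing lets a streaming consumer decide whether to stop or to keep rendering what has arrived so far.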


To recap, I don't think path no. 2 (/2d/-) is the way forward. /2d/-/- is mostly there, but I would prefer not to have that auto-descent feature.

halaxa commented 1 year ago

But disabling this feature is essentially my feature request! :D

Now it makes perfect sense 😁. Because in terms of JSON Machine, there's no 'flattening', I'd suggest modifying the scalar parsing logic, which is what's actually behind your problem. Maybe an option something like iterate_scalars, with three settings:

This example of yours:

$options = ['pointer' => '/table/-'];

Items::fromString('{"table": [[1,2], 3]}', $options);
// Expected: error
// Actual: [1,2,3]

would then throw an error with option 'iterate_scalars' => NEVER
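To make the proposal concrete, here is a sketch of how such a switch could behave on decoded data. Nothing here exists in JSON Machine: the constant names and iterateMatched() are hypothetical, and only the NEVER setting is taken from the comment above.

```php
<?php

// Hypothetical setting names; only "never" is named in the proposal.
const ITERATE_SCALARS_CURRENT = 'current'; // behaviour as described in this thread
const ITERATE_SCALARS_NEVER = 'never';     // a scalar at the pointer is an error

/** Iterate one matched element according to the chosen scalar setting. */
function iterateMatched(mixed $element, string $iterateScalars): Generator
{
    if (is_array($element)) {
        yield from $element; // arrays and objects are iterated as usual
        return;
    }
    if ($iterateScalars === ITERATE_SCALARS_NEVER) {
        throw new UnexpectedValueException('Scalar found where an array or object was expected');
    }
    yield $element; // current behaviour: the scalar becomes a single item
}

// Elements matched by '/table/-' in {"table": [[1,2], 3]}.
$matched = json_decode('{"table": [[1,2], 3]}', true)['table'];

$items = [];
foreach ($matched as $element) {
    foreach (iterateMatched($element, ITERATE_SCALARS_CURRENT) as $item) {
        $items[] = $item;
    }
}
// $items is [1, 2, 3]; with ITERATE_SCALARS_NEVER the loop throws at the 3.
```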

halaxa commented 1 year ago

Also for a less predictable structure maybe #36 would help?

halaxa commented 2 weeks ago

What do you think about the solution proposed above (the iterate_scalars option as a feature request)?