halaxa / json-machine

Efficient, easy-to-use, and fast PHP JSON stream parser
Apache License 2.0
1.08k stars 65 forks source link

Multiple JSON Pointers #61

Closed fwolfsjaeger closed 2 years ago

fwolfsjaeger commented 3 years ago

I also need the "unused" (not iterated) JSON, so I've extended your code to store this in a variable. I hope you don't mind.

halaxa commented 3 years ago

Thanks for being interested in this project. To be honest, I am concerned about 2 things.

  1. Contents of getUnusedJson seem a little unintuitive. Doesn't it?
  2. Filling up storedJson goes against the philosophy of this library. How do you prevent its size from getting out of available memory?
fwolfsjaeger commented 3 years ago
  1. You're right, if we're iterating over an array the method will return something like "[,,,,,]". I will fix this so that it returns "[]".
  2. It does consume additional memory, but other than writing to a temporary file I can't get that data without introducing a breaking change. I will check this again tomorrow.
halaxa commented 3 years ago

Let's talk essentials. What do you need the unused JSON parts for?

fwolfsjaeger commented 3 years ago

I have to split a large JSON containing meta-data, which I need once the data is processed.

Example:

{
    "meta_data": {
        "total_rows": 5,
        "chunk_size": 2,
        "created_at": "2021-08-11T01:00:00+00:00"
    },
    "companies": [
        {
            "id": "1",
            "company": "ACME 1",
            "email": "office@acme1.com"
        },
        {
            "id": "2",
            "company": "ACME 2",
            "email": "office@acme2.com"
        },
        {
            "id": "3",
            "company": "ACME 3",
            "email": "office@acme3.com"
        },
        {
            "id": "4",
            "company": "ACME 4",
            "email": "office@acme4.com"
        },
        {
            "id": "5",
            "company": "ACME 5",
            "email": "office@acme5.com"
        }
    ]
}

Output 1:

{
    "meta_data": {
        "total_rows": 5,
        "chunk_size": 3,
        "created_at": "2021-08-11T01:00:00+00:00",
        "total_files": 2,
        "file_number": 1
    },
    "companies": [
        {
            "id": "1",
            "company": "ACME 1",
            "email": "office@acme1.com"
        },
        {
            "id": "2",
            "company": "ACME 2",
            "email": "office@acme2.com"
        },
        {
            "id": "3",
            "company": "ACME 3",
            "email": "office@acme3.com"
        }
    ]
}

Output 2:

{
    "meta_data": {
        "total_rows": 5,
        "chunk_size": 3,
        "created_at": "2021-08-11T01:00:00+00:00",
        "total_files": 2,
        "file_number": 2
    },
    "companies": [
        {
            "id": "4",
            "company": "ACME 4",
            "email": "office@acme4.com"
        },
        {
            "id": "5",
            "company": "ACME 5",
            "email": "office@acme5.com"
        }
    ]
}
fwolfsjaeger commented 3 years ago

What do you think about extending the parser to allow multiple JSON pointers? Parser.getIterator() would have to yield an array / object then.

halaxa commented 3 years ago

I have had exactly this idea for some time and I think that would solve your problem. In your case, you would specify JsonMachine::fromFile(['/meta_data', '/companies'], 'file.json') instead of a single string with a JSON pointer. Is that correct?

Parser::getIterator() would not need to yield a multilevel structure though. I was wondering about method Parser::getCurrentJsonPointer() which would return matched path for the current element being yielded. It would simplify user iteration as well as implementation. What do you think?

fwolfsjaeger commented 3 years ago

It would solve my problem, right. Using Parser::getCurrentPath() sounds good as well. If you don't mind, I will amend the code to allow multiple pointers.

halaxa commented 3 years ago

Feel free to play with it. However, there should be two additional methods, not just one. Firstly Parser::getCurrentJsonPointer(), which returns JSON pointer for the current element being iterated with properly encoded ~1 and ~0, and something like Parser::getMatchedJsonPointer() which returns exactly one of those specified in the constructor. The reason is you can currently use * in JSON pointer as a wildcard for any array element and the user needs to know which of his JSON pointers is being matched. Does it make sense? Also, please be careful about performance.

If you preferred me to do it or simply do not have time to do it, you can solve your situation right now by simply iterating your JSON file twice. Firstly iterate the file over /meta_data and then over /companies. Don't worry, it will not parse the whole file twice, as in the first case, the parser will stop parsing right after it finishes iteration of the /meta_data subtree.

fwolfsjaeger commented 3 years ago

I have removed the "unused JSON logic" and extended the JSON pointer to allow multiple values. However, I have only added Parser::getMatchedJsonPointer(), which returns the original JSON pointer string.

Also, I have amended the Parser Unittest, which I could only test locally after upgrading to PHPunit 9.5.8. The PHPUnit upgrade is not part of this PR though.

What do you think?

halaxa commented 3 years ago

Can you please split it into 2 commits for easier orientation in the code? 1. Revert of the "unused" and 2. Your new implementation? Feel free to force push, thanks.

fwolfsjaeger commented 3 years ago

Sure, I've reverted my initial commit. Then I found an issue in the new implementation, which took quite a while to fix. I have added another Unittest for it.

halaxa commented 3 years ago

Thank you. I apologize for not replying. I have a kind of free-ish time of the year. I'll get to it as soon as I can. In the meantime, can you please amend README as well so it's in sync with these changes? It would be greatly appreciated. Thank you very much.

fwolfsjaeger commented 3 years ago

Sure, I've amended the README with a description of the multiple-pointer logic.

fwolfsjaeger commented 2 years ago

Did you have time to check the changes?

halaxa commented 2 years ago

I have more spare time now. Yesterday I was working on #62, so it is coming.

fwolfsjaeger commented 2 years ago

Conflicts resolved by rebasing your changes ...

halaxa commented 2 years ago

Thanks. I'll try to get to it asap.

fwolfsjaeger commented 2 years ago

Can you commit the failing test?

Yes, along with a solution. I have to keep track of an additional $currentPathWildcard variable though.

fwolfsjaeger commented 2 years ago

@halaxa I've merged your latest changes and resolved the conflicts.

halaxa commented 2 years ago

Thanks. Does it mean it is ready to merge from your side?

fwolfsjaeger commented 2 years ago

It works and all tests pass, but I guess it would make sense to run some performance-tests. Do you have existing tests for that?

halaxa commented 2 years ago

Good point. Just run make performance-tests (uses docker in the background) or composer performance-tests (no docker). You can compare the results with and without your feature. I've done some performance tweaking on Lexer so that 0.8.0 will have about 150% performance of 0.7.* (what took 10s will now take about 7s). So slight performance decrease of this feature will hopefully fit in so that it stays on the plus side.

fwolfsjaeger commented 2 years ago

Thanks for the hint! I've run composer performance-tests 10 - 20 times for each branch and didn't notice a significant difference.

halaxa commented 2 years ago

Thanks. Can you, please, look at the CHANGELOG and accommodate the changes - which are coming with version 0.8.0 -accordingly?

fwolfsjaeger commented 2 years ago

My branch contains all changes from 0.8.0 now.

halaxa commented 2 years ago

You are a very zealous coder. So before you commit anything, let's talk first. :) What if we added option pointers only accepting arrays (in case of multiple pointers) and pointer would keep only accepting strings to enhance clarity and readability to end-users?

fwolfsjaeger commented 2 years ago

We would have to add another array key to the "from" methods as well. Also, for the Parser class it doesn't make any difference.

What about adding a copy of the Items class - named something like DiverseItems or VaryingItems - instead of just a dedicated constructor argument?

halaxa commented 2 years ago

We would have to add another array key to the "from" methods as well.

Yes, they are all checked in one place.

Also, for the Parser class, it doesn't make any difference.

Also yes - no problem. This would be just convenient naming in Items options for the developers. I'd keep Parser's constructor as it is.

halaxa commented 2 years ago

Let's revive this idea:

What if we force users to provide the pointers in the order of their appearance in the document? The code would just sequentially advance to the next pointer when one finishes. The pointer would remain constant in a sense as the code would still be checking one pointer at a time. The code would get much simpler and no intersection checks would be necessary. If the document ends and some pointers remain a warning may be emitted. Another upside would be that the pointers could also be provided in a form of an endless iterator. What do you think?

Am I right that it would only require some change here and there and yet it would cover 99.9 % of the use cases? We would only advance to the next JSON pointer here if a next JSON pointer is available. Everything would be much much simpler.

fwolfsjaeger commented 2 years ago

I am fetching data from a REST endpoint that dynamically creates the data based on a last-modified timestamp. Unfortunately, in that case I don't know the order. However, I've tried to change it to check the performance, but I could not get it to work like that for multiple JSON pointers.

I could remove the check for intersecting JSON pointers, but this really doesn't cost much time/resources. However, I can add a check so that the intersection check is only executed if multiple pointers are provided.

Also, we only need to call the method getMatchingJsonPointerPath() if there are multiple JSON pointers. For huge JSON documents with only a single pointer this might actually make a difference. With those changes the performance should be the same as with your original code, unless there are multiple JSON pointers.

halaxa commented 2 years ago

I was only reviving the idea to make the parser less complex. I wasn't concerned about the performance. But since your own use case would be the 0.01 %, let's go with it. Let me just do some final reviews ;)

fwolfsjaeger commented 2 years ago

Some additional small changes:

  1. I've renamed the variables I've introduced for consistency.
  2. I've reverted a backwards-incompatible change I've made: JsonMachine\Parser::getJsonPointer() returns a string just like before again.
  3. I've introduced the check variable $singleJsonPointer so that the performance for single pointer operations is not affected by my changes.
halaxa commented 2 years ago

Items::getMatchedJsonPointer() should stay, shouldn't it?

halaxa commented 2 years ago

See #issuecomment-896821976

fwolfsjaeger commented 2 years ago

Items::getJsonPointer() calls Parser::getJsonPointer(), which returns the current JSON pointer.

halaxa commented 2 years ago

Tomorrow :)

halaxa commented 2 years ago

Items::getJsonPointer() calls Parser::getJsonPointer(), which returns the current JSON pointer.

Yes, however, that's a major change in behavior. Previously it was just a simple property getter. That's why I think it should be:

getJsonPointer(): string - keep the behavior (deprecate the method?)

getJsonPointers(): array - returns an array of all provided JSON pointers (simple property getter)

getCurrentJsonPointer(): string - returns the currently iterated path in the document in JSON pointer format (null or Exception outside of a foreach?)

getMatchedJsonPointer(): string - returns currently matched JSON pointer - one of the provided ones (null or Exception outside of a foreach?)

What do you think? Do better names come to mind?

fwolfsjaeger commented 2 years ago

Actually, I've broken backwards compatibility by returning an array of all JSON pointers in Parser::getJsonPointer(). With my latest changes Parser::getJsonPointer() returns the current JSON pointer as a string, just like before.

Current state: Parser::getJsonPointer(): string - returns the current JSON pointer (if there is only one it behaves just like before) Parser::getJsonPointers(): array - returns an array of all JSON pointers Parser::getJsonPointerPath(): string - returns the current JSON pointer path (if there is only one it behaves just like before) Parser::getJsonPointerPaths(): array - returns an array of all JSON pointers paths

To make it clear that the current JSON pointer [path] is returned we could deprecate the existing methods and add them with other names. I would keep the current state, but it is your project and therefor your call ;)

halaxa commented 2 years ago

Parser::getJsonPointerPath(): string was there only for testing reasons. It was not meant to provide any useful info as a part of the API. I think it just remained there as an artifact a remnant of initial TDD. It uncovers internal implementation and I guess no one needs this format for anything. Therefore I would not introduce Parser::getJsonPointerPaths(): array and deprecate Parser::getJsonPointerPath(): string.

Parser::getCurrentJsonPointer(): string and Parser::getMatchedJsonPointer(): string are a different story. They specifically say what exactly they do and are very useful in the foreach when someone iterates over multiple JSON Pointers or uses - as a wildcard. They tell you the current position and which JSON Pointer exactly matched this position. They also don't introduce a new format of JSON document path but rather provide already known JSON Pointer. That can be directly put back to the pointer option of any from* factories without any modification if one wishes.

Does it make sense like this?

fwolfsjaeger commented 2 years ago

Makes sense, I'll make the following changes then:

  1. deprecate Parser::getJsonPointer()
  2. remove Parser::getJsonPointerPath()
  3. add Parser::getCurrentJsonPointer() returning the current path (no wildcards) in JSON pointer format
  4. and Parser::getMatchedJsonPointer() returning the currently matched JSON pointer
halaxa commented 2 years ago

Awsome, thanks. Also, I think that Parser::getCurrentJsonPointer() and Parser::getMatchedJsonPointer() should throw if used outside of a loop.

fwolfsjaeger commented 2 years ago

I agree, both throw a JsonMachineException if called outside the foreach loop.

I've added 3 new tests, one to check the return values of the 2 new methods and two more to check for the exceptions. They have to be separate as checking the exeption doesn't work otherwise.

halaxa commented 2 years ago

Thanks. I don't see Parser::getJsonPointers(): array though. Did you push everything?

fwolfsjaeger commented 2 years ago

I have not added Parser::getJsonPointers() as it would just return the same array that has been provided. If you want, I can add it though.

halaxa commented 2 years ago

Yes, that's the goal. The users should be able to get what they put in. Thanks

halaxa commented 2 years ago

I added dome edge case tests to the new getters and one of them is failing. Can you please extend the functionality so it conforms to the JSON Pointer standard? Root level iteration is actually an empty-string JSON Pointer. Thanks.

fwolfsjaeger commented 2 years ago

You are right, I've amended the method.

halaxa commented 2 years ago

Not sure if you were notified. Can you check this comment, please?

fwolfsjaeger commented 2 years ago

Sorry, but I got no notification and can't see any comment.

halaxa commented 2 years ago

No problem, I was just asking why did you remove ['/', '{"":{"c":1,"d":2}}', ['/', '/']] test case in ParserTest?