Closed fwolfsjaeger closed 2 years ago
Thanks for being interested in this project. To be honest, I am concerned about 2 things.
getUnusedJson
seem a little unintuitive. Doesn't it?storedJson
goes against the philosophy of this library. How do you prevent its size from getting out of available memory?Let's talk essentials. What do you need the unused JSON parts for?
I have to split a large JSON containing meta-data, which I need once the data is processed.
Example:
{
"meta_data": {
"total_rows": 5,
"chunk_size": 2,
"created_at": "2021-08-11T01:00:00+00:00"
},
"companies": [
{
"id": "1",
"company": "ACME 1",
"email": "office@acme1.com"
},
{
"id": "2",
"company": "ACME 2",
"email": "office@acme2.com"
},
{
"id": "3",
"company": "ACME 3",
"email": "office@acme3.com"
},
{
"id": "4",
"company": "ACME 4",
"email": "office@acme4.com"
},
{
"id": "5",
"company": "ACME 5",
"email": "office@acme5.com"
}
]
}
Output 1:
{
"meta_data": {
"total_rows": 5,
"chunk_size": 3,
"created_at": "2021-08-11T01:00:00+00:00",
"total_files": 2,
"file_number": 1
},
"companies": [
{
"id": "1",
"company": "ACME 1",
"email": "office@acme1.com"
},
{
"id": "2",
"company": "ACME 2",
"email": "office@acme2.com"
},
{
"id": "3",
"company": "ACME 3",
"email": "office@acme3.com"
}
]
}
Output 2:
{
"meta_data": {
"total_rows": 5,
"chunk_size": 3,
"created_at": "2021-08-11T01:00:00+00:00",
"total_files": 2,
"file_number": 2
},
"companies": [
{
"id": "4",
"company": "ACME 4",
"email": "office@acme4.com"
},
{
"id": "5",
"company": "ACME 5",
"email": "office@acme5.com"
}
]
}
What do you think about extending the parser to allow multiple JSON pointers? Parser.getIterator() would have to yield an array / object then.
I have had exactly this idea for some time and I think that would solve your problem. In your case, you would specify JsonMachine::fromFile(['/meta_data', '/companies'], 'file.json')
instead of a single string with a JSON pointer. Is that correct?
Parser::getIterator()
would not need to yield a multilevel structure though. I was wondering about method Parser::getCurrentJsonPointer()
which would return matched path for the current element being yielded. It would simplify user iteration as well as implementation. What do you think?
It would solve my problem, right. Using Parser::getCurrentPath() sounds good as well. If you don't mind, I will amend the code to allow multiple pointers.
Feel free to play with it. However, there should be two additional methods, not just one. Firstly Parser::getCurrentJsonPointer()
, which returns JSON pointer for the current element being iterated with properly encoded ~1
and ~0
, and something like Parser::getMatchedJsonPointer()
which returns exactly one of those specified in the constructor. The reason is you can currently use *
in JSON pointer as a wildcard for any array element and the user needs to know which of his JSON pointers is being matched. Does it make sense? Also, please be careful about performance.
If you preferred me to do it or simply do not have time to do it, you can solve your situation right now by simply iterating your JSON file twice. Firstly iterate the file over /meta_data
and then over /companies
. Don't worry, it will not parse the whole file twice, as in the first case, the parser will stop parsing right after it finishes iteration of the /meta_data
subtree.
I have removed the "unused JSON logic" and extended the JSON pointer to allow multiple values. However, I have only added Parser::getMatchedJsonPointer(), which returns the original JSON pointer string.
Also, I have amended the Parser Unittest, which I could only test locally after upgrading to PHPunit 9.5.8. The PHPUnit upgrade is not part of this PR though.
What do you think?
Can you please split it into 2 commits for easier orientation in the code? 1. Revert of the "unused" and 2. Your new implementation? Feel free to force push, thanks.
Sure, I've reverted my initial commit. Then I found an issue in the new implementation, which took quite a while to fix. I have added another Unittest for it.
Thank you. I apologize for not replying. I have a kind of free-ish time of the year. I'll get to it as soon as I can. In the meantime, can you please amend README as well so it's in sync with these changes? It would be greatly appreciated. Thank you very much.
Sure, I've amended the README with a description of the multiple-pointer logic.
Did you have time to check the changes?
I have more spare time now. Yesterday I was working on #62, so it is coming.
Conflicts resolved by rebasing your changes ...
Thanks. I'll try to get to it asap.
Can you commit the failing test?
Yes, along with a solution. I have to keep track of an additional $currentPathWildcard variable though.
@halaxa I've merged your latest changes and resolved the conflicts.
Thanks. Does it mean it is ready to merge from your side?
It works and all tests pass, but I guess it would make sense to run some performance-tests. Do you have existing tests for that?
Good point. Just run make performance-tests
(uses docker in the background) or composer performance-tests
(no docker). You can compare the results with and without your feature. I've done some performance tweaking on Lexer so that 0.8.0
will have about 150% performance of 0.7.*
(what took 10s will now take about 7s). So slight performance decrease of this feature will hopefully fit in so that it stays on the plus side.
Thanks for the hint! I've run composer performance-tests
10 - 20 times for each branch and didn't notice a significant difference.
Thanks. Can you, please, look at the CHANGELOG and accommodate the changes - which are coming with version 0.8.0 -accordingly?
My branch contains all changes from 0.8.0 now.
You are a very zealous coder. So before you commit anything, let's talk first. :) What if we added option pointers
only accepting arrays (in case of multiple pointers) and pointer
would keep only accepting strings to enhance clarity and readability to end-users?
We would have to add another array key to the "from" methods as well. Also, for the Parser class it doesn't make any difference.
What about adding a copy of the Items class - named something like DiverseItems or VaryingItems - instead of just a dedicated constructor argument?
We would have to add another array key to the "from" methods as well.
Yes, they are all checked in one place.
Also, for the
Parser
class, it doesn't make any difference.
Also yes - no problem. This would be just convenient naming in Items
options for the developers. I'd keep Parser
's constructor as it is.
Let's revive this idea:
What if we force users to provide the pointers in the order of their appearance in the document? The code would just sequentially advance to the next pointer when one finishes. The pointer would remain constant in a sense as the code would still be checking one pointer at a time. The code would get much simpler and no intersection checks would be necessary. If the document ends and some pointers remain a warning may be emitted. Another upside would be that the pointers could also be provided in a form of an endless iterator. What do you think?
Am I right that it would only require some change here and there and yet it would cover 99.9 % of the use cases? We would only advance to the next JSON pointer here if a next JSON pointer is available. Everything would be much much simpler.
I am fetching data from a REST endpoint that dynamically creates the data based on a last-modified timestamp. Unfortunately, in that case I don't know the order. However, I've tried to change it to check the performance, but I could not get it to work like that for multiple JSON pointers.
I could remove the check for intersecting JSON pointers, but this really doesn't cost much time/resources. However, I can add a check so that the intersection check is only executed if multiple pointers are provided.
Also, we only need to call the method getMatchingJsonPointerPath() if there are multiple JSON pointers. For huge JSON documents with only a single pointer this might actually make a difference. With those changes the performance should be the same as with your original code, unless there are multiple JSON pointers.
I was only reviving the idea to make the parser less complex. I wasn't concerned about the performance. But since your own use case would be the 0.01 %, let's go with it. Let me just do some final reviews ;)
Some additional small changes:
Items::getMatchedJsonPointer()
should stay, shouldn't it?
Items::getJsonPointer() calls Parser::getJsonPointer(), which returns the current JSON pointer.
Tomorrow :)
Items::getJsonPointer() calls Parser::getJsonPointer(), which returns the current JSON pointer.
Yes, however, that's a major change in behavior. Previously it was just a simple property getter. That's why I think it should be:
getJsonPointer(): string
- keep the behavior (deprecate the method?)
getJsonPointers(): array
- returns an array of all provided JSON pointers (simple property getter)
getCurrentJsonPointer(): string
- returns the currently iterated path in the document in JSON pointer format (null or Exception outside of a foreach
?)
getMatchedJsonPointer(): string
- returns currently matched JSON pointer - one of the provided ones (null or Exception outside of a foreach
?)
What do you think? Do better names come to mind?
Actually, I've broken backwards compatibility by returning an array of all JSON pointers in Parser::getJsonPointer(). With my latest changes Parser::getJsonPointer() returns the current JSON pointer as a string, just like before.
Current state:
Parser::getJsonPointer(): string
- returns the current JSON pointer (if there is only one it behaves just like before)
Parser::getJsonPointers(): array
- returns an array of all JSON pointers
Parser::getJsonPointerPath(): string
- returns the current JSON pointer path (if there is only one it behaves just like before)
Parser::getJsonPointerPaths(): array
- returns an array of all JSON pointers paths
To make it clear that the current JSON pointer [path] is returned we could deprecate the existing methods and add them with other names. I would keep the current state, but it is your project and therefor your call ;)
Parser::getJsonPointerPath(): string
was there only for testing reasons. It was not meant to provide any useful info as a part of the API. I think it just remained there as an artifact a remnant of initial TDD. It uncovers internal implementation and I guess no one needs this format for anything. Therefore I would not introduce Parser::getJsonPointerPaths(): array
and deprecate Parser::getJsonPointerPath(): string
.
Parser::getCurrentJsonPointer(): string
and Parser::getMatchedJsonPointer(): string
are a different story. They specifically say what exactly they do and are very useful in the foreach
when someone iterates over multiple JSON Pointers or uses -
as a wildcard. They tell you the current position and which JSON Pointer exactly matched this position. They also don't introduce a new format of JSON document path but rather provide already known JSON Pointer. That can be directly put back to the pointer
option of any from*
factories without any modification if one wishes.
Does it make sense like this?
Makes sense, I'll make the following changes then:
Awsome, thanks. Also, I think that Parser::getCurrentJsonPointer()
and Parser::getMatchedJsonPointer()
should throw if used outside of a loop.
I agree, both throw a JsonMachineException if called outside the foreach loop.
I've added 3 new tests, one to check the return values of the 2 new methods and two more to check for the exceptions. They have to be separate as checking the exeption doesn't work otherwise.
Thanks. I don't see Parser::getJsonPointers(): array
though. Did you push everything?
I have not added Parser::getJsonPointers()
as it would just return the same array that has been provided. If you want, I can add it though.
Yes, that's the goal. The users should be able to get what they put in. Thanks
I added dome edge case tests to the new getters and one of them is failing. Can you please extend the functionality so it conforms to the JSON Pointer standard? Root level iteration is actually an empty-string JSON Pointer. Thanks.
You are right, I've amended the method.
Not sure if you were notified. Can you check this comment, please?
Sorry, but I got no notification and can't see any comment.
No problem, I was just asking why did you remove ['/', '{"":{"c":1,"d":2}}', ['/', '/']]
test case in ParserTest
?
I also need the "unused" (not iterated) JSON, so I've extended your code to store this in a variable. I hope you don't mind.