Open isuhail-sray opened 11 months ago
Can you give an example of a specific performance problem you're having? jsonpath_ng doesn't parse JSON for you, but there are many faster parsers than Python's json module if that's your bottleneck. If you have a specific performance problem with jsonpath_ng, it would help to have more details.
I used your library to write a CSV-to-JSON converter whose header row consists of JSONPath expressions. It worked well except for the performance. I'm probably using it wrong, but the parse step looks particularly expensive, and I have a lot of queries. (I profiled with cProfile and used snakeviz to display the result: tmp.prof.zip)
It looks to be coming from this:
def parse_token_stream(self, token_iterator, start_symbol='jsonpath'):
    # Since PLY has some crufty aspects and dumps files, we try to keep them local
    # However, we need to derive the name of the output Python file :-/
    output_directory = os.path.dirname(__file__)
    try:
        module_name = os.path.splitext(os.path.split(__file__)[1])[0]
    except:
        module_name = __name__
    parsing_table_module = '_'.join([module_name, start_symbol, 'parsetab'])
    # And we regenerate the parse table every time;
    # it doesn't actually take that long!
    new_parser = ply.yacc.yacc(module=self,
                               debug=self.debug,
                               tabmodule = parsing_table_module,
                               outputdir = output_directory,
                               write_tables=0,
                               start = start_symbol,
                               errorlog = logger)
    return new_parser.parse(lexer = IteratorToTokenStream(token_iterator))
I did more profiling to see whether specific queries were expensive, but in fact I'm running 80 path queries and each one takes about 34 ms, which adds up to roughly 2765.30 ms in total.
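A per-call breakdown like the one above can be collected with the standard library's cProfile and pstats modules. The sketch below is illustrative only: `profile_call` and `slow_sum` are made-up names, and `slow_sum` stands in for whatever call is being investigated (in this thread, the jsonpath parse call).

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=5, **kwargs):
    """Run func under cProfile; return (result, text report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "cumulative" puts the functions that dominate total run time at the top.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_sum(n):
    # Hypothetical workload standing in for the code under investigation.
    return sum(i * i for i in range(n))


result, report = profile_call(slow_sum, 10_000)
print(report)
```

The text report shows the same data that snakeviz visualizes; to get a file for snakeviz instead, `pstats.Stats.dump_stats` can write the profile to disk.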
@jpetersen23 Can you post the code that's giving you performance problems?
I can't share my actual code or data, but I made a toy example from my data/code.
from jsonpath_ng.ext import parse
import time
pairs = [
("$.metadata.content_release_version", "taco"),
("$.id", "taco"),
("$.config.priority", "taco"),
("$.created_at", "taco"),
("$.update_at", "taco"),
("$.event_type", "taco"),
("$.event_state", "taco"),
("$.config.requires_one_of.token[0].thingy_id", "taco"),
("$.config.requires_one_of.token[0].amount", "taco"),
("$.config.asset_map.event_icon", "taco"),
("$.config.asset_map.key_art", "taco"),
("$.config.loc_map.desc.namespace", "taco"),
("$.config.loc_map.desc.key", "taco"),
("$.config.loc_map.title.namespace", "taco"),
("$.config.loc_map.title.key", "taco"),
("$.config.loc_map.something_desc.namespace", "taco"),
("$.config.loc_map.something_desc.key", "taco"),
("$.config.challenges.BANANAS_01.event_progress", "taco"),
("$.config.challenges.BANANAS_02.event_progress", "taco"),
("$.config.challenges.BANANAS_03.event_progress", "taco"),
("$.config.challenges.BANANAS_04.event_progress", "taco"),
("$.config.challenges.BANANAS_05.event_progress", "taco"),
("$.config.challenges.BANANAS_06.event_progress", "taco"),
("$.config.challenges.BANANAS_07.event_progress", "taco"),
("$.config.challenges.BANANAS_08.event_progress", "taco"),
("$.config.challenges.BANANAS_09.event_progress", "taco"),
("$.config.challenges.BANANAS_10.event_progress", "taco"),
("$.config.challenges.BANANAS_11.event_progress", "taco"),
("$.config.challenges.BANANAS_12.event_progress", "taco"),
("$.config.challenges.BANANAS_13.event_progress", "taco"),
("$.config.challenges.BANANAS_14.event_progress", "taco"),
("$.config.challenges.BANANAS_15.event_progress", "taco"),
("$.config.challenges.BANANAS_16.event_progress", "taco"),
("$.config.challenges.BANANAS_17.event_progress", "taco"),
("$.config.challenges.BANANAS_18.event_progress", "taco"),
("$.config.challenges.BANANAS_19.event_progress", "taco"),
("$.config.challenges.BANANAS_20.event_progress", "taco"),
("$.config.challenges.BANANAS_01.auto_assign", "taco"),
("$.config.challenges.BANANAS_02.auto_assign", "taco"),
("$.config.challenges.BANANAS_03.auto_assign", "taco"),
("$.config.challenges.BANANAS_04.auto_assign", "taco"),
("$.config.challenges.BANANAS_05.auto_assign", "taco"),
("$.config.challenges.BANANAS_06.auto_assign", "taco"),
("$.config.challenges.BANANAS_07.auto_assign", "taco"),
("$.config.challenges.BANANAS_08.auto_assign", "taco"),
("$.config.challenges.BANANAS_09.auto_assign", "taco"),
("$.config.challenges.BANANAS_10.auto_assign", "taco"),
("$.config.challenges.BANANAS_11.auto_assign", "taco"),
("$.config.challenges.BANANAS_12.auto_assign", "taco"),
("$.config.challenges.BANANAS_13.auto_assign", "taco"),
("$.config.challenges.BANANAS_14.auto_assign", "taco"),
("$.config.challenges.BANANAS_15.auto_assign", "taco"),
("$.config.challenges.BANANAS_16.auto_assign", "taco"),
("$.config.challenges.BANANAS_17.auto_assign", "taco"),
("$.config.challenges.BANANAS_18.auto_assign", "taco"),
("$.config.challenges.BANANAS_19.auto_assign", "taco"),
("$.config.challenges.BANANAS_20.auto_assign", "taco"),
("$.config.tiers.\"00\".threshold", "taco"),
("$.config.tiers.\"00\".array_type[0].thingy_id", "taco"),
("$.config.tiers.\"00\".array_type[0].amount", "taco"),
("$.config.tiers.\"01\"", "taco"),
("$.config.tiers.\"02\"", "taco"),
("$.config.tiers.\"03\"", "taco"),
("$.config.tiers.\"04\"", "taco"),
("$.config.tiers.\"05\"", "taco"),
("$.config.tiers.\"06\"", "taco"),
("$.config.tiers.\"07\"", "taco"),
("$.config.tiers.\"08\"", "taco"),
("$.config.tiers.\"09\"", "taco"),
("$.config.tiers.\"10\"", "taco"),
("$.config.tiers.\"11\"", "taco"),
("$.config.tiers.\"12\"", "taco"),
("$.config.tiers.\"13\"", "taco"),
("$.config.tiers.\"14\"", "taco"),
("$.config.tiers.\"15\"", "taco"),
("$.config.tiers.\"16\"", "taco"),
("$.config.tiers.\"17\"", "taco"),
("$.config.tiers.\"18\"", "taco"),
("$.config.tiers.\"19\"", "taco")
]
json_output = {}
parse_total_time = 0
start_time = time.process_time()
for pair in pairs:
    parse_start_time = time.process_time()
    jsonpath_expr = parse(pair[0])
    duration = 1000 * (time.process_time() - parse_start_time)
    parse_total_time += duration
    jsonpath_expr.update_or_create(json_output, pair[1])
total_time = 1000 * (time.process_time() - start_time)
print(f"Parse Time: {parse_total_time}ms. Total Time: {total_time}ms")
Here is the cProfile output for a run of it: tmp.prof.zip
Parse Time: 2665.3660000000023ms. Total Time: 2672.836ms
I also made a follow-up test comparing it to a Python jq setup: jpath_toy_example_with_jq.zip
The Python jq version produced equivalent JSON, with the following times: Parse Time: 133.58500000000004ms. Total Time: 142.75500000000002ms
Not sure how viable this is, but I changed the parser class to set up the parse table only once and reuse the parser, and the time went from:
Parse Time: 524.3853999999981ms. Total Time: 528.6647000000003ms
down to:
Parse Time: 33.61489999999634ms. Total Time: 41.74550000000021ms
(using the posted example code)
@lukasjesche That should work, although it also requires slight changes to the example code. Instead of calling the parse function each time, you should import ExtentedJsonPathParser (not a typo, the class is unfortunately misspelled). Try this on the reuse-parse-table branch. For me it gives over a 20x speedup on the example.
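A minimal sketch of what that change to the toy example could look like. It assumes the parser class keeps its parse table between calls, as on the reuse-parse-table branch; with a parser that rebuilds the table inside parse_token_stream, reusing the instance alone will not help. `build_document` and its `parser` argument are illustrative names, not part of jsonpath_ng.

```python
def build_document(pairs, parser=None):
    """Apply (path, value) pairs to a fresh dict using one shared parser instance."""
    if parser is None:
        # Imported lazily so the sketch can also be exercised with a stand-in parser.
        # The class name really is spelled "Extented" in jsonpath_ng.
        from jsonpath_ng.ext.parser import ExtentedJsonPathParser
        parser = ExtentedJsonPathParser()

    document = {}
    for path, value in pairs:
        # The same parser object handles every path instead of a new one per call.
        expression = parser.parse(path)
        expression.update_or_create(document, value)
    return document
```

With the toy example above, this would be called as `build_document(pairs)` in place of the loop that calls `parse(pair[0])`.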
Wow, great findings here. Having read this thread, I decided to start caching my parsers where applicable and went down from 14 minutes processing time to 7 seconds.
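One way to do this kind of caching, sketched with the standard library's functools.lru_cache. It helps when the same path strings recur, for example CSV headers that are reused on every row; it does nothing for a single pass over all-distinct paths. `make_cached_parse` is a hypothetical helper, and `fake_parse` is a stand-in for `jsonpath_ng.ext.parse` so the block runs without the library installed.

```python
from functools import lru_cache


def make_cached_parse(parse_fn, maxsize=None):
    """Return a version of parse_fn that compiles each distinct path string once."""
    return lru_cache(maxsize=maxsize)(parse_fn)


# Stand-in for an expensive parse function; records every real invocation.
parse_calls = []


def fake_parse(path):
    parse_calls.append(path)
    return tuple(path.lstrip("$.").split("."))


cached_parse = make_cached_parse(fake_parse)

headers = ["$.id", "$.config.priority"]
for _row in range(1000):  # e.g. one iteration per CSV row
    for header in headers:
        cached_parse(header)

print(len(parse_calls))  # number of real parses actually performed
```

With the real library, the equivalent would be `cached_parse = make_cached_parse(parse)` after `from jsonpath_ng.ext import parse`, assuming the parsed expression objects are not modified between uses.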
Hi @lukasjesche, I'm facing a similar performance issue. Could you please post the refactoring you did for this example?
@evert061 I just extended the JsonPathParser class, as in this commit: https://github.com/h2non/jsonpath-ng/commit/0e20f3dfd433f081f77cbc952300776ccafb4923
Is there any way to improve performance/cache responses to make it faster to parse and query large json files?