h2non / jsonpath-ng

Finally, a JSONPath implementation for Python that aims to be standard compliant. That's all. Enjoy!
Apache License 2.0

Performance Improvement #144

Open isuhail-sray opened 11 months ago

isuhail-sray commented 11 months ago

Is there any way to improve performance/cache responses to make it faster to parse and query large json files?

michaelmior commented 10 months ago

Can you give an example of a specific performance problem you're having? jsonpath_ng doesn't parse JSON for you, but there are many faster parsers than Python's JSON module if that's your bottleneck. If you have a specific performance problem with jsonpath_ng, it would help to have more details.
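One way to answer that question before digging further is to time JSON decoding and the path work separately. The sketch below uses only the standard library; `timed` and `run_queries` are hypothetical names introduced for illustration, and `run_queries` is a stand-in for whatever JSONPath parsing and querying the application really does.

```python
import json
import time


def timed(label, func, *args):
    """Run func(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed_ms = 1000 * (time.perf_counter() - start)
    print(f"{label}: {elapsed_ms:.3f} ms")
    return result, elapsed_ms


def run_queries(document):
    # Hypothetical placeholder: replace with the real JSONPath parsing
    # and querying code you want to measure.
    return document["config"]["priority"]


raw = json.dumps({"id": 1, "config": {"priority": 5}})

# Stage 1: decoding the JSON text (not done by jsonpath_ng).
document, decode_ms = timed("json.loads", json.loads, raw)

# Stage 2: everything that happens after decoding.
value, query_ms = timed("queries", run_queries, document)
```

If the first number dominates, a faster JSON decoder is the thing to look at; if the second dominates, the path handling is the place to profile.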

jpetersen23 commented 9 months ago

I used your library to write a CSV-to-JSON converter with the row headers being JSONPath expressions. It worked well except for the performance. I'm probably using it wrong, but the parse step looks particularly expensive, and I also have a lot of queries. I profiled it with cProfile and used snakeviz to display the result: tmp.prof.zip
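For readers who want to reproduce this kind of measurement, here is a minimal sketch of profiling a single call with the standard-library `cProfile` and `pstats` modules. `profile_call` and `convert_rows` are hypothetical names; `convert_rows` is only a stand-in for the real conversion code, which was not shared.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10):
    """Profile func(*args); return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def convert_rows(rows):
    # Hypothetical stand-in for the CSV-to-JSON conversion being profiled.
    return [{"value": row} for row in rows]


result, report = profile_call(convert_rows, range(1000))
print(report)
```

To inspect the same data in snakeviz, the profiler's `dump_stats("tmp.prof")` method writes a file that can be opened with `snakeviz tmp.prof`.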

jpetersen23 commented 9 months ago

It looks to be coming from this:

def parse_token_stream(self, token_iterator, start_symbol='jsonpath'):

    # Since PLY has some crufty aspects and dumps files, we try to keep them local
    # However, we need to derive the name of the output Python file :-/
    output_directory = os.path.dirname(__file__)
    try:
        module_name = os.path.splitext(os.path.split(__file__)[1])[0]
    except:
        module_name = __name__

    parsing_table_module = '_'.join([module_name, start_symbol, 'parsetab'])

    # And we regenerate the parse table every time;
    # it doesn't actually take that long!
    new_parser = ply.yacc.yacc(module=self,
                               debug=self.debug,
                               tabmodule = parsing_table_module,
                               outputdir = output_directory,
                               write_tables=0,
                               start = start_symbol,
                               errorlog = logger)

    return new_parser.parse(lexer = IteratorToTokenStream(token_iterator))
jpetersen23 commented 9 months ago

I did more profiling to see whether specific queries were expensive. In fact, I'm running 80 path queries, and each of them takes about 34 ms.

In total that ends up being ~2765.30 ms.

michaelmior commented 9 months ago

@jpetersen23 Can you post the code that's giving you performance problems?

jpetersen23 commented 9 months ago

I can't share my actual code or data, but I made a toy example based on them.

from jsonpath_ng.ext import parse
import time

pairs = [
    ("$.metadata.content_release_version", "taco"),
    ("$.id", "taco"),
    ("$.config.priority", "taco"),
    ("$.created_at", "taco"),
    ("$.update_at", "taco"),
    ("$.event_type", "taco"),
    ("$.event_state", "taco"),
    ("$.config.requires_one_of.token[0].thingy_id", "taco"),
    ("$.config.requires_one_of.token[0].amount", "taco"),
    ("$.config.asset_map.event_icon", "taco"),
    ("$.config.asset_map.key_art", "taco"),
    ("$.config.loc_map.desc.namespace", "taco"),
    ("$.config.loc_map.desc.key", "taco"),
    ("$.config.loc_map.title.namespace", "taco"),
    ("$.config.loc_map.title.key", "taco"),
    ("$.config.loc_map.something_desc.namespace", "taco"),
    ("$.config.loc_map.something_desc.key", "taco"),
    ("$.config.challenges.BANANAS_01.event_progress", "taco"),
    ("$.config.challenges.BANANAS_02.event_progress", "taco"),
    ("$.config.challenges.BANANAS_03.event_progress", "taco"),
    ("$.config.challenges.BANANAS_04.event_progress", "taco"),
    ("$.config.challenges.BANANAS_05.event_progress", "taco"),
    ("$.config.challenges.BANANAS_06.event_progress", "taco"),
    ("$.config.challenges.BANANAS_07.event_progress", "taco"),
    ("$.config.challenges.BANANAS_08.event_progress", "taco"),
    ("$.config.challenges.BANANAS_09.event_progress", "taco"),
    ("$.config.challenges.BANANAS_10.event_progress", "taco"),
    ("$.config.challenges.BANANAS_11.event_progress", "taco"),
    ("$.config.challenges.BANANAS_12.event_progress", "taco"),
    ("$.config.challenges.BANANAS_13.event_progress", "taco"),
    ("$.config.challenges.BANANAS_14.event_progress", "taco"),
    ("$.config.challenges.BANANAS_15.event_progress", "taco"),
    ("$.config.challenges.BANANAS_16.event_progress", "taco"),
    ("$.config.challenges.BANANAS_17.event_progress", "taco"),
    ("$.config.challenges.BANANAS_18.event_progress", "taco"),
    ("$.config.challenges.BANANAS_19.event_progress", "taco"),
    ("$.config.challenges.BANANAS_20.event_progress", "taco"),
    ("$.config.challenges.BANANAS_01.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_02.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_03.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_04.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_05.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_06.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_07.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_08.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_09.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_10.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_11.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_12.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_13.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_14.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_15.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_16.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_17.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_18.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_19.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_20.auto_assign", "taco"),
    ("$.config.tiers.\"00\".threshold", "taco"),
    ("$.config.tiers.\"00\".array_type[0].thingy_id", "taco"),
    ("$.config.tiers.\"00\".array_type[0].amount", "taco"),
    ("$.config.tiers.\"01\"", "taco"),
    ("$.config.tiers.\"02\"", "taco"),
    ("$.config.tiers.\"03\"", "taco"),
    ("$.config.tiers.\"04\"", "taco"),
    ("$.config.tiers.\"05\"", "taco"),
    ("$.config.tiers.\"06\"", "taco"),
    ("$.config.tiers.\"07\"", "taco"),
    ("$.config.tiers.\"08\"", "taco"),
    ("$.config.tiers.\"09\"", "taco"),
    ("$.config.tiers.\"10\"", "taco"),
    ("$.config.tiers.\"11\"", "taco"),
    ("$.config.tiers.\"12\"", "taco"),
    ("$.config.tiers.\"13\"", "taco"),
    ("$.config.tiers.\"14\"", "taco"),
    ("$.config.tiers.\"15\"", "taco"),
    ("$.config.tiers.\"16\"", "taco"),
    ("$.config.tiers.\"17\"", "taco"),
    ("$.config.tiers.\"18\"", "taco"),
    ("$.config.tiers.\"19\"", "taco")
]

json_output = {}
parse_total_time = 0
start_time = time.process_time()
for pair in pairs:
    parse_start_time = time.process_time()
    jsonpath_expr = parse(pair[0])
    duration = 1000 * (time.process_time() - parse_start_time)
    parse_total_time += duration

    jsonpath_expr.update_or_create(json_output, pair[1])

total_time = 1000 * (time.process_time() - start_time)

print(f"Parse Time: {parse_total_time}ms. Total Time: {total_time}ms")

Here is the cProfile output for one run of it: tmp.prof.zip

Parse Time: 2665.3660000000023ms. Total Time: 2672.836ms

I also made a follow-up test comparing it to a Python jq setup: jpath_toy_example_with_jq.zip

The Python jq version produced equivalent JSON, with the following times: Parse Time: 133.58500000000004ms. Total Time: 142.75500000000002ms

lukasjesche commented 9 months ago

Not sure how viable this is, but I changed the parser class to set up the parse table only once and then reused the parser. Using the posted example code, the time went from Parse Time: 524.3853999999981ms. Total Time: 528.6647000000003ms down to Parse Time: 33.61489999999634ms. Total Time: 41.74550000000021ms.

michaelmior commented 7 months ago

@lukasjesche That should work, although it also requires slight changes to the example code. Instead of calling the parse function each time, you should import ExtentedJsonPathParser (not a typo; the class name is unfortunately misspelled) and reuse a single instance. Try this on the reuse-parse-table branch. For me it gives over a 20x speedup on the example.

martkopecky commented 7 months ago

Wow, great findings here. Having read this thread, I decided to start caching my parsers where applicable and went down from 14 minutes processing time to 7 seconds.
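The exact caching setup was not posted, so the following is only a sketch of one common approach: memoizing the parse function with `functools.lru_cache` so that each distinct path string is parsed once. `make_cached_parser` and `slow_parse` are hypothetical names; `slow_parse` stands in for an expensive parser so the example runs with the standard library alone.

```python
from functools import lru_cache


def make_cached_parser(parse_func, maxsize=None):
    """Wrap parse_func so each distinct path string is parsed only once."""
    @lru_cache(maxsize=maxsize)
    def cached_parse(path):
        return parse_func(path)
    return cached_parse


call_count = 0


def slow_parse(path):
    # Hypothetical stand-in for an expensive parser; it just splits the path.
    global call_count
    call_count += 1
    return tuple(path.lstrip("$.").split("."))


cached_parse = make_cached_parser(slow_parse)

# The same path is requested three times but parsed only once.
for _ in range(3):
    expr = cached_parse("$.config.priority")

print(call_count)
```

With jsonpath_ng, the wrapped function would be the library's own `parse`. This only helps when the same path strings recur, for example the same headers applied to many rows or files; in the toy example above every path is distinct, so a cache alone would not speed up a single run.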

evert061 commented 6 months ago

Quoting @lukasjesche: "Not sure how viable this is, but I changed the parser class to set up the parse table only once and then reused the parser. Using the posted example code, the time went from Parse Time: 524.3853999999981ms. Total Time: 528.6647000000003ms down to Parse Time: 33.61489999999634ms. Total Time: 41.74550000000021ms."

Hi @lukasjesche, I'm facing a similar performance issue. Could you please post the refactoring you did for this example?

lukasjesche commented 6 months ago

@evert061 I just extended the JsonPathParser class, as in this commit: https://github.com/h2non/jsonpath-ng/commit/0e20f3dfd433f081f77cbc952300776ccafb4923