open-contracting / lib-cove-oc4ids

A data review library for the Open Contracting for Infrastructure Data Standards (OC4IDS)

Performance analysis #23

Closed jpmckinney closed 4 years ago

jpmckinney commented 4 years ago

To address issues from CRM-6536. data.json is attached to that issue.

I ran:

pip install memory_profiler matplotlib
time mprof run libcoveoc4ids data.json
mprof plot

time's output is:

Executed in   55.89 secs   fish           external 
   usr time   43.71 secs  791.77 millis   42.92 secs 
   sys time    3.10 secs  373.29 millis    2.73 secs 

mprof plot output:

[Image: screenshot of the mprof plot, 2020-10-02 12:06:18 PM]
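
For a rough equivalent of `time` plus a peak-memory figure without installing anything, the standard library can measure a child process. This is only a sketch under assumptions: `run_and_measure` is a hypothetical helper, the child command is a placeholder for the real CLI invocation, and the `resource` module is Unix-only.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(command):
    """Run command; return (wall-clock seconds, peak RSS of waited-for children)."""
    start = time.perf_counter()
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_rss


# Placeholder child process; substitute the real command line here.
elapsed, peak_rss = run_and_measure([sys.executable, "-c", "x = list(range(1_000_000))"])
print(f"{elapsed:.2f} s, peak RSS {peak_rss} (platform-dependent units)")
```

Unlike mprof, this gives a single peak value rather than a plot over time, so it is only useful for before/after comparisons.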

Next, I'll use cProfile to identify which functions to add the @profile decorator to:

python -m cProfile -o code.prof libcoveoc4ids/cli/__main__.py -o /dev/null data.json
gprof2dot -f pstats code.prof | dot -Tpng -o output.png
open output.png

As documented at https://ocp-software-handbook.readthedocs.io/en/latest/python/performance.html
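
The same kind of profile can also be collected and read from within Python, without gprof2dot. A minimal sketch, assuming a stand-in workload: `parse_and_count` and `profile_call` are hypothetical names and are not part of lib-cove-oc4ids.

```python
import cProfile
import io
import json
import pstats


def parse_and_count(text):
    """Hypothetical stand-in workload: parse JSON and count nested items."""
    data = json.loads(text)
    return sum(len(project["documents"]) for project in data["projects"])


def profile_call(func, *args):
    """Run func under cProfile; return (result, text report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show only the ten entries with the highest cumulative time.
    stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(10)
    return result, stream.getvalue()


sample = json.dumps({"projects": [{"documents": [{"id": str(i)}] * 3} for i in range(1000)]})
count, report = profile_call(parse_and_count, sample)
print(report)
```

Functions near the top of the cumulative-time listing are the candidates for line-by-line memory profiling with @profile.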

jpmckinney commented 4 years ago

After #24, most of the time is spent in common_checks_context from lib-cove, so it is not something this repository can fix.

[Image: output.png]

jpmckinney commented 4 years ago

Just loading the 50 MB JSON file takes a lot of memory (about 205 MiB):

Filename: test.py

Line #    Mem usage    Increment   Line Contents
================================================
     3   12.109 MiB   12.109 MiB   @profile
     4                             def main():
     5   12.109 MiB    0.000 MiB       with open("data.json") as f:
     6  217.312 MiB  205.203 MiB           json.load(f)
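
The gap between file size and memory use is expected, because each parsed value becomes a Python object with its own overhead. A minimal sketch of how to measure this with the standard library's tracemalloc, assuming a small synthetic document rather than data.json (`measure_json_load` is a hypothetical helper):

```python
import json
import tracemalloc


def measure_json_load(text):
    """Return (parsed data, peak bytes allocated by Python while parsing)."""
    tracemalloc.start()
    try:
        data = json.loads(text)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return data, peak


# Synthetic stand-in for data.json: many small objects.
text = json.dumps(
    {"projects": [{"id": f"p-{i}", "title": f"Project {i}", "sectors": ["a", "b"]} for i in range(20000)]}
)
data, peak = measure_json_load(text)
print(f"serialized: {len(text) / 2**20:.1f} MiB, peak while parsing: {peak / 2**20:.1f} MiB")
```

tracemalloc only counts allocations made through Python's allocators, so its figures will not match memory_profiler's process-level numbers exactly.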

jpmckinney commented 4 years ago

python -m memory_profiler libcoveoc4ids/cli/__main__.py --compact data.json

Filename: libcoveoc4ids/cli/__main__.py

Line #    Mem usage    Increment   Line Contents
================================================
     8   53.062 MiB   53.062 MiB   @profile
     9                             def main():
    10   53.062 MiB    0.000 MiB       parser = argparse.ArgumentParser(description='Lib Cove OC4IDS CLI')
    11   53.062 MiB    0.000 MiB       parser.add_argument("filename")
    12   53.062 MiB    0.000 MiB       parser.add_argument("-c", "--compact", action="store_true", help="compact instead of pretty-printed output")
    13                             
    14   53.070 MiB    0.008 MiB       args = parser.parse_args()
    15                             
    16   53.074 MiB    0.004 MiB       cove_temp_folder = tempfile.mkdtemp(prefix='lib-cove-oc4ids-cli-', dir=tempfile.gettempdir())
    17   53.074 MiB    0.000 MiB       try:
    18   53.074 MiB    0.000 MiB           result = oc4ids_json_output(
    19   53.074 MiB    0.000 MiB               cove_temp_folder,
    20   53.074 MiB    0.000 MiB               args.filename,
    21  326.312 MiB  273.238 MiB               file_type='json'
    22                                     )
    23                                 finally:
    24  326.316 MiB    0.004 MiB           shutil.rmtree(cove_temp_folder)
    25                             
    26  326.316 MiB    0.000 MiB       kwargs = {}
    27  326.316 MiB    0.000 MiB       if not args.compact:
    28                                     if using_orjson:
    29                                         kwargs['option'] = jsonlib.OPT_INDENT_2
    30                                     else:
    31                                         kwargs['indent'] = 2
    32                             
    33  380.668 MiB   54.352 MiB       output = jsonlib.dumps(result, **kwargs)
    34                             
    35  380.668 MiB    0.000 MiB       if using_orjson:
    36  468.285 MiB   87.617 MiB           output = output.decode('utf-8')
    37                             
    38  518.566 MiB   50.281 MiB       print(output)

jpmckinney commented 4 years ago

Added an -o option to eliminate the time and memory spent on decoding and printing:

[Image: output.png]

Before (when using orjson, bytes need to be decoded to string to be printed):

[Image: screenshot, 2020-10-02 3:10:43 PM]

After:

[Image: screenshot, 2020-10-02 3:10:11 PM]

python -m memory_profiler libcoveoc4ids/cli/__main__.py --compact data.json

Filename: libcoveoc4ids/cli/__main__.py

Line #    Mem usage    Increment   Line Contents
================================================
     8   53.109 MiB   53.109 MiB   @profile
     9                             def main():
    10   53.113 MiB    0.004 MiB       parser = argparse.ArgumentParser(description='Lib Cove OC4IDS CLI')
    11   53.113 MiB    0.000 MiB       parser.add_argument("filename")
    12   53.113 MiB    0.000 MiB       parser.add_argument("-c", "--compact", action="store_true", help="compact instead of pretty-printed output")
    13   53.113 MiB    0.000 MiB       parser.add_argument("-o", "--output", help="write output to the given file instead of standard output")
    14                             
    15   53.121 MiB    0.008 MiB       args = parser.parse_args()
    16                             
    17   53.121 MiB    0.000 MiB       cove_temp_folder = tempfile.mkdtemp(prefix='lib-cove-oc4ids-cli-', dir=tempfile.gettempdir())
    18   53.121 MiB    0.000 MiB       try:
    19   53.121 MiB    0.000 MiB           result = oc4ids_json_output(
    20   53.121 MiB    0.000 MiB               cove_temp_folder,
    21   53.121 MiB    0.000 MiB               args.filename,
    22  320.234 MiB  267.113 MiB               file_type='json'
    23                                     )
    24                                 finally:
    25  320.305 MiB    0.070 MiB           shutil.rmtree(cove_temp_folder)
    26                             
    27  320.305 MiB    0.000 MiB       kwargs = {}
    28                             
    29  320.305 MiB    0.000 MiB       if using_orjson:
    30  320.305 MiB    0.000 MiB           kwargs['option'] = 0
    31  320.305 MiB    0.000 MiB           if not args.compact:
    32                                         kwargs['option'] |= jsonlib.OPT_INDENT_2
    33  320.305 MiB    0.000 MiB           if args.output:
    34  320.305 MiB    0.000 MiB               kwargs['option'] |= jsonlib.OPT_APPEND_NEWLINE
    35                             
    36  375.785 MiB   55.480 MiB           output = jsonlib.dumps(result, **kwargs)
    37  375.785 MiB    0.000 MiB           if args.output:
    38  375.785 MiB    0.000 MiB               with open(args.output, 'wb') as f:
    39  375.785 MiB    0.000 MiB                   f.write(output)
    40                                     else:
    41                                         print(output.decode())
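
The saving comes from writing the serialized bytes straight to a file instead of decoding them into a second, string copy for printing. A minimal sketch of that pattern, assuming the standard json module as a stand-in: the output is encoded to bytes to imitate orjson.dumps, which returns bytes, and `write_result` is a hypothetical name.

```python
import json
import os
import tempfile


def write_result(result, output_path=None, compact=False):
    """Serialize result; write bytes to output_path, or print to stdout if None."""
    indent = None if compact else 2
    # Stand-in for orjson.dumps, which returns bytes rather than str.
    output = json.dumps(result, indent=indent).encode("utf-8")
    if output_path:
        # Binary mode: the bytes go to disk as-is, with no decoded copy in memory.
        with open(output_path, "wb") as f:
            f.write(output)
            f.write(b"\n")
    else:
        # Printing needs a str, so the whole payload is decoded first.
        print(output.decode("utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result.json")
    write_result({"file_type": "json", "validation_errors": []}, output_path=path, compact=True)
    with open(path, "rb") as f:
        written = f.read()
```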

jpmckinney commented 4 years ago

After https://github.com/OpenDataServices/lib-cove/pull/60, get_json_data_generic_paths goes from 10% to 3%:

[Image: output.png]
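
Percentages like these can also be read from a profile programmatically. A minimal sketch, assuming Python 3.9 or later for `Stats.get_stats_profile()`; `slow_part`, `fast_part`, `workload` and `cumulative_share` are hypothetical names used only for illustration.

```python
import cProfile
import pstats


def slow_part(n):
    """Hypothetical hot spot."""
    return sum(i * i for i in range(n))


def fast_part(n):
    """Hypothetical cheap step."""
    return n + 1


def workload():
    return slow_part(200_000) + fast_part(1)


def cumulative_share(profiler, function_name):
    """Fraction of total profiled time spent in function_name and its callees."""
    profile = pstats.Stats(profiler).get_stats_profile()
    return profile.func_profiles[function_name].cumtime / profile.total_tt


profiler = cProfile.Profile()
profiler.runcall(workload)
for name in ("slow_part", "fast_part"):
    print(f"{name}: {cumulative_share(profiler, name):.0%}")
```

Note that `func_profiles` is keyed by bare function name, so two functions with the same name in different modules would collide; for real code, the gprof2dot graph is less ambiguous.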

jpmckinney commented 4 years ago

After https://github.com/OpenDataServices/lib-cove/pull/61 and https://github.com/OpenDataServices/lib-cove/pull/62, get_fields_present_with_examples takes less time:

[Image: output.png]

jpmckinney commented 4 years ago

Now focusing on jsonschema. Fixing is_type yields:

[Image: output.png]

jpmckinney commented 4 years ago

The script has a baseline of 53 MiB, jsonlib.loads adds 229 MiB, then common_checks_oc4ids adds the rest (about 45 MiB, for a total of 327 MiB).

Filename: /Users/james/Sites/remote/open-contracting/lib-cove-oc4ids/libcoveoc4ids/api.py

Line #    Mem usage    Increment   Line Contents
================================================
    23   53.164 MiB   53.164 MiB   @profile
    24                             def oc4ids_json_output(output_dir, file, file_type=None, json_data=None,
    25                                                    lib_cove_oc4ids_config=None):
    26                             
    27   53.164 MiB    0.000 MiB       if not lib_cove_oc4ids_config:
    28   53.164 MiB    0.000 MiB           lib_cove_oc4ids_config = LibCoveOC4IDSConfig()
    29                             
    30   53.164 MiB    0.000 MiB       if not file_type:
    31                                     file_type = get_file_type(file)
    32   53.164 MiB    0.000 MiB       context = {"file_type": file_type}
    33                             
    34   53.164 MiB    0.000 MiB       if file_type == 'json':
    35   53.164 MiB    0.000 MiB           if not json_data:
    36   53.164 MiB    0.000 MiB               if using_orjson:
    37   53.164 MiB    0.000 MiB                   kwargs = {'mode': 'rb'}
    38                                         else:
    39                                             kwargs = {'encoding': 'utf-8'}
    40   53.164 MiB    0.000 MiB               with open(file, **kwargs) as fp:
    41   53.164 MiB    0.000 MiB                   try:
    42   53.164 MiB    0.000 MiB                       if using_orjson:
    43  282.121 MiB  228.957 MiB                           json_data = jsonlib.loads(fp.read())
    44                                                 else:
    45                                                     json_data = jsonlib.load(fp)
    46                                             except ValueError:
    47                                                 raise APIException('The file looks like invalid json')
    48                             
    49  282.125 MiB    0.004 MiB           schema_oc4ids = SchemaOC4IDS(lib_cove_oc4ids_config=lib_cove_oc4ids_config)
    50                             
    51                                 else:
    52                             
    53                                     raise Exception("JSON only for now, sorry!")
    54                             
    55  282.125 MiB    0.000 MiB       context = common_checks_oc4ids(
    56  282.125 MiB    0.000 MiB           context,
    57  282.125 MiB    0.000 MiB           output_dir,
    58  282.125 MiB    0.000 MiB           json_data,
    59  282.125 MiB    0.000 MiB           schema_oc4ids,
    60  327.008 MiB  327.008 MiB           lib_cove_oc4ids_config=lib_cove_oc4ids_config)
    61                             
    62  327.008 MiB    0.000 MiB       return context

Filename: /Users/james/Sites/remote/opendataservices/lib-cove/libcove/lib/common.py

Line #    Mem usage    Increment   Line Contents
================================================
   637  285.719 MiB  285.719 MiB   @profile
   638                             def get_schema_validation_errors(
   639                                 json_data,
   640                                 schema_obj,
   641                                 schema_name,
   642                                 cell_src_map,
   643                                 heading_src_map,
   644                                 extra_checkers=None,
   645                             ):
   646  285.719 MiB    0.000 MiB       pkg_schema_obj = schema_obj.get_pkg_schema_obj()
   647                             
   648  285.719 MiB    0.000 MiB       validation_errors = collections.defaultdict(list)
   649  285.719 MiB    0.000 MiB       format_checker = FormatChecker()
   650  285.719 MiB    0.000 MiB       if extra_checkers:
   651                                     format_checker.checkers.update(extra_checkers)
   652                             
   653  285.719 MiB    0.000 MiB       if getattr(schema_obj, "extended", None):
   654                                     resolver = CustomRefResolver(
   655                                         "",
   656                                         pkg_schema_obj,
   657                                         config=schema_obj.config,
   658                                         schema_url=schema_obj.schema_host,
   659                                         schema_file=schema_obj.extended_schema_file,
   660                                         file_schema_name=schema_obj.schema_name,
   661                                     )
   662                                 else:
   663  285.719 MiB    0.000 MiB           resolver = CustomRefResolver(
   664  285.719 MiB    0.000 MiB               "",
   665  285.719 MiB    0.000 MiB               pkg_schema_obj,
   666  285.719 MiB    0.000 MiB               config=schema_obj.config,
   667  285.719 MiB    0.000 MiB               schema_url=schema_obj.schema_host,
   668                                     )
   669                             
   670  285.719 MiB    0.000 MiB       our_validator = validator(
   671  285.719 MiB    0.000 MiB           pkg_schema_obj, format_checker=format_checker, resolver=resolver
   672                                 )
   673  308.219 MiB    0.902 MiB       for e in our_validator.iter_errors(json_data):
   674  308.219 MiB    0.004 MiB           message_safe = None
   675  308.219 MiB    0.000 MiB           message = e.message
   676  308.219 MiB    0.004 MiB           path = "/".join(str(item) for item in e.path)
   677  308.219 MiB    0.000 MiB           path_no_number = "/".join(
   678  308.219 MiB    0.004 MiB               str(item) for item in e.path if not isinstance(item, int)
   679                                     )
   680                             
   681  308.219 MiB    0.000 MiB           value = {"path": path}
   682  308.219 MiB    0.000 MiB           cell_reference = cell_src_map.get(path)
   683                             
   684  308.219 MiB    0.000 MiB           if cell_reference:
   685                                         first_reference = cell_reference[0]
   686                                         if len(first_reference) == 4:
   687                                             (
   688                                                 value["sheet"],
   689                                                 value["col_alpha"],
   690                                                 value["row_number"],
   691                                                 value["header"],
   692                                             ) = first_reference
   693                                         if len(first_reference) == 2:
   694                                             value["sheet"], value["row_number"] = first_reference
   695                             
   696  308.219 MiB    0.000 MiB           header = value.get("header")
   697                             
   698  308.219 MiB    0.000 MiB           header_extra = None
   699  308.219 MiB    0.000 MiB           pre_header = ""
   700                             
   701  308.219 MiB    0.000 MiB           if not header and len(e.path):
   702  308.219 MiB    0.000 MiB               header = e.path[-1]
   703  308.219 MiB    0.000 MiB               if isinstance(e.path[-1], int) and len(e.path) >= 2:
   704                                             # We're dealing with elements in an array of items at this point
   705  300.898 MiB    0.000 MiB                   pre_header = "Array Element "
   706  300.898 MiB    0.004 MiB                   header_extra = "{}/[number]".format(e.path[-2])
   707                             
   708  308.219 MiB    0.000 MiB           null_clause = ""
   709  308.219 MiB    0.000 MiB           validator_type = e.validator
   710  308.219 MiB    0.000 MiB           if e.validator in ("format", "type"):
   711  308.219 MiB    0.000 MiB               validator_type = e.validator_value
   712  308.219 MiB    0.000 MiB               if isinstance(e.validator_value, list):
   713  308.219 MiB    0.000 MiB                   validator_type = e.validator_value[0]
   714  308.219 MiB    0.000 MiB                   if "null" not in e.validator_value:
   715  308.219 MiB    0.000 MiB                       null_clause = "is not null, and"
   716                                         else:
   717  308.219 MiB    0.000 MiB                   null_clause = "is not null, and"
   718                             
   719  308.219 MiB    0.000 MiB               message_template = validation_error_template_lookup.get(
   720  308.219 MiB    0.000 MiB                   validator_type, message
   721                                         )
   722  308.219 MiB    0.000 MiB               message_safe_template = validation_error_template_lookup_safe.get(
   723  308.219 MiB    0.000 MiB                   validator_type
   724                                         )
   725                             
   726  308.219 MiB    0.000 MiB               if message_template:
   727  308.219 MiB    0.004 MiB                   message = message_template.format(pre_header, header, null_clause)
   728                             
   729  308.219 MiB    0.000 MiB               if message_safe_template:
   730  308.219 MiB    0.000 MiB                   message_safe = format_html(
   731  308.219 MiB    0.008 MiB                       message_safe_template, pre_header, header, null_clause
   732                                             )
   733                             
   734  308.219 MiB    0.004 MiB           if e.validator == "oneOf" and e.validator_value[0] == {"format": "date-time"}:
   735                                         # Give a nice date related error message for 360Giving date `oneOf`s.
   736                                         message = validation_error_template_lookup["date-time"]
   737                                         message_safe = format_html(
   738                                             validation_error_template_lookup_safe["date-time"]
   739                                         )
   740                                         validator_type = "date-time"
   741                             
   742  308.219 MiB    0.000 MiB           if not isinstance(e.instance, (dict, list)):
   743  308.219 MiB    0.000 MiB               value["value"] = e.instance
   744                             
   745  308.219 MiB    0.000 MiB           if e.validator == "required":
   746  300.898 MiB    0.000 MiB               field_name = e.message
   747  300.898 MiB    0.000 MiB               parent_name = None
   748  300.898 MiB    0.000 MiB               if len(e.path) > 2:
   749  300.898 MiB    0.000 MiB                   if isinstance(e.path[-1], int):
   750  300.898 MiB    0.000 MiB                       parent_name = e.path[-2]
   751                                             else:
   752                                                 parent_name = e.path[-1]
   753                             
   754  300.898 MiB    0.000 MiB               heading = heading_src_map.get(path_no_number + "/" + e.message)
   755  300.898 MiB    0.000 MiB               if heading:
   756                                             field_name = heading[0][1]
   757                                             value["header"] = heading[0][1]
   758  300.898 MiB    0.000 MiB               header = field_name
   759  300.898 MiB    0.000 MiB               if parent_name:
   760  300.898 MiB    0.000 MiB                   message = "'{}' is missing but required within '{}'".format(
   761  300.898 MiB    0.004 MiB                       field_name, parent_name
   762                                             )
   763  300.898 MiB    0.004 MiB                   message_safe = format_html(
   764  300.898 MiB    0.000 MiB                       "<code>{}</code> is missing but required within <code>{}</code>",
   765  300.898 MiB    0.000 MiB                       field_name,
   766  300.898 MiB    0.008 MiB                       parent_name,
   767                                             )
   768                                         else:
   769                                             message = "'{}' is missing but required".format(field_name)
   770                                             message_safe = format_html(
   771                                                 "<code>{}</code> is missing but required", field_name, parent_name
   772                                             )
   773                             
   774  308.219 MiB    0.000 MiB           if e.validator == "enum":
   775  307.074 MiB    0.000 MiB               if "isCodelist" in e.schema:
   776                                             continue
   777  307.074 MiB    0.000 MiB               message = "Invalid code found in '{}'".format(header)
   778  307.074 MiB    0.004 MiB               message_safe = format_html("Invalid code found in <code>{}</code>", header)
   779                             
   780  308.219 MiB    0.000 MiB           if e.validator == "pattern":
   781                                         message_safe = format_html(
   782                                             "<code>{}</code> does not match the regex <code>{}</code>",
   783                                             header,
   784                                             e.validator_value,
   785                                         )
   786                             
   787  308.219 MiB    0.000 MiB           if e.validator == "minItems" and e.validator_value == 1:
   788                                         message_safe = format_html(
   789                                             "<code>{}</code> is too short. You must supply at least one value, or remove the item entirely (unless it’s required).",
   790                                             e.instance,
   791                                         )
   792                             
   793  308.219 MiB    0.000 MiB           if e.validator == "minLength" and e.validator_value == 1:
   794                                         message_safe = format_html(
   795                                             '<code>"{}"</code> is too short. Strings must be at least one character. This error typically indicates a missing value.',
   796                                             e.instance,
   797                                         )
   798                             
   799  308.219 MiB    0.000 MiB           if message_safe is None:
   800  308.219 MiB    0.000 MiB               message_safe = escape(message)
   801                             
   802  308.219 MiB    0.000 MiB           if header_extra is None:
   803  308.219 MiB    0.000 MiB               header_extra = header
   804                             
   805                                     unique_validator_key = {
   806  308.219 MiB    0.000 MiB               "message": message,
   807  308.219 MiB    0.000 MiB               "message_safe": conditional_escape(message_safe),
   808  308.219 MiB    0.000 MiB               "validator": e.validator,
   809  308.219 MiB    0.000 MiB               "assumption": e.assumption if hasattr(e, "assumption") else None,
   810                                         # Don't pass this value for 'enum' and 'required' validators,
   811                                         # because it is not needed, and it will mean less grouping, which
   812                                         # we don't want.
   813                                         "validator_value": e.validator_value
   814  308.219 MiB    0.000 MiB               if e.validator not in ["enum", "required"]
   815  307.074 MiB    0.000 MiB               else None,
   816  308.219 MiB    0.000 MiB               "message_type": validator_type,
   817  308.219 MiB    0.000 MiB               "path_no_number": path_no_number,
   818  308.219 MiB    0.000 MiB               "header": header,
   819  308.219 MiB    0.000 MiB               "header_extra": header_extra,
   820  308.219 MiB    0.000 MiB               "null_clause": null_clause,
   821  308.219 MiB    0.000 MiB               "error_id": e.error_id if hasattr(e, "error_id") else None,
   822                                     }
   823  308.219 MiB    0.031 MiB           validation_errors[json.dumps(unique_validator_key)].append(value)
   824  308.219 MiB    0.000 MiB       return dict(validation_errors)

jpmckinney commented 4 years ago

time python libcoveoc4ids/cli/__main__.py data.json > /dev/null

Got time down to:

Executed in   26.67 secs   fish           external 
   usr time   23.59 secs  227.00 micros   23.59 secs 
   sys time    0.81 secs  1608.00 micros    0.81 secs 

Originally:

________________________________________________________
Executed in   43.23 secs   fish           external 
   usr time   39.75 secs  211.00 micros   39.75 secs 
   sys time    1.10 secs  1505.00 micros    1.10 secs 

jpmckinney commented 4 years ago

Closing this issue. Memory usage was halved, and running time fell from 43.23 to 26.67 seconds (about 1.6 times as fast). In the comment above, oc4ids_json_output used 327 MiB, having started with 53 MiB and having used 229 MiB just to load the JSON. That leaves 45 MiB for lib-cove, which is small compared with the memory needed for the JSON data itself.
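
The figures in this summary can be checked with a few lines of arithmetic, using only numbers quoted earlier in this thread (rounded to whole MiB, as in the prose):

```python
# Figures quoted in this thread.
baseline_mib = 53      # memory before any data is loaded
json_load_mib = 229    # added by jsonlib.loads
total_mib = 327        # memory when oc4ids_json_output returns
before_secs = 43.23    # wall-clock time reported under "Originally"
after_secs = 26.67     # wall-clock time reported under "Got time down to"

lib_cove_mib = total_mib - baseline_mib - json_load_mib
time_saved = 1 - after_secs / before_secs
speedup = before_secs / after_secs

print(f"lib-cove overhead: {lib_cove_mib} MiB")
print(f"running time reduced by {time_saved:.0%}")
print(f"speed-up: {speedup:.2f}x")
```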

PRs are open in https://github.com/OpenDataServices/lib-cove/pulls/jpmckinney

Other follow-up issues are:

jpmckinney commented 4 years ago

Related earlier issue: https://github.com/OpenDataServices/cove/issues/579, which links to this document: https://docs.google.com/document/d/1DwFGVLjqTD3yfWg-7zKGsl4X5L2B2R7yAJkGCyoBLUA/edit