Closed jpmckinney closed 4 years ago
After #24, most time is spent in common_checks_context from lib-cove, so not an issue for this repo to fix.
Just loading a 50 MB JSON file takes a lot of memory:
Filename: test.py
Line # Mem usage Increment Line Contents
================================================
3 12.109 MiB 12.109 MiB @profile
4 def main():
5 12.109 MiB 0.000 MiB with open("data.json") as f:
6 217.312 MiB 205.203 MiB json.load(f)
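For anyone who wants to reproduce this kind of measurement without installing memory_profiler, here is a minimal sketch using the standard library's tracemalloc instead. It is not the test.py used above. The sample data is a hypothetical stand-in for data.json, and tracemalloc counts Python allocations rather than process memory, so the numbers will not match memory_profiler's.

```python
import json
import tracemalloc


def measure_json_load(text):
    """Return (parsed data, peak bytes allocated by Python while parsing)."""
    tracemalloc.start()
    try:
        data = json.loads(text)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return data, peak


# Hypothetical sample: a list of small objects standing in for data.json.
sample = json.dumps([{"id": i, "title": f"Project {i}"} for i in range(10_000)])
data, peak = measure_json_load(sample)
print(f"text: {len(sample) / 1024:.0f} KiB, peak while parsing: {peak / 1024:.0f} KiB")
```

On CPython the parsed dicts, strings and ints usually take several times the size of the JSON text, which is consistent with the roughly fourfold growth in the profile above.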
python -m memory_profiler libcoveoc4ids/cli/__main__.py --compact data.json
Filename: libcoveoc4ids/cli/__main__.py
Line # Mem usage Increment Line Contents
================================================
8 53.062 MiB 53.062 MiB @profile
9 def main():
10 53.062 MiB 0.000 MiB parser = argparse.ArgumentParser(description='Lib Cove OC4IDS CLI')
11 53.062 MiB 0.000 MiB parser.add_argument("filename")
12 53.062 MiB 0.000 MiB parser.add_argument("-c", "--compact", action="store_true", help="compact instead of pretty-printed output")
13
14 53.070 MiB 0.008 MiB args = parser.parse_args()
15
16 53.074 MiB 0.004 MiB cove_temp_folder = tempfile.mkdtemp(prefix='lib-cove-oc4ids-cli-', dir=tempfile.gettempdir())
17 53.074 MiB 0.000 MiB try:
18 53.074 MiB 0.000 MiB result = oc4ids_json_output(
19 53.074 MiB 0.000 MiB cove_temp_folder,
20 53.074 MiB 0.000 MiB args.filename,
21 326.312 MiB 273.238 MiB file_type='json'
22 )
23 finally:
24 326.316 MiB 0.004 MiB shutil.rmtree(cove_temp_folder)
25
26 326.316 MiB 0.000 MiB kwargs = {}
27 326.316 MiB 0.000 MiB if not args.compact:
28 if using_orjson:
29 kwargs['option'] = jsonlib.OPT_INDENT_2
30 else:
31 kwargs['indent'] = 2
32
33 380.668 MiB 54.352 MiB output = jsonlib.dumps(result, **kwargs)
34
35 380.668 MiB 0.000 MiB if using_orjson:
36 468.285 MiB 87.617 MiB output = output.decode('utf-8')
37
38 518.566 MiB 50.281 MiB print(output)
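Added the -o option to eliminate the time and memory spent printing and decoding.
Before (when using orjson, bytes need to be decoded to a string to be printed): see the profile above, where output.decode('utf-8') adds 87.617 MiB and print(output) adds 50.281 MiB.
After: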
python -m memory_profiler libcoveoc4ids/cli/__main__.py --compact data.json
Filename: libcoveoc4ids/cli/__main__.py
Line # Mem usage Increment Line Contents
================================================
8 53.109 MiB 53.109 MiB @profile
9 def main():
10 53.113 MiB 0.004 MiB parser = argparse.ArgumentParser(description='Lib Cove OC4IDS CLI')
11 53.113 MiB 0.000 MiB parser.add_argument("filename")
12 53.113 MiB 0.000 MiB parser.add_argument("-c", "--compact", action="store_true", help="compact instead of pretty-printed output")
13 53.113 MiB 0.000 MiB parser.add_argument("-o", "--output", help="write output to the given file instead of standard output")
14
15 53.121 MiB 0.008 MiB args = parser.parse_args()
16
17 53.121 MiB 0.000 MiB cove_temp_folder = tempfile.mkdtemp(prefix='lib-cove-oc4ids-cli-', dir=tempfile.gettempdir())
18 53.121 MiB 0.000 MiB try:
19 53.121 MiB 0.000 MiB result = oc4ids_json_output(
20 53.121 MiB 0.000 MiB cove_temp_folder,
21 53.121 MiB 0.000 MiB args.filename,
22 320.234 MiB 267.113 MiB file_type='json'
23 )
24 finally:
25 320.305 MiB 0.070 MiB shutil.rmtree(cove_temp_folder)
26
27 320.305 MiB 0.000 MiB kwargs = {}
28
29 320.305 MiB 0.000 MiB if using_orjson:
30 320.305 MiB 0.000 MiB kwargs['option'] = 0
31 320.305 MiB 0.000 MiB if not args.compact:
32 kwargs['option'] |= jsonlib.OPT_INDENT_2
33 320.305 MiB 0.000 MiB if args.output:
34 320.305 MiB 0.000 MiB kwargs['option'] |= jsonlib.OPT_APPEND_NEWLINE
35
36 375.785 MiB 55.480 MiB output = jsonlib.dumps(result, **kwargs)
37 375.785 MiB 0.000 MiB if args.output:
38 375.785 MiB 0.000 MiB with open(args.output, 'wb') as f:
39 375.785 MiB 0.000 MiB f.write(output)
40 else:
41 print(output.decode())
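The saving comes from skipping the bytes-to-str copy when writing to a file. A minimal sketch of that branch in isolation, using only the standard library: the emit function is a hypothetical name, and json.dumps(...).encode() stands in for orjson.dumps, which returns bytes.

```python
import json


def emit(serialized, output_path=None):
    """Write serialized JSON bytes to a file, or decode and print them."""
    if output_path:
        # Binary mode: the bytes go to disk as-is, with no intermediate str copy.
        with open(output_path, "wb") as f:
            f.write(serialized)
    else:
        # Printing needs a str, so the whole payload is decoded first.
        print(serialized.decode())


# Stand-in for orjson.dumps(result): stdlib json returns str, so encode it.
serialized = json.dumps({"validation_errors": []}).encode("utf-8")
emit(serialized)
```

With a path, nothing larger than the serialized bytes is held in memory; without one, the decoded string exists alongside the bytes until printing finishes.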
After https://github.com/OpenDataServices/lib-cove/pull/60, get_json_data_generic_paths goes from 10% to 3%:
After https://github.com/OpenDataServices/lib-cove/pull/61 and https://github.com/OpenDataServices/lib-cove/pull/62, get_fields_present_with_examples takes less time:
Now focusing on jsonschema. Fixing is_type yields:
The script has a baseline of 53 MB, jsonlib.loads adds 229 MB, then common_checks_oc4ids adds the rest.
Filename: /Users/james/Sites/remote/open-contracting/lib-cove-oc4ids/libcoveoc4ids/api.py
Line # Mem usage Increment Line Contents
================================================
23 53.164 MiB 53.164 MiB @profile
24 def oc4ids_json_output(output_dir, file, file_type=None, json_data=None,
25 lib_cove_oc4ids_config=None):
26
27 53.164 MiB 0.000 MiB if not lib_cove_oc4ids_config:
28 53.164 MiB 0.000 MiB lib_cove_oc4ids_config = LibCoveOC4IDSConfig()
29
30 53.164 MiB 0.000 MiB if not file_type:
31 file_type = get_file_type(file)
32 53.164 MiB 0.000 MiB context = {"file_type": file_type}
33
34 53.164 MiB 0.000 MiB if file_type == 'json':
35 53.164 MiB 0.000 MiB if not json_data:
36 53.164 MiB 0.000 MiB if using_orjson:
37 53.164 MiB 0.000 MiB kwargs = {'mode': 'rb'}
38 else:
39 kwargs = {'encoding': 'utf-8'}
40 53.164 MiB 0.000 MiB with open(file, **kwargs) as fp:
41 53.164 MiB 0.000 MiB try:
42 53.164 MiB 0.000 MiB if using_orjson:
43 282.121 MiB 228.957 MiB json_data = jsonlib.loads(fp.read())
44 else:
45 json_data = jsonlib.load(fp)
46 except ValueError:
47 raise APIException('The file looks like invalid json')
48
49 282.125 MiB 0.004 MiB schema_oc4ids = SchemaOC4IDS(lib_cove_oc4ids_config=lib_cove_oc4ids_config)
50
51 else:
52
53 raise Exception("JSON only for now, sorry!")
54
55 282.125 MiB 0.000 MiB context = common_checks_oc4ids(
56 282.125 MiB 0.000 MiB context,
57 282.125 MiB 0.000 MiB output_dir,
58 282.125 MiB 0.000 MiB json_data,
59 282.125 MiB 0.000 MiB schema_oc4ids,
60 327.008 MiB 327.008 MiB lib_cove_oc4ids_config=lib_cove_oc4ids_config)
61
62 327.008 MiB 0.000 MiB return context
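The loading branch profiled above can be sketched on its own. This is an illustration, not the repo's code: how using_orjson gets defined is an assumption (an optional import with a stdlib fallback), load_json is a hypothetical name, and a plain ValueError stands in for APIException.

```python
import json
import os
import tempfile

try:
    import orjson as jsonlib  # optional third-party dependency (assumed)
    using_orjson = True
except ImportError:
    jsonlib = json
    using_orjson = False


def load_json(path):
    """Parse a JSON file, reading bytes for orjson and text for stdlib json."""
    kwargs = {"mode": "rb"} if using_orjson else {"encoding": "utf-8"}
    with open(path, **kwargs) as fp:
        try:
            if using_orjson:
                # orjson has no load(); it parses bytes read from the file.
                return jsonlib.loads(fp.read())
            return jsonlib.load(fp)
        except ValueError as e:
            # json.JSONDecodeError and orjson.JSONDecodeError both subclass ValueError.
            raise ValueError("The file looks like invalid JSON") from e


# Round-trip a small temporary file to show the call.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write('{"projects": []}')
    loaded = load_json(sample_path)
print(loaded)
```

Either way the whole document is materialised as Python objects, which is where the 229 MB increment in the profile comes from.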
Filename: /Users/james/Sites/remote/opendataservices/lib-cove/libcove/lib/common.py
Line # Mem usage Increment Line Contents
================================================
637 285.719 MiB 285.719 MiB @profile
638 def get_schema_validation_errors(
639 json_data,
640 schema_obj,
641 schema_name,
642 cell_src_map,
643 heading_src_map,
644 extra_checkers=None,
645 ):
646 285.719 MiB 0.000 MiB pkg_schema_obj = schema_obj.get_pkg_schema_obj()
647
648 285.719 MiB 0.000 MiB validation_errors = collections.defaultdict(list)
649 285.719 MiB 0.000 MiB format_checker = FormatChecker()
650 285.719 MiB 0.000 MiB if extra_checkers:
651 format_checker.checkers.update(extra_checkers)
652
653 285.719 MiB 0.000 MiB if getattr(schema_obj, "extended", None):
654 resolver = CustomRefResolver(
655 "",
656 pkg_schema_obj,
657 config=schema_obj.config,
658 schema_url=schema_obj.schema_host,
659 schema_file=schema_obj.extended_schema_file,
660 file_schema_name=schema_obj.schema_name,
661 )
662 else:
663 285.719 MiB 0.000 MiB resolver = CustomRefResolver(
664 285.719 MiB 0.000 MiB "",
665 285.719 MiB 0.000 MiB pkg_schema_obj,
666 285.719 MiB 0.000 MiB config=schema_obj.config,
667 285.719 MiB 0.000 MiB schema_url=schema_obj.schema_host,
668 )
669
670 285.719 MiB 0.000 MiB our_validator = validator(
671 285.719 MiB 0.000 MiB pkg_schema_obj, format_checker=format_checker, resolver=resolver
672 )
673 308.219 MiB 0.902 MiB for e in our_validator.iter_errors(json_data):
674 308.219 MiB 0.004 MiB message_safe = None
675 308.219 MiB 0.000 MiB message = e.message
676 308.219 MiB 0.004 MiB path = "/".join(str(item) for item in e.path)
677 308.219 MiB 0.000 MiB path_no_number = "/".join(
678 308.219 MiB 0.004 MiB str(item) for item in e.path if not isinstance(item, int)
679 )
680
681 308.219 MiB 0.000 MiB value = {"path": path}
682 308.219 MiB 0.000 MiB cell_reference = cell_src_map.get(path)
683
684 308.219 MiB 0.000 MiB if cell_reference:
685 first_reference = cell_reference[0]
686 if len(first_reference) == 4:
687 (
688 value["sheet"],
689 value["col_alpha"],
690 value["row_number"],
691 value["header"],
692 ) = first_reference
693 if len(first_reference) == 2:
694 value["sheet"], value["row_number"] = first_reference
695
696 308.219 MiB 0.000 MiB header = value.get("header")
697
698 308.219 MiB 0.000 MiB header_extra = None
699 308.219 MiB 0.000 MiB pre_header = ""
700
701 308.219 MiB 0.000 MiB if not header and len(e.path):
702 308.219 MiB 0.000 MiB header = e.path[-1]
703 308.219 MiB 0.000 MiB if isinstance(e.path[-1], int) and len(e.path) >= 2:
704 # We're dealing with elements in an array of items at this point
705 300.898 MiB 0.000 MiB pre_header = "Array Element "
706 300.898 MiB 0.004 MiB header_extra = "{}/[number]".format(e.path[-2])
707
708 308.219 MiB 0.000 MiB null_clause = ""
709 308.219 MiB 0.000 MiB validator_type = e.validator
710 308.219 MiB 0.000 MiB if e.validator in ("format", "type"):
711 308.219 MiB 0.000 MiB validator_type = e.validator_value
712 308.219 MiB 0.000 MiB if isinstance(e.validator_value, list):
713 308.219 MiB 0.000 MiB validator_type = e.validator_value[0]
714 308.219 MiB 0.000 MiB if "null" not in e.validator_value:
715 308.219 MiB 0.000 MiB null_clause = "is not null, and"
716 else:
717 308.219 MiB 0.000 MiB null_clause = "is not null, and"
718
719 308.219 MiB 0.000 MiB message_template = validation_error_template_lookup.get(
720 308.219 MiB 0.000 MiB validator_type, message
721 )
722 308.219 MiB 0.000 MiB message_safe_template = validation_error_template_lookup_safe.get(
723 308.219 MiB 0.000 MiB validator_type
724 )
725
726 308.219 MiB 0.000 MiB if message_template:
727 308.219 MiB 0.004 MiB message = message_template.format(pre_header, header, null_clause)
728
729 308.219 MiB 0.000 MiB if message_safe_template:
730 308.219 MiB 0.000 MiB message_safe = format_html(
731 308.219 MiB 0.008 MiB message_safe_template, pre_header, header, null_clause
732 )
733
734 308.219 MiB 0.004 MiB if e.validator == "oneOf" and e.validator_value[0] == {"format": "date-time"}:
735 # Give a nice date related error message for 360Giving date `oneOf`s.
736 message = validation_error_template_lookup["date-time"]
737 message_safe = format_html(
738 validation_error_template_lookup_safe["date-time"]
739 )
740 validator_type = "date-time"
741
742 308.219 MiB 0.000 MiB if not isinstance(e.instance, (dict, list)):
743 308.219 MiB 0.000 MiB value["value"] = e.instance
744
745 308.219 MiB 0.000 MiB if e.validator == "required":
746 300.898 MiB 0.000 MiB field_name = e.message
747 300.898 MiB 0.000 MiB parent_name = None
748 300.898 MiB 0.000 MiB if len(e.path) > 2:
749 300.898 MiB 0.000 MiB if isinstance(e.path[-1], int):
750 300.898 MiB 0.000 MiB parent_name = e.path[-2]
751 else:
752 parent_name = e.path[-1]
753
754 300.898 MiB 0.000 MiB heading = heading_src_map.get(path_no_number + "/" + e.message)
755 300.898 MiB 0.000 MiB if heading:
756 field_name = heading[0][1]
757 value["header"] = heading[0][1]
758 300.898 MiB 0.000 MiB header = field_name
759 300.898 MiB 0.000 MiB if parent_name:
760 300.898 MiB 0.000 MiB message = "'{}' is missing but required within '{}'".format(
761 300.898 MiB 0.004 MiB field_name, parent_name
762 )
763 300.898 MiB 0.004 MiB message_safe = format_html(
764 300.898 MiB 0.000 MiB "<code>{}</code> is missing but required within <code>{}</code>",
765 300.898 MiB 0.000 MiB field_name,
766 300.898 MiB 0.008 MiB parent_name,
767 )
768 else:
769 message = "'{}' is missing but required".format(field_name)
770 message_safe = format_html(
771 "<code>{}</code> is missing but required", field_name, parent_name
772 )
773
774 308.219 MiB 0.000 MiB if e.validator == "enum":
775 307.074 MiB 0.000 MiB if "isCodelist" in e.schema:
776 continue
777 307.074 MiB 0.000 MiB message = "Invalid code found in '{}'".format(header)
778 307.074 MiB 0.004 MiB message_safe = format_html("Invalid code found in <code>{}</code>", header)
779
780 308.219 MiB 0.000 MiB if e.validator == "pattern":
781 message_safe = format_html(
782 "<code>{}</code> does not match the regex <code>{}</code>",
783 header,
784 e.validator_value,
785 )
786
787 308.219 MiB 0.000 MiB if e.validator == "minItems" and e.validator_value == 1:
788 message_safe = format_html(
789 "<code>{}</code> is too short. You must supply at least one value, or remove the item entirely (unless it’s required).",
790 e.instance,
791 )
792
793 308.219 MiB 0.000 MiB if e.validator == "minLength" and e.validator_value == 1:
794 message_safe = format_html(
795 '<code>"{}"</code> is too short. Strings must be at least one character. This error typically indicates a missing value.',
796 e.instance,
797 )
798
799 308.219 MiB 0.000 MiB if message_safe is None:
800 308.219 MiB 0.000 MiB message_safe = escape(message)
801
802 308.219 MiB 0.000 MiB if header_extra is None:
803 308.219 MiB 0.000 MiB header_extra = header
804
805 unique_validator_key = {
806 308.219 MiB 0.000 MiB "message": message,
807 308.219 MiB 0.000 MiB "message_safe": conditional_escape(message_safe),
808 308.219 MiB 0.000 MiB "validator": e.validator,
809 308.219 MiB 0.000 MiB "assumption": e.assumption if hasattr(e, "assumption") else None,
810 # Don't pass this value for 'enum' and 'required' validators,
811 # because it is not needed, and it will mean less grouping, which
812 # we don't want.
813 "validator_value": e.validator_value
814 308.219 MiB 0.000 MiB if e.validator not in ["enum", "required"]
815 307.074 MiB 0.000 MiB else None,
816 308.219 MiB 0.000 MiB "message_type": validator_type,
817 308.219 MiB 0.000 MiB "path_no_number": path_no_number,
818 308.219 MiB 0.000 MiB "header": header,
819 308.219 MiB 0.000 MiB "header_extra": header_extra,
820 308.219 MiB 0.000 MiB "null_clause": null_clause,
821 308.219 MiB 0.000 MiB "error_id": e.error_id if hasattr(e, "error_id") else None,
822 }
823 308.219 MiB 0.031 MiB validation_errors[json.dumps(unique_validator_key)].append(value)
824 308.219 MiB 0.000 MiB return dict(validation_errors)
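The grouping step at the end of that function (a defaultdict keyed by json.dumps of the error's fields) can be sketched in isolation. The group_errors helper and the sample errors are hypothetical, for illustration only.

```python
import collections
import json


def group_errors(errors):
    """Group (key_fields, value) pairs under a JSON-serialised key."""
    grouped = collections.defaultdict(list)
    for key_fields, value in errors:
        # dicts are unhashable, so the key is serialised to a string first.
        grouped[json.dumps(key_fields)].append(value)
    return dict(grouped)


# Hypothetical errors: two share the same key fields, one differs.
required = {"message": "'id' is missing but required", "validator": "required", "validator_value": None}
enum = {"message": "Invalid code found in 'status'", "validator": "enum", "validator_value": None}
errors = [
    (required, {"path": "projects/0"}),
    (required, {"path": "projects/1"}),
    (enum, {"path": "projects/0/status", "value": "unknown"}),
]
grouped = group_errors(errors)
for key, values in grouped.items():
    print(json.loads(key)["validator"], len(values))
```

Setting validator_value to None for "enum" and "required" errors, as the function above does, keeps otherwise identical errors under one key instead of splitting them by value.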
time python libcoveoc4ids/cli/__main__.py data.json > /dev/null
Got time down to:
Executed in 26.67 secs fish external
usr time 23.59 secs 227.00 micros 23.59 secs
sys time 0.81 secs 1608.00 micros 0.81 secs
Originally:
________________________________________________________
Executed in 43.23 secs fish external
usr time 39.75 secs 211.00 micros 39.75 secs
sys time 1.10 secs 1505.00 micros 1.10 secs
Closing this issue. Memory usage was halved, and running time is two-thirds faster. In a comment above, oc4ids_json_output used 327 MB, having started at 53 MB and used 229 MB just to load the JSON. That leaves 45 MB for lib-cove, which is small compared with the memory needed for the JSON data itself.
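The memory arithmetic, worked through with the rounded figures from the api.py profile (the variable names are only for illustration):

```python
total_mib = 327      # memory when oc4ids_json_output returns
baseline_mib = 53    # memory before any data is loaded
json_load_mib = 229  # increment from jsonlib.loads

lib_cove_mib = total_mib - baseline_mib - json_load_mib
print(lib_cove_mib)  # 45

# lib-cove's share relative to the cost of just holding the parsed JSON.
share = lib_cove_mib / json_load_mib
print(f"{share:.0%}")  # 20%
```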
PRs are open in https://github.com/OpenDataServices/lib-cove/pulls/jpmckinney
Other follow-up issues are:
Related earlier issue: https://github.com/OpenDataServices/cove/issues/579, which links to this document: https://docs.google.com/document/d/1DwFGVLjqTD3yfWg-7zKGsl4X5L2B2R7yAJkGCyoBLUA/edit
To address issues from CRM-6536. data.json is attached to that issue.
I ran:
time's output is:
mprof plot output:
Going to use cProfile to identify methods to add the @profile decorator to:
As documented at https://ocp-software-handbook.readthedocs.io/en/latest/python/performance.html
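A minimal sketch of that cProfile step, assuming nothing about the actual commands used in this issue: profile one call, sort by cumulative time, and read off which functions are worth decorating with @profile. slow_check and profile_call are hypothetical names.

```python
import cProfile
import io
import json
import pstats


def slow_check(data):
    """Hypothetical stand-in for a validation step worth profiling."""
    return sum(len(json.dumps(item)) for item in data)


def profile_call(func, *args):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # The highest cumulative times are the candidates for line-level profiling.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


result, report = profile_call(slow_check, [{"id": i} for i in range(1000)])
print(report)
```

The functions at the top of the cumulative listing are the ones to decorate with memory_profiler's @profile for the line-by-line output shown elsewhere in this thread.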