Closed jakelever closed 2 years ago
oh! that's not good. I will run a profile on it and see what's slowing it down so much. I can probably improve the performance, and if not I can definitely make marking/stripping the annotations completely optional at the top level
In case it makes a difference can you post your python version and the results of pip freeze?
Initial profiling results with python 3.7 confirm it is lagging in the strip annotation markers function
4520702188 function calls (4506575183 primitive calls) in 1513.447 seconds
Ordered by: standard name
List reduced from 668 to 25 due to restriction <25>
ncalls tottime percall cumtime percall filename:lineno(function)
111/84 0.000 0.000 0.017 0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
69 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:103(release)
64 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:143(__init__)
64 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:147(__enter__)
64 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:151(__exit__)
69 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
64 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:176(cb)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:194(_lock_unlock_module)
78/2 0.000 0.000 0.048 0.024 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
844 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:222(_verbose_message)
58 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:307(__init__)
58 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:311(__enter__)
58 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:318(__exit__)
232 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:321(<genexpr>)
46 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:35(_new_module)
62 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:369(__init__)
104 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:403(cached)
76 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:416(parent)
58 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:424(has_location)
58 0.000 0.000 0.001 0.000 <frozen importlib._bootstrap>:504(_init_module_attrs)
58/56 0.000 0.000 0.007 0.000 <frozen importlib._bootstrap>:576(module_from_spec)
64 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:58(__init__)
58/2 0.000 0.000 0.049 0.025 <frozen importlib._bootstrap>:663(_load_unlocked)
62 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:719(find_spec)
69 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:78(acquire)
Sun Jan 30 11:22:07 2022 profile.issues10.out
4520702188 function calls (4506575183 primitive calls) in 1513.447 seconds
Ordered by: cumulative time
List reduced from 668 to 25 due to restriction <25>
ncalls tottime percall cumtime percall filename:lineno(function)
57/1 0.000 0.000 1513.447 1513.447 {built-in method builtins.exec}
1 0.000 0.000 1513.447 1513.447 convert.py:2(<module>)
1 0.000 0.000 1513.396 1513.396 main.py:26(convert)
1 0.003 0.003 1513.394 1513.394 pmcxml.py:354(pmcxml2bioc)
217 8.797 0.041 1502.503 6.924 utils.py:255(strip_annotation_markers)
8010700 4.262 0.000 1471.568 0.000 re.py:180(search)
10901861 44.430 0.000 1453.659 0.000 re.py:271(_compile)
8010487 26.871 0.000 1362.766 0.000 sre_compile.py:759(compile)
8010487 18.001 0.000 895.223 0.000 sre_parse.py:919(parse)
8010527/8010487 14.135 0.000 855.176 0.000 sre_parse.py:417(_parse_sub)
8010559/8010493 310.134 0.000 834.118 0.000 sre_parse.py:475(_parse)
8010487 9.968 0.000 424.314 0.000 sre_compile.py:598(_code)
10013291/8010486 97.652 0.000 231.001 0.000 sre_compile.py:71(_compile)
376492023 82.130 0.000 198.495 0.000 sre_parse.py:254(get)
8010487 30.688 0.000 182.857 0.000 sre_compile.py:536(_compile_info)
392513114 124.763 0.000 124.763 0.000 sre_parse.py:233(__next)
348455987 74.736 0.000 107.907 0.000 sre_parse.py:164(__getitem__)
16021002 52.015 0.000 90.498 0.000 sre_compile.py:276(_optimize_charset)
334436635 61.936 0.000 83.225 0.000 sre_parse.py:172(append)
1322999171 77.738 0.000 77.738 0.000 {method 'append' of 'list' objects}
10013326/8010521 70.915 0.000 76.275 0.000 sre_parse.py:174(getwidth)
34044471 36.123 0.000 48.662 0.000 sre_parse.py:343(_escape)
415657710 42.946 0.000 42.946 0.000 {built-in method builtins.isinstance}
8010486 12.469 0.000 33.531 0.000 enum.py:833(__and__)
28036789 20.275 0.000 28.359 0.000 sre_parse.py:295(_class_escape)
Sun Jan 30 11:22:07 2022 profile.issues10.out
4520702188 function calls (4506575183 primitive calls) in 1513.447 seconds
Ordered by: internal time
List reduced from 668 to 25 due to restriction <25>
ncalls tottime percall cumtime percall filename:lineno(function)
8010559/8010493 310.134 0.000 834.118 0.000 sre_parse.py:475(_parse)
392513114 124.763 0.000 124.763 0.000 sre_parse.py:233(__next)
10013291/8010486 97.652 0.000 231.001 0.000 sre_compile.py:71(_compile)
376492023 82.130 0.000 198.495 0.000 sre_parse.py:254(get)
1322999171 77.738 0.000 77.738 0.000 {method 'append' of 'list' objects}
348455987 74.736 0.000 107.907 0.000 sre_parse.py:164(__getitem__)
10013326/8010521 70.915 0.000 76.275 0.000 sre_parse.py:174(getwidth)
334436635 61.936 0.000 83.225 0.000 sre_parse.py:172(append)
16021002 52.015 0.000 90.498 0.000 sre_compile.py:276(_optimize_charset)
10901861 44.430 0.000 1453.659 0.000 re.py:271(_compile)
415657710 42.946 0.000 42.946 0.000 {built-in method builtins.isinstance}
34044471 36.123 0.000 48.662 0.000 sre_parse.py:343(_escape)
8010487 30.688 0.000 182.857 0.000 sre_compile.py:536(_compile_info)
8010487 26.871 0.000 1362.766 0.000 sre_compile.py:759(compile)
16021002 22.614 0.000 28.193 0.000 sre_compile.py:249(_compile_charset)
268481147/258467844 21.821 0.000 24.453 0.000 {built-in method builtins.len}
28036789 20.275 0.000 28.359 0.000 sre_parse.py:295(_class_escape)
334436671 19.770 0.000 19.770 0.000 {built-in method builtins.ord}
8010536 18.356 0.000 18.356 0.000 {method 'translate' of 'str' objects}
8010487 18.001 0.000 895.223 0.000 sre_parse.py:919(parse)
8010492/8010484 15.199 0.000 21.142 0.000 sre_compile.py:461(_get_literal_prefix)
8010699 15.056 0.000 15.056 0.000 {method 'search' of 're.Pattern' objects}
8010527/8010487 14.135 0.000 855.176 0.000 sre_parse.py:417(_parse_sub)
116683821 14.129 0.000 14.129 0.000 {method 'get' of 'dict' objects}
2002615 13.713 0.000 13.956 0.000 sre_compile.py:432(_generate_overlap_table)
Sun Jan 30 11:22:07 2022 profile.issues10.out
4520702188 function calls (4506575183 primitive calls) in 1513.447 seconds
Ordered by: call count
List reduced from 668 to 25 due to restriction <25>
ncalls tottime percall cumtime percall filename:lineno(function)
1322999171 77.738 0.000 77.738 0.000 {method 'append' of 'list' objects}
415657710 42.946 0.000 42.946 0.000 {built-in method builtins.isinstance}
392513114 124.763 0.000 124.763 0.000 sre_parse.py:233(__next)
376492023 82.130 0.000 198.495 0.000 sre_parse.py:254(get)
348455987 74.736 0.000 107.907 0.000 sre_parse.py:164(__getitem__)
334436671 19.770 0.000 19.770 0.000 {built-in method builtins.ord}
334436635 61.936 0.000 83.225 0.000 sre_parse.py:172(append)
268481147/258467844 21.821 0.000 24.453 0.000 {built-in method builtins.len}
116683821 14.129 0.000 14.129 0.000 {method 'get' of 'dict' objects}
56073536 10.777 0.000 10.777 0.000 {method 'find' of 'bytearray' objects}
50066018 8.885 0.000 12.946 0.000 sre_parse.py:249(match)
34044471 36.123 0.000 48.662 0.000 sre_parse.py:343(_escape)
28037125 6.911 0.000 6.911 0.000 {built-in method builtins.min}
28036789 20.275 0.000 28.359 0.000 sre_parse.py:295(_class_escape)
22029068 6.752 0.000 8.551 0.000 sre_parse.py:286(tell)
16021040 3.763 0.000 4.821 0.000 sre_parse.py:81(groups)
16021002 52.015 0.000 90.498 0.000 sre_compile.py:276(_optimize_charset)
16021002 22.614 0.000 28.193 0.000 sre_compile.py:249(_compile_charset)
16020980 8.344 0.000 17.727 0.000 enum.py:283(__call__)
16020980 9.383 0.000 9.383 0.000 enum.py:525(__new__)
16020974 4.313 0.000 5.658 0.000 sre_compile.py:595(isstring)
14018806 3.564 0.000 4.566 0.000 sre_parse.py:160(__len__)
14018361 2.315 0.000 2.315 0.000 sre_compile.py:453(_get_iscased)
12015779 10.676 0.000 12.672 0.000 sre_parse.py:408(_uniq)
10901861 44.430 0.000 1453.659 0.000 re.py:271(_compile)
Turns out is was a very easy optimisation since we already know the expected pattern of the annotations. I added a param for it tho in case someone wants to use a custom one (and to make tests simpler)
Now that file takes ~12s including added time for profiling
Hey Cara, a very large single PMC file seems to be stalling the
strip_annotation_markers
function during conversion. I've left it running for a few hours and it never finishes.The problem article is PMC4829797 which seems to be a book. The file is very big but I don't think it should stall the conversion completely. The file is: PMC4829797.xml.gz
For the moment, I'm basically skipping
strip_annotation_markers
so that I can do a full run for Cancermine, etc.The conversion command was: