jakelever / biotext

Get a nicely-chunked local copy of the biomedical literature (to use for other projects)!
MIT License
13 stars 5 forks source link

Troublesome PMC file stalls conversion #10

Closed jakelever closed 2 years ago

jakelever commented 2 years ago

Hey Cara, a very large single PMC file seems to be stalling the strip_annotation_markers function during conversion. I've left it running for a few hours and it never finishes.

The problem article is PMC4829797 which seems to be a book. The file is very big but I don't think it should stall the conversion completely. The file is: PMC4829797.xml.gz

For the moment, I'm basically skipping strip_annotation_markers so that I can do a full run for Cancermine, etc.

The conversion command was:

python src/convert.py --i PMC4829797.xml --iFormat pmcxml --o test.bioc.xml --oFormat biocxml
creisle commented 2 years ago

oh! that's not good. I will run a profile on it and see what's slowing it down so much. I can probably improve the performance, and if not I can definitely make marking/stripping the annotations completely optional at the top level

creisle commented 2 years ago

In case it makes a difference can you post your python version and the results of pip freeze?

creisle commented 2 years ago

Initial profiling results with python 3.7 confirm it is lagging in the strip annotation markers function


         4520702188 function calls (4506575183 primitive calls) in 1513.447 seconds

   Ordered by: standard name
   List reduced from 668 to 25 due to restriction <25>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   111/84    0.000    0.000    0.017    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
       69    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:103(release)
       64    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:143(__init__)
       64    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:147(__enter__)
       64    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:151(__exit__)
       69    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
       64    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:176(cb)
        5    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:194(_lock_unlock_module)
     78/2    0.000    0.000    0.048    0.024 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
      844    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:222(_verbose_message)
       58    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:307(__init__)
       58    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:311(__enter__)
       58    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:318(__exit__)
      232    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:321(<genexpr>)
       46    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:35(_new_module)
       62    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:369(__init__)
      104    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:403(cached)
       76    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:416(parent)
       58    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:424(has_location)
       58    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap>:504(_init_module_attrs)
    58/56    0.000    0.000    0.007    0.000 <frozen importlib._bootstrap>:576(module_from_spec)
       64    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:58(__init__)
     58/2    0.000    0.000    0.049    0.025 <frozen importlib._bootstrap>:663(_load_unlocked)
       62    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:719(find_spec)
       69    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:78(acquire)

Sun Jan 30 11:22:07 2022    profile.issues10.out

         4520702188 function calls (4506575183 primitive calls) in 1513.447 seconds

   Ordered by: cumulative time
   List reduced from 668 to 25 due to restriction <25>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     57/1    0.000    0.000 1513.447 1513.447 {built-in method builtins.exec}
        1    0.000    0.000 1513.447 1513.447 convert.py:2(<module>)
        1    0.000    0.000 1513.396 1513.396 main.py:26(convert)
        1    0.003    0.003 1513.394 1513.394 pmcxml.py:354(pmcxml2bioc)
      217    8.797    0.041 1502.503    6.924 utils.py:255(strip_annotation_markers)
  8010700    4.262    0.000 1471.568    0.000 re.py:180(search)
 10901861   44.430    0.000 1453.659    0.000 re.py:271(_compile)
  8010487   26.871    0.000 1362.766    0.000 sre_compile.py:759(compile)
  8010487   18.001    0.000  895.223    0.000 sre_parse.py:919(parse)
8010527/8010487   14.135    0.000  855.176    0.000 sre_parse.py:417(_parse_sub)
8010559/8010493  310.134    0.000  834.118    0.000 sre_parse.py:475(_parse)
  8010487    9.968    0.000  424.314    0.000 sre_compile.py:598(_code)
10013291/8010486   97.652    0.000  231.001    0.000 sre_compile.py:71(_compile)
376492023   82.130    0.000  198.495    0.000 sre_parse.py:254(get)
  8010487   30.688    0.000  182.857    0.000 sre_compile.py:536(_compile_info)
392513114  124.763    0.000  124.763    0.000 sre_parse.py:233(__next)
348455987   74.736    0.000  107.907    0.000 sre_parse.py:164(__getitem__)
 16021002   52.015    0.000   90.498    0.000 sre_compile.py:276(_optimize_charset)
334436635   61.936    0.000   83.225    0.000 sre_parse.py:172(append)
1322999171   77.738    0.000   77.738    0.000 {method 'append' of 'list' objects}
10013326/8010521   70.915    0.000   76.275    0.000 sre_parse.py:174(getwidth)
 34044471   36.123    0.000   48.662    0.000 sre_parse.py:343(_escape)
415657710   42.946    0.000   42.946    0.000 {built-in method builtins.isinstance}
  8010486   12.469    0.000   33.531    0.000 enum.py:833(__and__)
 28036789   20.275    0.000   28.359    0.000 sre_parse.py:295(_class_escape)

Sun Jan 30 11:22:07 2022    profile.issues10.out

         4520702188 function calls (4506575183 primitive calls) in 1513.447 seconds

   Ordered by: internal time
   List reduced from 668 to 25 due to restriction <25>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
8010559/8010493  310.134    0.000  834.118    0.000 sre_parse.py:475(_parse)
392513114  124.763    0.000  124.763    0.000 sre_parse.py:233(__next)
10013291/8010486   97.652    0.000  231.001    0.000 sre_compile.py:71(_compile)
376492023   82.130    0.000  198.495    0.000 sre_parse.py:254(get)
1322999171   77.738    0.000   77.738    0.000 {method 'append' of 'list' objects}
348455987   74.736    0.000  107.907    0.000 sre_parse.py:164(__getitem__)
10013326/8010521   70.915    0.000   76.275    0.000 sre_parse.py:174(getwidth)
334436635   61.936    0.000   83.225    0.000 sre_parse.py:172(append)
 16021002   52.015    0.000   90.498    0.000 sre_compile.py:276(_optimize_charset)
 10901861   44.430    0.000 1453.659    0.000 re.py:271(_compile)
415657710   42.946    0.000   42.946    0.000 {built-in method builtins.isinstance}
 34044471   36.123    0.000   48.662    0.000 sre_parse.py:343(_escape)
  8010487   30.688    0.000  182.857    0.000 sre_compile.py:536(_compile_info)
  8010487   26.871    0.000 1362.766    0.000 sre_compile.py:759(compile)
 16021002   22.614    0.000   28.193    0.000 sre_compile.py:249(_compile_charset)
268481147/258467844   21.821    0.000   24.453    0.000 {built-in method builtins.len}
 28036789   20.275    0.000   28.359    0.000 sre_parse.py:295(_class_escape)
334436671   19.770    0.000   19.770    0.000 {built-in method builtins.ord}
  8010536   18.356    0.000   18.356    0.000 {method 'translate' of 'str' objects}
  8010487   18.001    0.000  895.223    0.000 sre_parse.py:919(parse)
8010492/8010484   15.199    0.000   21.142    0.000 sre_compile.py:461(_get_literal_prefix)
  8010699   15.056    0.000   15.056    0.000 {method 'search' of 're.Pattern' objects}
8010527/8010487   14.135    0.000  855.176    0.000 sre_parse.py:417(_parse_sub)
116683821   14.129    0.000   14.129    0.000 {method 'get' of 'dict' objects}
  2002615   13.713    0.000   13.956    0.000 sre_compile.py:432(_generate_overlap_table)

Sun Jan 30 11:22:07 2022    profile.issues10.out

         4520702188 function calls (4506575183 primitive calls) in 1513.447 seconds

   Ordered by: call count
   List reduced from 668 to 25 due to restriction <25>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1322999171   77.738    0.000   77.738    0.000 {method 'append' of 'list' objects}
415657710   42.946    0.000   42.946    0.000 {built-in method builtins.isinstance}
392513114  124.763    0.000  124.763    0.000 sre_parse.py:233(__next)
376492023   82.130    0.000  198.495    0.000 sre_parse.py:254(get)
348455987   74.736    0.000  107.907    0.000 sre_parse.py:164(__getitem__)
334436671   19.770    0.000   19.770    0.000 {built-in method builtins.ord}
334436635   61.936    0.000   83.225    0.000 sre_parse.py:172(append)
268481147/258467844   21.821    0.000   24.453    0.000 {built-in method builtins.len}
116683821   14.129    0.000   14.129    0.000 {method 'get' of 'dict' objects}
 56073536   10.777    0.000   10.777    0.000 {method 'find' of 'bytearray' objects}
 50066018    8.885    0.000   12.946    0.000 sre_parse.py:249(match)
 34044471   36.123    0.000   48.662    0.000 sre_parse.py:343(_escape)
 28037125    6.911    0.000    6.911    0.000 {built-in method builtins.min}
 28036789   20.275    0.000   28.359    0.000 sre_parse.py:295(_class_escape)
 22029068    6.752    0.000    8.551    0.000 sre_parse.py:286(tell)
 16021040    3.763    0.000    4.821    0.000 sre_parse.py:81(groups)
 16021002   52.015    0.000   90.498    0.000 sre_compile.py:276(_optimize_charset)
 16021002   22.614    0.000   28.193    0.000 sre_compile.py:249(_compile_charset)
 16020980    8.344    0.000   17.727    0.000 enum.py:283(__call__)
 16020980    9.383    0.000    9.383    0.000 enum.py:525(__new__)
 16020974    4.313    0.000    5.658    0.000 sre_compile.py:595(isstring)
 14018806    3.564    0.000    4.566    0.000 sre_parse.py:160(__len__)
 14018361    2.315    0.000    2.315    0.000 sre_compile.py:453(_get_iscased)
 12015779   10.676    0.000   12.672    0.000 sre_parse.py:408(_uniq)
 10901861   44.430    0.000 1453.659    0.000 re.py:271(_compile)
creisle commented 2 years ago

Turns out is was a very easy optimisation since we already know the expected pattern of the annotations. I added a param for it tho in case someone wants to use a custom one (and to make tests simpler)

Now that file takes ~12s including added time for profiling