ultrabug / mkdocs-static-i18n

MkDocs i18n plugin using static translation markdown files
https://ultrabug.github.io/mkdocs-static-i18n/
MIT License
223 stars 37 forks source link

fix and improve reconfiture_search_duplicates #253

Closed AngryMane closed 11 months ago

AngryMane commented 11 months ago

@ultrabug I noticed another challenge when creating a PR a few days ago, so I am issuing this PR. Please review it.

description

I improved performance of reconfigure_search_duplicates and fix (maybe) a bug.

effects of Improvements

Bulid took 9m47.324s so far, but takes 0m12.866s after this improment.

# generate many files
$ ./copy.sh 

$ time mkdocs build
INFO    -  mkdocs_static_i18n: Adding 'ja' to the 'search' plugin 'lang' option
INFO    -  Cleaning site directory
INFO    -  Building documentation to directory: /home/yosuke/work/git/sample-mkdocs-prj/site
INFO    -  mkdocs_static_i18n: Building 'ja' documentation to directory: /home/yosuke/work/git/sample-mkdocs-prj/site
INFO    -  Documentation built in 6.30 seconds
INFO    -  Documentation built in 587.09 seconds

real    9m47.324s
user    9m47.103s
sys     0m0.220s

$ time mkdocs build
INFO    -  Cleaning site directory
INFO    -  Building documentation to directory: /home/yosuke/work/git/sample-mkdocs-prj/site
INFO    -  mkdocs_static_i18n: Building 'ja' documentation to directory: /home/yosuke/work/git/sample-mkdocs-prj/site
INFO    -  Documentation built in 6.31 seconds
INFO    -  Documentation built in 12.64 seconds

real    0m12.866s
user    0m12.641s
sys     0m0.220s

This is logged with following env.

root cause

reconfigure_search_duplicates takes too much time. See following result.

$ vmprofshow vmlog  tree  --prune_percent 1
100.0%  _run_module_as_main  100.0%  /usr/lib/python3.10/runpy.py:171
100.0% |. _run_code  100.0%  /usr/lib/python3.10/runpy.py:63
100.0% |.|. <native symbol 0x7fefa18a73c0>  100.0%  -:0
100.0% |.|.|. main  100.0%  /home/yosuke/.local/lib/python3.10/site-packages/vmprof/__main__.py:30
100.0% |.|.|.|. run_path  100.0%  /usr/lib/python3.10/runpy.py:245
100.0% |.|.|.|.|. _run_module_code  100.0%  /usr/lib/python3.10/runpy.py:89
100.0% |.|.|.|.|.|. _run_code  100.0%  /usr/lib/python3.10/runpy.py:63
100.0% |.|.|.|.|.|.|. ImpLoader  100.0%  /usr/lib/python3.10/pkgutil.py:269
100.0% |.|.|.|.|.|.|.|. __call__  100.0%  /usr/lib/python3/dist-packages/click/core.py:1126
100.0% |.|.|.|.|.|.|.|.|. main  100.0%  /usr/lib/python3/dist-packages/click/core.py:981
100.0% |.|.|.|.|.|.|.|.|.|. invoke  100.0%  /usr/lib/python3/dist-packages/click/core.py:1625
100.0% |.|.|.|.|.|.|.|.|.|.|. invoke  100.0%  /usr/lib/python3/dist-packages/click/core.py:1384
100.0% |.|.|.|.|.|.|.|.|.|.|.|. invoke  100.0%  /usr/lib/python3/dist-packages/click/core.py:703
100.0% |.|.|.|.|.|.|.|.|.|.|.|.|. build_command  100.0%  /home/yosuke/work/git/mkdocs/mkdocs/__main__.py:273
100.0% |.|.|.|.|.|.|.|.|.|.|.|.|.|. build  100.0%  /home/yosuke/.local/lib/python3.10/site-packages/mkdocs/commands/build.py:258
 98.9% |.|.|.|.|.|.|.|.|.|.|.|.|.|.|. on_post_build  98.9%  /home/yosuke/.local/lib/python3.10/site-packages/mkdocs/plugins.py:541
 98.9% |.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|. run_event  100.0%  /home/yosuke/.local/lib/python3.10/site-packages/mkdocs/plugins.py:492
 98.9% |.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|. on_post_build  100.0%  /home/yosuke/work/git/mkdocs-static-i18n/mkdocs_static_i18n/plugin.py:146
  1.1% |.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|. build  1.1%  /home/yosuke/.local/lib/python3.10/site-packages/mkdocs/commands/build.py:258
 97.8% |.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|. reconfigure_search_index  98.9%  /home/yosuke/work/git/mkdocs-static-i18n/mkdocs_static_i18n/reconfigure.py:486
 97.8% |.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|. reconfigure_search_duplicates  100.0%  /home/yosuke/work/git/mkdocs-static-i18n/mkdocs_static_i18n/reconfigure.py:449
 74.6% |.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|. <native symbol 0x55cc6a805121>  76.3%  -:0
  8.0% |.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|. <native symbol 0x55cc6a8f7611>  10.8%  -:0
  1.6% |.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|. PyUnicode_RichCompare  19.7%  -:0
  2.0% |.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|. <lambda>  2.1%  /home/yosuke/work/git/mkdocs-static-i18n/mkdocs_static_i18n/reconfigure.py:473

The first number in the line is the percentage of the total execution time of the function. As you can see, reconfigure_search_duplicates function takes too much time.

Then, I logged with line_profiler. See following result.

$ python3 -m line_profiler -rmt "__main__.py.lprof"
Timer unit: 1e-06 s

Total time: 638.575 s
File: /home/yosuke/work/git/mkdocs-static-i18n/mkdocs_static_i18n/reconfigure.py
Function: reconfigure_search_duplicates at line 449

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   449                                               @profile
   450                                               def reconfigure_search_duplicates(self, search_index_entries):
   451                                                   """
   452                                                   We want to avoid indexing the same pages twice if the default language
   453                                                   has its own version built as well as the /language version too as this
   454                                                   would pollute the search results.
   455                                                   When this happens, we favor the default language location if its
   456                                                   content is the same as its /language counterpart.
   457                                                   """
   458         2          2.1      1.1      0.0          default_lang_entries = filter(
   459         1          1.0      1.0      0.0              lambda x: not x["location"].startswith(
   460                                                           tuple([f"{lang}/" for lang in self.build_languages])
   461                                                       ),
   462         1          0.1      0.1      0.0              search_index_entries,
   463                                                   )
   464         2     110505.5  55252.8      0.0          target_lang_entries = list(
   465         2          0.8      0.4      0.0              filter(
   466         1          0.3      0.3      0.0                  lambda x: x["location"].startswith(
   467                                                               tuple([f"{lang}/" for lang in self.build_languages])
   468                                                           ),
   469         1          0.1      0.1      0.0                  search_index_entries,
   470                                                       )
   471                                                   )
   472      7526     292071.2     38.8      0.0          for default_lang_entry in default_lang_entries:
   473     15050       8913.7      0.6      0.0              duplicated_entries = filter(
   474      7525       3248.7      0.4      0.0                  lambda x: x["title"] == default_lang_entry["title"]
   475                                                           and x["location"].endswith(x["location"])
   476                                                           and x["text"] == default_lang_entry["text"],
   477      7525       1033.0      0.1      0.0                  target_lang_entries,
   478                                                       )
   479   2182249   29943908.1     13.7      4.7              for duplicated_entry in duplicated_entries:
   480   2174724  605155199.4    278.3     94.8                  if duplicated_entry in search_index_entries:
   481     15048     245214.8     16.3      0.0                      log.debug(
   482     15048       5767.5      0.4      0.0                          f"removed duplicated search entry: {duplicated_entry['title']} "
   483      7524       1247.3      0.2      0.0                          f"{duplicated_entry['location']}"
   484                                                               )
   485      7524    2807696.3    373.2      0.4                      search_index_entries.remove(duplicated_entry)

638.57 seconds - /home/yosuke/work/git/mkdocs-static-i18n/mkdocs_static_i18n/reconfigure.py:449 - reconfigure_search_duplicates

The following two points are important.

How to improve

ultrabug commented 11 months ago

Yep, both a bug fix and a master PR, :pray: a big thank you and congratulations!