Improve Tutor's shell autocompletion response time

ARMBouhali commented 1 year ago

Background

I've been using tutor for my deployments since Open edX's maple release, and over 6 months as a development environment. One of the yet-to-be-solved DevExp issues I've faced is the slowness of the shell autocomplete feature.

Here is my experience with testing a basic tutor [tab][tab] in bash.

When I was using my ~~late (:cry:) not-so-strong~~ ubuntu machine [4G CPU / 16GB DDR3L / SSD]: response time was over 3s before the available commands are completed.
When I try the same with my ubuntu work machine [7G CPU / 16GB DDR4 / SSD], I get an initial response time around 1s second, subsequent [tab] have around 0.4s response time (which is quite acceptable but not applicable to less powerful machines).

Tasks

A suggested roadmap:

Find at least one method to profile a tutor CLI autocomplete response time.
Profile response times of a vanilla Tutor CLI (no enabled plugins) autocomplete.
Profile response time of a Tutor CLI with additional plugins and note any overheads.
Profile performance of tutor installed from pypi.org (pip install tutor) vs tutor installed in editable mode (pip install -e git+<tutor-gh-repo>)
Experiment with potential solutions to improve a click CLI performance, and come out with tested recommendations.
Apply recommendations on tutor core and on any plugins providing CLI extensions.
Add CLI performance recommendations to Tutor's plugin development documenation.
Maybe add tests to tutor to ensure an acceptable response time criteria is met.

Notes

This article about Improving the performance of a slow click CLI claims that larger python module imports is one potential bottleneck, and suggests to either move imports inside the command logic, or better to completely separate execution logic from CLI logic.

FYI @regisb

regisb commented 1 year ago

Interesting. I agree that auto-completion is a tad slow. I'm curious to see what we will discover when we profile Tutor. (profiling is always fun!)

kdmccormick commented 7 months ago

Hey @ARMBouhali , do you have any interest in working on this?

regisb commented 1 month ago

I decided to give another stab at this issue. I'll write down my notes here.

First of all, it's quite easy to run tutor in "completion mode". For instance, to simulate the completion of tutor config printvalue DOCK<tab, we should run:

COMP_WORDS='tutor config printvalue DOCK' COMP_CWORD=3 _TUTOR_COMPLETE=bash_complete tutor

(I found this by running bash with the set -x option)

From that point it's relatively easy to profile shell completion. Create a prof.py file:

import sys
sys.argv = ["tutor"]
import tutor.commands.cli; tutor.commands.cli.main()

Then run this script:

COMP_WORDS='tutor config printvalue DOCK' COMP_CWORD=3 _TUTOR_COMPLETE=bash_complete python -mcProfile -s cumulative ./prof.py > autocomplete.prof

Here are the first items > 0.1s cumulative time. I highlighted a couple of suspicious items:

1197590 function calls (1077470 primitive calls) in 0.767 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    519/1    0.002    0.000    0.769    0.769 {built-in method builtins.exec}
        1    0.000    0.000    0.769    0.769 prof.py:1(<module>)
        1    0.000    0.000    0.489    0.489 cli.py:21(main)
        1    0.000    0.000    0.447    0.447 core.py:1155(__call__)
        1    0.000    0.000    0.447    0.447 core.py:1010(main)
        1    0.000    0.000    0.447    0.447 core.py:1123(_main_shell_completion)
        1    0.000    0.000    0.447    0.447 shell_completion.py:17(shell_complete)
        1    0.000    0.000    0.446    0.446 shell_completion.py:285(complete)
        1    0.000    0.000    0.446    0.446 shell_completion.py:263(get_completions)
        2    0.000    0.000    0.367    0.184 config.py:191(render_full) <-------------------------- why 2 calls?
  362/282    0.000    0.000    0.367    0.001 env.py:378(render_unknown)
      288    0.000    0.000    0.366    0.001 env.py:393(render_str) <-------------------------- can we make this any faster? Maybe with a single Renderer instance?
    264/4    0.002    0.000    0.286    0.071 <frozen importlib._bootstrap>:1349(_find_and_load)
    257/4    0.001    0.000    0.286    0.071 <frozen importlib._bootstrap>:1304(_find_and_load_unlocked)
   577/11    0.001    0.000    0.285    0.026 <frozen importlib._bootstrap>:480(_call_with_frames_removed)
    253/8    0.001    0.000    0.284    0.036 <frozen importlib._bootstrap>:911(_load_unlocked)
    228/8    0.001    0.000    0.284    0.035 <frozen importlib._bootstrap_external>:989(exec_module)
        1    0.000    0.000    0.284    0.284 core.py:2425(shell_complete)
        1    0.000    0.000    0.284    0.284 config.py:30(shell_complete)
        1    0.000    0.000    0.284    0.284 config.py:38(_shell_complete_config_items)
      183    0.000    0.000    0.284    0.002 config.py:47(_candidate_config_items)
        1    0.000    0.000    0.284    0.284 params.py:23(config)
        1    0.000    0.000    0.284    0.284 config.py:52(load_full)
        1    0.000    0.000    0.279    0.279 cli.py:1(<module>)
    58/13    0.000    0.000    0.262    0.020 {built-in method builtins.__import__}
   174/53    0.000    0.000    0.261    0.005 <frozen importlib._bootstrap>:1390(_handle_fromlist)
      6/3    0.000    0.000    0.204    0.068 actions.py:99(do)
      6/3    0.000    0.000    0.203    0.068 actions.py:118(do_from_context)
    40/18    0.000    0.000    0.203    0.011 actions.py:31(do)
      288    0.003    0.000    0.198    0.001 env.py:100(__init__) <-------------------------- this is the Renderer contructor called 288 times
104004/289    0.092    0.000    0.181    0.001 copy.py:118(deepcopy)
 2023/289    0.028    0.000    0.179    0.001 copy.py:217(_deepcopy_dict)
        2    0.000    0.000    0.172    0.086 config.py:1(<module>)
        4    0.000    0.000    0.171    0.043 utils.py:1(<module>)
      288    0.000    0.000    0.168    0.001 env.py:182(render_str)
        1    0.000    0.000    0.162    0.162 shell_completion.py:495(_resolve_context)
        2    0.000    0.000    0.161    0.080 core.py:1722(resolve_command)
        1    0.000    0.000    0.161    0.161 cli.py:46(get_command)
        1    0.000    0.000    0.161    0.161 cli.py:64(ensure_plugins_enabled)
        1    0.000    0.000    0.161    0.161 config.py:313(_enable_plugins)
       34    0.001    0.000    0.156    0.005 __init__.py:1(<module>)
        1    0.000    0.000    0.154    0.154 config.py:40(load_minimal)
      288    0.001    0.000    0.141    0.000 environment.py:1089(from_string)
      288    0.001    0.000    0.136    0.000 environment.py:728(compile)

We should definitely investigate why render_full is called twice, and how to optimize the render_str function.

openedx / wg-devops