Closed brockfanning closed 3 years ago
@LucyGwilliamAdmin @jwestw This started as a logging ticket but turned into a performance optimization ticket, plus some logging. I've tried this out locally with a multilingual site and it seemed good. Testing details are in the PR description.
@brockfanning great to see this reduces build time.
From a conversation between Brock and me:
Brock: there are a few things could want to test:
To test #1 needs local instance of sdg-build to see output To test #2 you could do a local comparison like you describe above - yes - running it once with my branch, and timing it, and then running again with the branch pointing at whatever the requirements.txt normally points at, and timing that.
James: I suggest using timeit or something similar. Already made an environment, cloned sdg-build repo and have installed all things from requirements.txt Ran python scripts/build_data.py
Brock:
Will need to switch branches using
git+https://github.com/brockfanning/sdg-build-1@verbse-option
Then pip install -r scripts/requirements.txt --upgrade
And re-time it
EDIT: timeit might not work as it only seems to be for small snippets of code. Perhaps printing out a time stamp at the start and end of the script and a time Delta might work .
@jwestw I'll leave this one to you as looks like you already done quite a bit towards it :)
Testing so far on WSL/Ubuntu installation. I have only run these tests once, but I made sure I was doing nothing else on the computer at the same time. I inserted some code to time the process.
from sdg.open_sdg import open_sdg_build
import time
start_time = time.time ()
open_sdg_build(config='config_data.yml')
secs = int(time.time() - start_time)
print(f"--- {secs} seconds ---")
Without pointing verbose-option repo: --- 425 seconds ---
After pointing to verbose-option repo: --- 399 seconds ---
@brockfanning Please take a look and comment.
Thanks
@jwestw For me I get 447 seconds on 1.2.0, versus 166 seconds with this branch. Just to confirm - when you run the build script on 1.2.0 you are seeing a ton of those "Duplicate value" warnings, correct? Can you confirm that when you run the script with this branch, you are not seeing any of those warnings?
The difference didn't seem massively significant, so I was wondering if the branch switch had gone correctly. I will continue to look into it this morning.
Trying this again to make sure the lack of difference I saw yesterday wasn't because I'd accidentally run pip install -r ./scripts/requirements.txt --upgrade
instead of pip3
...
So first of all running the test while pointing to the original 1.2 branch, which is installed on my conda environment, called build_pfrm
.
Without pointing to the verbose option repo: --- 463 seconds ---
So similar to yesterday's result.
Then running the upgrade using pip3
, posting the output here for clarity:
(build_pfrm) james@DESKTOP-82548GK:~/sdg-data$ pip3 install -r ./scripts/requirements.txt --upgrade
Collecting git+https://github.com/dougmet/yamlmd (from -r ./scripts/requirements.txt (line 1))
Cloning https://github.com/dougmet/yamlmd to /tmp/pip-req-build-ox0fht57
Collecting git+https://github.com/brockfanning/sdg-build-1@verbse-option (from -r ./scripts/requirements.txt (line 3))
Cloning https://github.com/brockfanning/sdg-build-1 (to revision verbse-option) to /tmp/pip-req-build-b8hh1w4s
Then every message after that is :
Requirement already satisfied: ...
Which I am pretty sure is what I saw yesterday.
Running the test again, I run into the same problem. So I created an entirely new environment called build_pfrm_2
.
(build_pfrm_2) james@DESKTOP-82548GK:~/sdg-data$ pip3 install -r ./scripts/requirements.txt
Collecting git+https://github.com/dougmet/yamlmd (from -r ./scripts/requirements.txt (line 1))
Cloning https://github.com/dougmet/yamlmd to /tmp/pip-req-build-sr7lre85
Collecting git+https://github.com/brockfanning/sdg-build-1@verbse-option (from -r ./scripts/requirements.txt (line 3))
Cloning https://github.com/brockfanning/sdg-build-1 (to revision verbse-option) to /tmp/pip-req-build-odh1cpnj
Collecting ruamel.yaml>0.15
Using cached ruamel.yaml-0.16.12-py2.py3-none-any.whl (111 kB)
Collecting gitpython
Using cached GitPython-3.1.13-py3-none-any.whl (159 kB)...
Now pip3 freeze
, shows I have this installed.
sdg @ git+https://github.com/brockfanning/sdg-build-1@9397a28980f489a820ad26004c3cb45a54879632
Which is indeed the most recent commit here.
And finally the result of the test is:
--- 154 seconds ---
@jwestw Sounds good, I think the next step should be a feature branch (data and site) to ensure that everything on the site side still behaves as expected.
This is the feature branch.
It looks fine, but I am not sure if there's anything specific you want me to test.
Noting the last two test+build times here: (the second test (higher up the list) I'd pointed at the verbose-option branch)
@jwestw Nothing specific no, maybe just spot-check a few of the more complex indicators and make sure they look/behave the same as the live site?
I looked a few and saw no differences, so I'll go ahead and merge.
Fixes #191
This adds several performance optimizations to make sure the some methods don't run multiple times:
This also adds the "plumbing" for optional debug statements, and also adds some debug statements when various things start. The debug statements include elapsed time, which should help us track down any bottlenecks.
Here's an example of the debug output, without the above-mentioned optimizations, while running on the UK data:
You can see at the beginning that the data/metadata/translation inputs start multiple times. Here's the same thing, but with the above-mentioned optimizations:
So those optimizations seem to cut the build time down to about 25%! :)
Testing
Testing of this PR should make sure that the data builds still work as expected, preferably in both a mono-lingual and a multi-lingual scenario.
The requirements.txt file needs to include:
And the config_data.yml file needs to include:
Another type logs available is "warn". So to see everything it would be:
The "warn" logs cover everything that was already being printed (like the duplicate values warnings).