open-sdg / sdg-build

Python package to convert SDG-related data and metadata between formats
MIT License
5 stars 23 forks source link

Performance optimizations and logging option #192

Closed brockfanning closed 3 years ago

brockfanning commented 3 years ago

Fixes #191

This adds several performance optimizations to make sure the some methods don't run multiple times:

This also adds the "plumbing" for optional debug statements, and also adds some debug statements when various things start. The debug statements include elapsed time, which should help us track down any bottlenecks.

Here's an example of the debug output, without the above-mentioned optimizations, while running on the UK data:

00:00:00 - Starting input: InputCsvData
00:00:08 - Starting input: InputYamlMdMeta
00:00:44 - Starting translation input: TranslationInputSdgTranslations
00:00:50 - Starting input: InputCsvData
00:00:59 - Starting input: InputYamlMdMeta
00:01:36 - Starting translation input: TranslationInputSdgTranslations
00:01:43 - Starting input: InputCsvData
00:01:52 - Starting input: InputYamlMdMeta
00:02:31 - Starting translation input: TranslationInputSdgTranslations
00:03:06 - Starting input: InputCsvData
00:03:14 - Starting input: InputYamlMdMeta
00:03:53 - Starting translation input: TranslationInputSdgTranslations
00:04:09 - Starting output: OutputOpenSdg (en)
00:04:09 - Translating indicators into en
00:04:22 - Starting translation input: TranslationInputSdgTranslations
00:04:34 - Starting output: OutputOpenSdg (untranslated)
00:04:34 - Starting translation input: TranslationInputSdgTranslations
00:04:45 - Starting output: OutputGeoJson (en)
00:04:45 - Translating indicators into en
00:05:33 - Starting output: OutputGeoJson (untranslated)
00:06:09 - Starting output: OutputGeoJson (en)
00:06:09 - Translating indicators into en
00:06:52 - Starting output: OutputGeoJson (untranslated)
00:07:23 - Starting output: OutputGeoJson (en)
00:07:23 - Translating indicators into en
00:08:26 - Starting output: OutputGeoJson (untranslated)
00:09:16 - Starting translation input: TranslationInputSdgTranslations

You can see at the beginning that the data/metadata/translation inputs start multiple times. Here's the same thing, but with the above-mentioned optimizations:

00:00:00 - Starting input: InputCsvData
00:00:04 - Starting input: InputYamlMdMeta
00:00:20 - Starting translation input: TranslationInputSdgTranslations
00:01:19 - Starting output: OutputOpenSdg (en)
00:01:19 - Translating indicators into en
00:01:28 - Starting output: OutputOpenSdg (untranslated)
00:01:31 - Starting output: OutputGeoJson (en)
00:01:31 - Translating indicators into en
00:01:49 - Starting output: OutputGeoJson (untranslated)
00:01:53 - Starting output: OutputGeoJson (en)
00:01:53 - Translating indicators into en
00:01:56 - Starting output: OutputGeoJson (untranslated)
00:01:58 - Starting output: OutputGeoJson (en)
00:01:58 - Translating indicators into en
00:02:11 - Starting output: OutputGeoJson (untranslated)

So those optimizations seem to cut the build time down to about 25%! :)

Testing

Testing of this PR should make sure that the data builds still work as expected, preferably in both a mono-lingual and a multi-lingual scenario.

The requirements.txt file needs to include:

git+https://github.com/brockfanning/sdg-build-1@verbse-option

And the config_data.yml file needs to include:

logging:
  - debug

Another type logs available is "warn". So to see everything it would be:

logging:
  - debug
  - warn

The "warn" logs cover everything that was already being printed (like the duplicate values warnings).

brockfanning commented 3 years ago

@LucyGwilliamAdmin @jwestw This started as a logging ticket but turned into a performance optimization ticket, plus some logging. I've tried this out locally with a multilingual site and it seemed good. Testing details are in the PR description.

AnnCorp commented 3 years ago

@brockfanning great to see this reduces build time.

jwestw commented 3 years ago

From a conversation between Brock and me:

Brock: there are a few things could want to test:

  1. The debug output (containing the "numbers" aka times)
  2. The faster build time (aka performance optimizations)
  3. That the build still works when Open SDG uses it

3 is most important; can be tested on a feature branch as usual.

To test #1 needs local instance of sdg-build to see output To test #2 you could do a local comparison like you describe above - yes - running it once with my branch, and timing it, and then running again with the branch pointing at whatever the requirements.txt normally points at, and timing that.

James: I suggest using timeit or something similar. Already made an environment, cloned sdg-build repo and have installed all things from requirements.txt Ran python scripts/build_data.py

Brock: Will need to switch branches using git+https://github.com/brockfanning/sdg-build-1@verbse-option Then pip install -r scripts/requirements.txt --upgrade And re-time it

EDIT: timeit might not work as it only seems to be for small snippets of code. Perhaps printing out a time stamp at the start and end of the script and a time Delta might work .

LucyGwilliamAdmin commented 3 years ago

@jwestw I'll leave this one to you as looks like you already done quite a bit towards it :)

jwestw commented 3 years ago

Testing so far on WSL/Ubuntu installation. I have only run these tests once, but I made sure I was doing nothing else on the computer at the same time. I inserted some code to time the process.

from sdg.open_sdg import open_sdg_build
import time

start_time = time.time ()
open_sdg_build(config='config_data.yml')
secs = int(time.time() - start_time)
print(f"--- {secs} seconds ---") 

Without pointing verbose-option repo: --- 425 seconds ---

After pointing to verbose-option repo: --- 399 seconds ---

@brockfanning Please take a look and comment.

Thanks

brockfanning commented 3 years ago

@jwestw For me I get 447 seconds on 1.2.0, versus 166 seconds with this branch. Just to confirm - when you run the build script on 1.2.0 you are seeing a ton of those "Duplicate value" warnings, correct? Can you confirm that when you run the script with this branch, you are not seeing any of those warnings?

jwestw commented 3 years ago

The difference didn't seem massively significant, so I was wondering if the branch switch had gone correctly. I will continue to look into it this morning.

jwestw commented 3 years ago

Trying this again to make sure the lack of difference I saw yesterday wasn't because I'd accidentally run pip install -r ./scripts/requirements.txt --upgrade instead of pip3 ...

So first of all running the test while pointing to the original 1.2 branch, which is installed on my conda environment, called build_pfrm.

Without pointing to the verbose option repo: --- 463 seconds ---

So similar to yesterday's result.

Then running the upgrade using pip3, posting the output here for clarity:

(build_pfrm) james@DESKTOP-82548GK:~/sdg-data$ pip3 install -r ./scripts/requirements.txt --upgrade
Collecting git+https://github.com/dougmet/yamlmd (from -r ./scripts/requirements.txt (line 1))
  Cloning https://github.com/dougmet/yamlmd to /tmp/pip-req-build-ox0fht57
Collecting git+https://github.com/brockfanning/sdg-build-1@verbse-option (from -r ./scripts/requirements.txt (line 3))
  Cloning https://github.com/brockfanning/sdg-build-1 (to revision verbse-option) to /tmp/pip-req-build-b8hh1w4s

Then every message after that is :

Requirement already satisfied: ...

Which I am pretty sure is what I saw yesterday.

Running the test again, I run into the same problem. So I created an entirely new environment called build_pfrm_2.

(build_pfrm_2) james@DESKTOP-82548GK:~/sdg-data$ pip3 install -r ./scripts/requirements.txt 
Collecting git+https://github.com/dougmet/yamlmd (from -r ./scripts/requirements.txt (line 1))
  Cloning https://github.com/dougmet/yamlmd to /tmp/pip-req-build-sr7lre85
Collecting git+https://github.com/brockfanning/sdg-build-1@verbse-option (from -r ./scripts/requirements.txt (line 3))
  Cloning https://github.com/brockfanning/sdg-build-1 (to revision verbse-option) to /tmp/pip-req-build-odh1cpnj
Collecting ruamel.yaml>0.15
  Using cached ruamel.yaml-0.16.12-py2.py3-none-any.whl (111 kB)
Collecting gitpython
  Using cached GitPython-3.1.13-py3-none-any.whl (159 kB)...

Now pip3 freeze , shows I have this installed.

sdg @ git+https://github.com/brockfanning/sdg-build-1@9397a28980f489a820ad26004c3cb45a54879632

Which is indeed the most recent commit here.

And finally the result of the test is:

--- 154 seconds ---

brockfanning commented 3 years ago

@jwestw Sounds good, I think the next step should be a feature branch (data and site) to ensure that everything on the site side still behaves as expected.

jwestw commented 3 years ago

This is the feature branch.

It looks fine, but I am not sure if there's anything specific you want me to test.

Noting the last two test+build times here: (the second test (higher up the list) I'd pointed at the verbose-option branch)

image

brockfanning commented 3 years ago

@jwestw Nothing specific no, maybe just spot-check a few of the more complex indicators and make sure they look/behave the same as the live site?

brockfanning commented 3 years ago

I looked a few and saw no differences, so I'll go ahead and merge.