owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl
MIT License
58 stars 18 forks source link

:sparkles: Speed-up grapher upsert #2899

Closed Marigold closed 3 days ago

Marigold commented 4 days ago

Improve grapher upsert performance. Tested on

ssh owid@staging-site-speedup-grapher-upsert "cd etl && poetry run python -m cProfile -s cumtime etl/command.py grapher://grapher/ihme_gbd/2024-05-20/gbd_incidence --private --grapher --only --force --workers 1" | head -n 100

It speeds it up from 1200s to ~800s.

Other updates:

owidbot commented 4 days ago
Quick links (staging server): Site Admin Wizard

Login: ssh owid@staging-site-speedup-grapher-upsert

chart-diff: ✅ No charts for review.
data-diff: ```diff = Dataset garden/agriculture/2024-03-26/attainable_yields = Table attainable_yields = Dataset garden/agriculture/2024-03-26/long_term_crop_yields = Table long_term_crop_yields = Dataset garden/agriculture/2024-03-26/long_term_wheat_yields = Table long_term_wheat_yields = Dataset garden/agriculture/2024-03-26/uk_long_term_yields = Table uk_long_term_yields = Dataset garden/agriculture/2024-05-23/daily_calories_per_person = Table daily_calories_per_person = Dataset garden/animal_welfare/2023-08-08/farmed_finfishes_used_for_food = Table farmed_finfishes_used_for_food = Dataset garden/animal_welfare/2023-08-14/number_of_farmed_fish = Table number_of_farmed_fish = Dataset garden/animal_welfare/2023-08-15/number_of_farmed_decapod_crustaceans = Table number_of_farmed_decapod_crustaceans = Dataset garden/animal_welfare/2023-08-16/number_of_wild_fish_killed_for_food = Table number_of_wild_fish_killed_for_food ~ Dataset garden/animal_welfare/2024-05-20/animals_used_for_food + + update_period_days: 365 = Table animals_used_for_food = Dataset garden/artificial_intelligence/2023-07-25/cset = Table cset = Dataset garden/climate_watch/2023-10-31/emissions_by_sector = Table greenhouse_gas_emissions_by_sector = Table nitrous_oxide_emissions_by_sector = Table methane_emissions_by_sector = Table fluorinated_gas_emissions_by_sector = Table carbon_dioxide_emissions_by_sector = Dataset garden/countries/2023-09-25/gleditsch = Table gleditsch_countries = Table gleditsch = Table gleditsch_regions = Dataset garden/countries/2023-09-25/isd = Table isd = Table isd_countries = Table isd_regions = Dataset garden/countries/2023-09-29/cow_ssm = Table cow_ssm_regions = Table cow_ssm_majors = Table cow_ssm_states = Table cow_ssm_system = Table cow_ssm_countries = Dataset garden/democracy/2024-03-07/bmr = Table num_countries_regime = Table population_regime_years = Table population_regime = Table bmr = Table num_countries_regime_years = Dataset garden/democracy/2024-03-07/eiu = Table num_people = Table num_countries = Table avg_pop = Table eiu = Dataset garden/democracy/2024-03-07/ert = Table ert = Table region_aggregates = Dataset garden/democracy/2024-03-07/fh = Table fh = Table fh_regions = Dataset garden/democracy/2024-03-07/lexical_index = Table region_aggregates = Table lexical_index = Dataset garden/democracy/2024-03-07/polity = Table num_people = Table num_countries = Table avg_pop = Table polity = Dataset garden/democracy/2024-03-07/vdem = Table vdem_population = Table vdem_num_countries = Table vdem_multi_with_regions = Table vdem_multi_without_regions = Table vdem = Dataset garden/demography/2022-12-08/population = Table population = Dataset garden/demography/2023-03-31/population = Table population = Table population_original = Dataset garden/demography/2023-06-12/population_density = Table population_density = Dataset garden/demography/2023-06-27/world_population_comparison = Table world_population_comparison = Dataset garden/demography/2023-07-03/population_doubling_times = Table population_doubling_times = Dataset garden/education/2023-07-17/education_barro_lee_projections = Table education_barro_lee_projections = Dataset garden/education/2023-07-17/education_lee_lee = Table education_lee_lee = Dataset garden/ember/2024-05-08/yearly_electricity = Table yearly_electricity = Dataset garden/emdat/2024-04-11/natural_disasters = Table natural_disasters_yearly_deaths = Table natural_disasters_decadal = Table natural_disasters_decadal_impact = Table natural_disasters_yearly = Table natural_disasters_yearly_impact = Table natural_disasters_decadal_deaths = Dataset garden/emissions/2024-04-08/national_contributions = Table national_contributions = Dataset garden/emissions/2024-06-20/gdp_and_co2_decoupling = Table gdp_and_co2_decoupling = Dataset garden/energy/2024-06-20/electricity_mix = Table electricity_mix = Dataset garden/energy/2024-06-20/energy_mix = Table energy_mix = Dataset garden/energy/2024-06-20/fossil_fuel_production = Table fossil_fuel_production = Dataset garden/energy/2024-06-20/primary_energy_consumption = Table primary_energy_consumption = Dataset garden/energy/2024-06-20/uk_historical_electricity = Table uk_historical_electricity = Dataset garden/eth/2023-03-15/ethnic_power_relations = Table ethnic_power_relations = Dataset garden/faostat/2024-03-14/additional_variables = Table share_of_sustainable_and_overexploited_fish = Table agriculture_land_use_evolution = Table land_spared_by_increased_crop_yields = Table fertilizer_exports = Table maize_and_wheat = Table area_used_per_crop_type = Table cereal_allocation = Table macronutrient_compositions = Table fertilizers = Table hypothetical_meat_consumption = Table food_available_for_consumption = Table arable_land_per_crop_output = Table vegetable_oil_yields = Dataset garden/faostat/2024-03-14/faostat_cahd = Table faostat_cahd = Table faostat_cahd_flat = Dataset garden/faostat/2024-03-14/faostat_ei = Table faostat_ei = Table faostat_ei_flat = Dataset garden/faostat/2024-03-14/faostat_ek = Table faostat_ek = Table faostat_ek_flat = Dataset garden/faostat/2024-03-14/faostat_emn = Table faostat_emn_flat = Table faostat_emn = Dataset garden/faostat/2024-03-14/faostat_esb = Table faostat_esb_flat = Table faostat_esb = Dataset garden/faostat/2024-03-14/faostat_fa = Table faostat_fa_flat = Table faostat_fa = Dataset garden/faostat/2024-03-14/faostat_fbsc = Table faostat_fbsc_flat = Table faostat_fbsc = Dataset garden/faostat/2024-03-14/faostat_fo = Table faostat_fo_flat = Table faostat_fo = Dataset garden/faostat/2024-03-14/faostat_food_explorer = Table faostat_food_explorer = Dataset garden/faostat/2024-03-14/faostat_fs = Table faostat_fs_flat = Table faostat_fs = Dataset garden/faostat/2024-03-14/faostat_ic = Table faostat_ic_flat = Table faostat_ic = Dataset garden/faostat/2024-03-14/faostat_lc = Table faostat_lc = Table faostat_lc_flat = Dataset garden/faostat/2024-03-14/faostat_qcl = Table faostat_qcl = Table faostat_qcl_flat = Dataset garden/faostat/2024-03-14/faostat_qi = Table faostat_qi_flat = Table faostat_qi = Dataset garden/faostat/2024-03-14/faostat_qv = Table faostat_qv_flat = Table faostat_qv = Dataset garden/faostat/2024-03-14/faostat_rfb = Table faostat_rfb = Table faostat_rfb_flat = Dataset garden/faostat/2024-03-14/faostat_rfn = Table faostat_rfn = Table faostat_rfn_flat = Dataset garden/faostat/2024-03-14/faostat_rl = Table faostat_rl = Table faostat_rl_flat = Dataset garden/faostat/2024-03-14/faostat_rp = Table faostat_rp = Table faostat_rp_flat = Dataset garden/faostat/2024-03-14/faostat_rt = Table faostat_rt = Table faostat_rt_flat = Dataset garden/faostat/2024-03-14/faostat_scl = Table faostat_scl = Table faostat_scl_flat = Dataset garden/faostat/2024-03-14/faostat_sdgb = Table faostat_sdgb = Table faostat_sdgb_flat = Dataset garden/faostat/2024-03-14/faostat_tcl = Table faostat_tcl = Table faostat_tcl_flat = Dataset garden/faostat/2024-03-14/faostat_ti = Table faostat_ti_flat = Table faostat_ti = Dataset garden/gcp/2024-06-20/global_carbon_budget = Table global_carbon_budget = Dataset garden/happiness/2024-06-09/happiness = Table happiness = Dataset garden/harvard/2023-09-18/colonial_dates_dataset = Table colonial_dates_dataset = Dataset garden/health/2023-08-09/unaids = Table unaids = Dataset garden/health/2023-08-16/deaths_karlinsky = Table deaths_karlinsky = Dataset garden/health/2024-04-02/organ_donation_and_transplantation = Table organ_donation_and_transplantation = Dataset garden/lgbt_rights/2023-04-27/lgbti_policy_index = Table lgbti_policy_index = Dataset garden/lgbt_rights/2024-06-03/equaldex = Table equaldex = Dataset garden/lgbt_rights/2024-06-11/criminalization_mignot = Table criminalization_mignot = Dataset garden/missing_data/2024-03-26/children_out_of_school = Table children_out_of_school = Dataset garden/missing_data/2024-03-26/who_md_suicides = Table who_md_suicides = Dataset garden/missing_data/2024-03-26/who_neuropsychiatric_conditions = Table neuropsychiatric_conditions = Dataset garden/news/2024-05-08/guardian_mentions = Table avg_10y ~ Dim country + + New values: 30 / 279 (10.75%) year country 2023 Akrotiri and Dhekelia 2023 European Union (27) 2023 Low-income countries 2023 World 2023 Yemen People's Republic ~ Dim year + + New values: 30 / 279 (10.75%) country year Akrotiri and Dhekelia 2023 European Union (27) 2023 Low-income countries 2023 World 2023 Yemen People's Republic 2023 ~ Column num_pages_mentions_10y_avg (new data) + + New values: 30 / 279 (10.75%) country year num_pages_mentions_10y_avg Akrotiri and Dhekelia 2023 NaN European Union (27) 2023 NaN Low-income countries 2023 NaN World 2023 NaN Yemen People's Republic 2023 NaN ~ Column num_pages_mentions_per_million_10y_avg (new data) + + New values: 30 / 279 (10.75%) country year num_pages_mentions_per_million_10y_avg Akrotiri and Dhekelia 2023 NaN European Union (27) 2023 NaN Low-income countries 2023 NaN World 2023 NaN Yemen People's Republic 2023 NaN ~ Column num_pages_tags_10y_avg (new data) + + New values: 30 / 279 (10.75%) country year num_pages_tags_10y_avg Akrotiri and Dhekelia 2023 NaN European Union (27) 2023 NaN Low-income countries 2023 NaN World 2023 NaN Yemen People's Republic 2023 NaN ~ Column num_pages_tags_per_million_10y_avg (new data) + + New values: 30 / 279 (10.75%) country year num_pages_tags_per_million_10y_avg Akrotiri and Dhekelia 2023 NaN European Union (27) 2023 NaN Low-income countries 2023 NaN World 2023 NaN Yemen People's Republic 2023 NaN ~ Column relative_pages_mentions_10y_avg (new data) + + New values: 30 / 279 (10.75%) country year relative_pages_mentions_10y_avg Akrotiri and Dhekelia 2023 NaN European Union (27) 2023 NaN Low-income countries 2023 NaN World 2023 NaN Yemen People's Republic 2023 NaN ~ Column relative_pages_mentions_excluded_10y_avg (new data) + + New values: 30 / 279 (10.75%) country year relative_pages_mentions_excluded_10y_avg Akrotiri and Dhekelia 2023 NaN European Union (27) 2023 NaN Low-income countries 2023 NaN World 2023 NaN Yemen People's Republic 2023 NaN ~ Column relative_pages_tags_10y_avg (new data) + + New values: 30 / 279 (10.75%) country year relative_pages_tags_10y_avg Akrotiri and Dhekelia 2023 NaN European Union (27) 2023 NaN Low-income countries 2023 NaN World 2023 NaN Yemen People's Republic 2023 NaN ~ Column relative_pages_tags_excluded_10y_avg (new data) + + New values: 30 / 279 (10.75%) country year relative_pages_tags_excluded_10y_avg Akrotiri and Dhekelia 2023 NaN European Union (27) 2023 NaN Low-income countries 2023 NaN World 2023 NaN Yemen People's Republic 2023 NaN = Table guardian_mentions = Dataset garden/oecd/2023-05-19/co2_air_transport = Table co2_air_transport = Dataset garden/owid/latest/key_indicators = Table population_density = Table land_area = Table population = Dataset garden/pew/2024-06-03/same_sex_marriage = Table same_sex_marriage = Dataset garden/research_development/2024-05-20/patents_wdi_unwpp = Table patents_articles ~ Dim country + + New values: 716 / 6288 (11.39%) year country 1999 Chad 2019 Ethiopia 2021 Finland 1997 Kosovo 1996 Niger ~ Dim year + + New values: 716 / 6288 (11.39%) country year Chad 1999 Ethiopia 2019 Finland 2021 Kosovo 1997 Niger 1996 ~ Column articles_per_million (new data, changed data) + + New values: 716 / 6288 (11.39%) country year articles_per_million Chad 1999 0.448515 Ethiopia 2019 22.667601 Finland 2021 NaN Kosovo 1997 0.000000 Niger 1996 1.903460 ~ Changed values: 3662 / 6288 (58.24%) country year articles_per_million - articles_per_million + Israel 2001 1489.280884 1502.079834 Israel 2005 1557.984009 1563.316040 Norway 2014 2079.426758 2033.840088 Philippines 2000 4.384143 4.434683 Singapore 2011 1933.671997 1920.811646 ~ Column patents_per_million (new data, changed data) + + New values: 716 / 6288 (11.39%) country year patents_per_million Chad 1999 NaN Ethiopia 2019 NaN Finland 2021 281.250916 Kosovo 1997 NaN Niger 1996 NaN ~ Changed values: 19 / 6288 (0.30%) country year patents_per_million - patents_per_million + Australia 1980 447.562408 26.995180 World 1995 122.405205 122.387794 World 2013 234.546326 235.663467 World 2017 295.177948 296.243774 Zambia 2020 0.845321 0.898154 = Dataset garden/smoking/2024-05-30/cigarette_sales = Table cigarette_sales = Dataset garden/state_capacity/2023-10-19/state_capacity_dataset = Table state_capacity_dataset = Dataset garden/technology/2022/internet = Table users = Dataset garden/terrorism/2023-07-20/global_terrorism_database = Table global_terrorism_database = Dataset garden/tourism/2023-05-05/unwto = Table unwto = Dataset garden/tuberculosis/2023-11-27/burden_disaggregated = Table burden_disaggregated_rate = Table burden_disaggregated ~ Dim country + + New values: 1632 / 20181 (8.09%) year age_group sex risk_factor country 2022 45-54 a smk Asia (UN) 2022 65plus a und Isle of Man 2022 15-24 a und Kosovo 2022 45-54 a dia Kosovo 2022 55-64 a smk Saint Helena ~ Dim year + + New values: 1632 / 20181 (8.09%) country age_group sex risk_factor year Asia (UN) 45-54 a smk 2022 Isle of Man 65plus a und 2022 Kosovo 15-24 a und 2022 Kosovo 45-54 a dia 2022 Saint Helena 55-64 a smk 2022 ~ Dim age_group + + New values: 1632 / 20181 (8.09%) country year sex risk_factor age_group Asia (UN) 2022 a smk 45-54 Isle of Man 2022 a und 65plus Kosovo 2022 a und 15-24 Kosovo 2022 a dia 45-54 Saint Helena 2022 a smk 55-64 ~ Dim sex + + New values: 1632 / 20181 (8.09%) country year age_group risk_factor sex Asia (UN) 2022 45-54 smk a Isle of Man 2022 65plus und a Kosovo 2022 15-24 und a Kosovo 2022 45-54 dia a Saint Helena 2022 55-64 smk a ~ Dim risk_factor + + New values: 1632 / 20181 (8.09%) country year age_group sex risk_factor Asia (UN) 2022 45-54 a smk Isle of Man 2022 65plus a und Kosovo 2022 15-24 a und Kosovo 2022 45-54 a dia Saint Helena 2022 55-64 a smk ~ Column best (new data) + + New values: 1632 / 20181 (8.09%) country year age_group sex risk_factor best Asia (UN) 2022 45-54 a smk 0 Isle of Man 2022 65plus a und 0 Kosovo 2022 15-24 a und 0 Kosovo 2022 45-54 a dia 0 Saint Helena 2022 55-64 a smk 0 ~ Column hi (new data) + + New values: 1632 / 20181 (8.09%) country year age_group sex risk_factor hi Asia (UN) 2022 45-54 a smk 0 Isle of Man 2022 65plus a und 0 Kosovo 2022 15-24 a und 0 Kosovo 2022 45-54 a dia 0 Saint Helena 2022 55-64 a smk 0 ~ Column lo (new data) + + New values: 1632 / 20181 (8.09%) country year age_group sex risk_factor lo Asia (UN) 2022 45-54 a smk 0 Isle of Man 2022 65plus a und 0 Kosovo 2022 15-24 a und 0 Kosovo 2022 45-54 a dia 0 Saint Helena 2022 55-64 a smk 0 = Dataset garden/tuberculosis/2023-11-27/burden_estimates = Table burden_estimates = Dataset garden/un/2023-08-02/comtrade_pandemics = Table comtrade_pandemics = Dataset garden/un/2023-10-09/plastic_waste = Table plastic_waste = Dataset garden/un/2023-10-30/un_members = Table un_members = Dataset garden/unep/2023-03-17/consumption_controlled_substances = Table consumption_controlled_substances = Dataset garden/war/2023-09-21/brecke = Table brecke = Dataset garden/war/2023-09-21/cow = Table cow = Table cow_locations = Table cow_country = Dataset garden/war/2023-09-21/cow_mid = Table cow_mid = Table cow_mid_country = Dataset garden/war/2023-09-21/mars = Table mars_country = Table mars = Dataset garden/war/2023-09-21/mie = Table mie = Table mie_country = Dataset garden/war/2023-09-21/prio_v31 = Table prio_v31 = Table prio_v31_country = Dataset garden/war/2023-09-21/ucdp = Table ucdp = Table ucdp_locations = Table ucdp_country = Dataset garden/war/2023-09-21/ucdp_prio = Table ucdp_prio = Dataset garden/war/2023-09-27/peace_diehl = Table peace_diehl_agg = Table peace_diehl = Dataset garden/war/2024-01-23/nuclear_weapons_treaties = Table nuclear_weapons_treaties_country_counts = Table nuclear_weapons_treaties = Dataset garden/wash/2024-01-06/who = Table who 2024-06-27 09:26:15 [error ] Traceback (most recent call last): File "/home/owid/etl/.venv/lib/python3.10/site-packages/requests/models.py", line 974, in json return complexjson.loads(self.text, **kwargs) File "/home/owid/etl/.venv/lib/python3.10/site-packages/simplejson/__init__.py", line 514, in loads return _default_decoder.decode(s) File "/home/owid/etl/.venv/lib/python3.10/site-packages/simplejson/decoder.py", line 386, in decode obj, end = self.raw_decode(s) File "/home/owid/etl/.venv/lib/python3.10/site-packages/simplejson/decoder.py", line 416, in raw_decode return self.scan_once(s, idx=_w(s, idx).end()) simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/owid/etl/etl/datadiff.py", line 423, in cli lines = future.result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.__get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result raise self._exception File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) File "/home/owid/etl/etl/datadiff.py", line 416, in func differ.summary() File "/home/owid/etl/etl/datadiff.py", line 254, in summary self._diff_tables(self.ds_a, self.ds_b, table_name) File "/home/owid/etl/etl/datadiff.py", line 122, in _diff_tables table_a = future_a.result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.__get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result raise self._exception File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) File "/home/owid/etl/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 330, in wrapped_f return self(f, *args, **kw) File "/home/owid/etl/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 467, in __call__ do = self.iter(retry_state=retry_state) File "/home/owid/etl/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 368, in iter result = action(retry_state) File "/home/owid/etl/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 390, in self._add_action_func(lambda rs: rs.outcome.result()) File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result return self.__get_result() File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result raise self._exception File "/home/owid/etl/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 470, in __call__ result = fn(*args, **kwargs) File "/home/owid/etl/etl/datadiff.py", line 837, in get_table_with_retry return ds[table_name] File "/home/owid/etl/etl/datadiff.py", line 278, in __getitem__ return tables.load() File "/home/owid/etl/lib/catalog/owid/catalog/catalogs.py", line 312, in load return self.iloc[0].load() # type: ignore File "/home/owid/etl/lib/catalog/owid/catalog/catalogs.py", line 363, in load return Table.read(uri) File "/home/owid/etl/lib/catalog/owid/catalog/tables.py", line 177, in read table = cls.read_feather(path) File "/home/owid/etl/lib/catalog/owid/catalog/tables.py", line 360, in read_feather cls._add_metadata(df, path) File "/home/owid/etl/lib/catalog/owid/catalog/tables.py", line 333, in _add_metadata metadata = cls._read_metadata(path) File "/home/owid/etl/lib/catalog/owid/catalog/tables.py", line 394, in _read_metadata return cast(Dict[str, Any], requests.get(metadata_path).json()) File "/home/owid/etl/.venv/lib/python3.10/site-packages/requests/models.py", line 978, in json raise RequestsJSONDecodeError(e.msg, e.doc, e.pos) requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0) = Dataset garden/who/2024-02-14/gho_suicides = Table gho_suicides_ratio = Table gho_suicides = Dataset garden/who/2024-04-08/polio = Table polio = Dataset garden/who/2024-05-20/vehicles = Table vehicles ⚠ Found errors, create an issue please Legend: +New ~Modified -Removed =Identical Details Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet ``` Automatically updated datasets matching _weekly_wildfires|excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk_ are not included

Edited: 2024-06-27 09:26:19 UTC Execution time: 809.70 seconds