owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl
MIT License

🎉 App to find similar insights #3518

Closed pabloarosado closed 2 weeks ago

pabloarosado commented 2 weeks ago

Create a script that launches a streamlit app to do a semantic search over data insights.

The script loads and parses data insights (from the database), creates embeddings for them, and sorts DIs by semantic similarity with respect to a given input string. On my laptop, creating the embeddings takes less than 10 seconds, but ideally this should happen under the hood, with the embeddings stored in the database. For now, this is an experiment. If we decide it's useful, we can integrate it into our wizard.
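As a rough illustration of the sorting step described above, here is a minimal sketch that ranks items by cosine similarity to a query vector. It makes several assumptions: the PR does not say which similarity metric the script uses, so cosine similarity is a guess; the function names `cosine_similarity` and `rank_by_similarity` are hypothetical; and the three-dimensional vectors are made-up toy values standing in for the embeddings a real model would produce.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: Sequence[float], embeddings: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (title, score) pairs sorted from most to least similar."""
    scored = [
        (title, cosine_similarity(query_vec, vec))
        for title, vec in embeddings.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy, hand-written vectors (assumption): a real run would embed the text of
# each data insight and the input string with the same model.
toy_embeddings = {
    "Insight about child mortality": [0.9, 0.1, 0.0],
    "Insight about solar energy": [0.0, 0.2, 0.9],
    "Insight about life expectancy": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query, toy_embeddings)
for title, score in ranking:
    print(f"{score:.3f}  {title}")
```

Because the embeddings of the data insights do not depend on the query, they could be computed once and cached, and only the query would need to be embedded at search time.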

I think it would be useful to have something like this on our wizard. For authors, it could be useful to find what has already been written about a certain topic. And for data peeps, it can open doors to do other kinds of analytics and experiments with our content.

The downside is that it requires installing some big libraries (transformers and pytorch). The first time it's built, it needs to download some models, which are ~100MB. But maybe having these libraries can be useful for other similar applications.

owidbot commented 2 weeks ago
Quick links (staging server): Site Admin Wizard Docs

Login: ssh owid@staging-site-app-to-find-similar-insights

chart-diff: ✅ No charts for review.
data-diff: ❌ Found differences

= Dataset garden/un/2024-04-09/undp_hdr
  = Table undp_hdr
    ~ Columns with changed metadata: abr, co2_prod, coef_ineq, diff_hdi_phdi, eys, eys_f, eys_m, gdi, gii, gni_pc_f, gni_pc_m, gnipc, hdi, hdi_f, hdi_m, ihdi, ineq_edu, ineq_inc, ineq_le, le, le_f, le_m, lfpr_f, lfpr_m, mf, mmr, mys, mys_f, mys_m, phdi, pop_total, pr_f, pr_m, se_f, se_m
      The changed metadata line is the same for every column:
      - We calculated averages over continents and income groups by taking the population-weighted average of the countries in each group. If less than 80% of countries in an area report data for a given year, we do not calculate the average for that area.
    ~ Columns with changed metadata and changed data (description_processing added with the same text as above):
      gdi_group: changed values 11 / 7161 (0.15%). Sample rows (country, year, value): Europe 2022 1.268419; High-income countries 2022 1.392950; Lower-middle-income countries 2022 4.389009; South America 2022 1.150919; Upper-middle-income countries 2022 2.057359
      gii_rank: changed values 9 / 7161 (0.13%). Sample rows (country, year, value): Asia 2022 3579; Europe 2022 1089; High-income countries 2022 1832; South America 2022 1092; Upper-middle-income countries 2022 3799
      hdi_rank: changed values 11 / 7161 (0.15%). Sample rows (country, year, value): Europe 2022 1537; High-income countries 2022 2161; Lower-middle-income countries 2022 7099; South America 2022 1054; Upper-middle-income countries 2022 4964
      loss: changed values 82 / 7161 (1.15%). Sample rows (country, year, loss -, loss +): Africa 2015 NaN 1714.844482; Africa 2021 NaN 1684.291626; Europe 2020 NaN 351.625641; High-income countries 2019 NaN 535.104187; Lower-middle-income countries 2018 NaN 1293.413940
      rankdiff_hdi_phdi: changed values 6 / 7161 (0.08%). Sample rows (country, year, value): Africa 2022 98; Asia 2022 -340; Europe 2022 100; European Union (27) 2022 79; South America 2022 130

= Dataset garden/who/2024-09-09/flu_test
  = Table flu_test
    ~ Dim country: removed values 63 / 71983 (0.09%). Sample rows (date, country): 2024-10-28 Malta; 2024-10-28 Qatar; 2024-10-28 Slovenia; 2024-10-28 South Africa; 2024-10-07 Zambia
    ~ Dim date: removed values 63 / 71983 (0.09%). Sample rows (country, date): Malta 2024-10-28; Qatar 2024-10-28; Slovenia 2024-10-28; South Africa 2024-10-28; Zambia 2024-10-07
    ~ Column denomcombined (changed data)
      Removed values: 63 / 71983 (0.09%). Sample rows (country, date, denomcombined): Malta 2024-10-28 301; Qatar 2024-10-28 765; Slovenia 2024-10-28 983; South Africa 2024-10-28 85; Zambia 2024-10-07 110
      Changed values: 106 / 71983 (0.15%). Sample rows (country, date, denomcombined -, denomcombined +): Brazil 2024-10-21 5188 4218; Honduras 2024-10-07 70 68; Indonesia 2023-10-09 37 38; Slovenia 2024-10-21 1224 1183; Uganda 2024-09-23 58 51
    ~ Column pcnt_poscombined (changed data)
      Removed values: 63 / 71983 (0.09%). Sample rows (country, date, pcnt_poscombined): Malta 2024-10-28 2.325581; Qatar 2024-10-28 17.385620; Slovenia 2024-10-28 0.305188; South Africa 2024-10-28 5.882353; Zambia 2024-10-07 3.636364
      Changed values: 114 / 71983 (0.16%). Sample rows (country, date, pcnt_poscombined -, pcnt_poscombined +): Costa Rica 2024-10-07 0.326442 0.326797; Denmark 2024-10-21 1.134791 1.140251; Indonesia 2023-08-28 43.478260 40.000000; Indonesia 2024-04-22 23.809525 24.390244; South Africa 2024-09-16 8.730159 8.800000

Legend: +New ~Modified -Removed =Identical

Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet

Automatically updated datasets matching _weekly_wildfires|excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk_ are not included

Edited: 2024-11-11 09:57:39 UTC Execution time: 15.04 seconds

Marigold commented 2 weeks ago

@lucasrodes could you review it please? I can't install torch on my laptop due to this issue. It's probably solvable, but I've already spent an hour on it and didn't make any progress.

pabloarosado commented 2 weeks ago

> @lucasrodes could you review it please? I can't install torch on my laptop due to this issue. It's probably solvable, but I've already spent an hour on it and didn't make any progress.

Thanks Mojmir, I'm sorry about that issue, it sounds annoying! If you want I can add this app temporarily to wizard, so you can play with it (in any case I'm also happy if Lucas wants to have a look, or both).

pabloarosado commented 2 weeks ago

Hey @Marigold, I've moved it to wizard, so you can try it out. But of course, if this is going to break your ETL environment, we shouldn't push it. I find it very useful, and having that library on ETL could also let us experiment with other similar things, but we can also move it to its own repo if it's problematic (or discard it if others don't find it useful; it's just an experiment). Let me know what you think, thanks.