owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl
MIT License

📊 cancer: IHME data on risk factors #3210

Closed veronikasamborska1994 closed 4 weeks ago

veronikasamborska1994 commented 1 month ago

Hi @spoonerf! I am adding an IHME dataset with more detailed data on cancers attributable to risk factors. I've noticed what looks like a bug in one of your datasets (here is an example chart), so I've largely reused your code, with the exception of the regional aggregations. I'm happy to fix your code too, but wanted to check with you first.

Also, I've noticed some odd values in the IHME dataset, with some numbers being negative (for example, here). Did you ever encounter this when you were uploading your data, and do you know whether it is just an error on the data provider's side, adding a minus sign where there shouldn't be one?

owidbot commented 1 month ago
Quick links (staging server): Site Admin Wizard

Login: ssh owid@staging-site-ihme-risk

chart-diff: ✅ No charts for review.
data-diff: ✅ No differences found

```diff
Legend: +New  ~Modified  -Removed  =Identical  Details
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet
```

Automatically updated datasets matching _weekly_wildfires|excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk_ are not included

Edited: 2024-09-04 14:10:38 UTC Execution time: 4.09 seconds

spoonerf commented 1 month ago

Hey @veronikasamborska1994,

Ahhh thanks for catching that!! Looks like I messed up all the aggregations for the 'Share of...' variables. It would be great if you could fix this, although bear in mind it will affect all the GBD datasets, so it will need to be merged after hours so as not to block up the ETL.

WRT the negative values, that's not something I've seen before, but it does look like it might be expected where risk factors have a negligible impact (based on the risk factors paper here). It might be worth digging into the supplementary material to see if there are any more specific details regarding prostate cancer.

Can do code review tomorrow or Thursday!

veronikasamborska1994 commented 1 month ago

Thanks @spoonerf! Looks like you are right: some "risks" are indeed protective and can lead to negative values. In the case of prostate cancer, it looks like "diet low in milk" is actually protective. I'll add a note on the chart if we decide to publish it with Saloni.
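
For anyone wondering how an attributable share can go negative, here is a minimal sketch using the textbook attributable-fraction formula for a single yes/no exposure. This is a simplification for illustration only, not IHME's actual method, and the prevalence and relative-risk numbers are made up.

```python
def attributable_fraction(prevalence: float, relative_risk: float) -> float:
    """Attributable fraction for one dichotomous exposure (textbook formula).

    prevalence: share of the population exposed (0 to 1).
    relative_risk: risk in the exposed relative to the unexposed.
    """
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical numbers, chosen only to show the sign change.
harmful = attributable_fraction(prevalence=0.5, relative_risk=2.0)
protective = attributable_fraction(prevalence=0.5, relative_risk=0.8)

print(f"relative risk above 1: {harmful:.3f}")
print(f"relative risk below 1: {protective:.3f}")
```

When the relative risk is below 1, the "risk" lowers the number of cases, so the attributable fraction (and the attributable count derived from it) comes out negative rather than being a sign error.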

I've changed the aggregations for the "Share of" indicators to simple averages for now, but we might want to consider population-weighted averages. Might be nice to merge and fix the bug first, though?
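
To make the difference between the two options concrete, here is a rough sketch in plain Python. The countries, shares, and populations are hypothetical, the function names are made up for this example and are not from the ETL code, and summing is shown only as one way a share aggregation can go wrong (not necessarily what the existing code did).

```python
def simple_mean(shares):
    """Unweighted average: every country counts equally."""
    return sum(shares) / len(shares)


def population_weighted_mean(shares, populations):
    """Average where each country's share is weighted by its population."""
    total_population = sum(populations)
    weighted_sum = sum(s * p for s, p in zip(shares, populations))
    return weighted_sum / total_population


# Hypothetical region with two countries (values are illustrative only).
rows = [
    {"country": "A", "share": 10.0, "population": 1_000_000},
    {"country": "B", "share": 30.0, "population": 9_000_000},
]
shares = [r["share"] for r in rows]
populations = [r["population"] for r in rows]

region_sum = sum(shares)  # adding percentages is not a meaningful share
region_simple = simple_mean(shares)
region_weighted = population_weighted_mean(shares, populations)

print(region_sum, region_simple, region_weighted)
```

With these made-up numbers the simple average treats the small and the large country the same, while the weighted version is pulled towards the larger country, which is why the two can differ quite a bit for regions with very uneven populations.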