scikit-learn-contrib / scikit-matter

A collection of scikit-learn compatible utilities that implement methods born out of the materials science and chemistry communities
https://scikit-matter.readthedocs.io/en/v0.2.0/
BSD 3-Clause "New" or "Revised" License
70 stars, 18 forks

Adding a publicly available example for the WHO dataset #149

Closed rosecers closed 1 year ago

rosecers commented 1 year ago

A publicly available version of the examples used in the paper text

agoscinski commented 1 year ago

I think most test errors will go away once this branch is rebased.

I don't think we can upload the dataset to our repo. There is no license information on Kaggle that would allow redistribution: https://academia.stackexchange.com/a/63157

I will contact the person first, but if that does not work, I think we have to download the dataset from Kaggle within the notebook to be on the safe side. Since there is no direct download link available, I think we have to use the Kaggle API, make it a dependency, and use

kaggle datasets download -d kumarajarshi/life-expectancy-who

with some account information, which we also need to include here. Not ideal; I would prefer that the owner just allow us to upload it.

I think the notebooks are nice as they are. Only things
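A minimal sketch of how the notebook could wrap that command, assuming the `kaggle` command-line tool is installed and that the credentials are supplied through the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables (the CLI can also read them from `~/.kaggle/kaggle.json`, which this sketch does not check). The helper names `build_download_command`, `have_credentials` and `download`, and the `data` destination folder, are hypothetical and not part of this PR.

```python
import os
import shutil
import subprocess

# Dataset slug taken from the command quoted above.
DATASET = "kumarajarshi/life-expectancy-who"


def build_download_command(dataset, dest):
    """Build the kaggle CLI call; -p sets the target folder, --unzip extracts the archive."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", dest, "--unzip"]


def have_credentials(env=None):
    """Check only the environment variables (assumption: no kaggle.json is used)."""
    env = os.environ if env is None else env
    return bool(env.get("KAGGLE_USERNAME")) and bool(env.get("KAGGLE_KEY"))


def download(dataset=DATASET, dest="data"):
    """Download the dataset, failing early with a readable message."""
    if shutil.which("kaggle") is None:
        raise RuntimeError("kaggle CLI not found; install the 'kaggle' package first")
    if not have_credentials():
        raise RuntimeError("set KAGGLE_USERNAME and KAGGLE_KEY before downloading")
    subprocess.run(build_download_command(dataset, dest), check=True)
```

Calling `download()` at the top of the notebook would then fetch the files at run time instead of shipping them in the repo, at the cost of a network and account dependency for anyone running the example.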

agoscinski commented 1 year ago

Okay, contacting people on Kaggle is only possible if you have a higher-tier account, and for that you need to fill out your profile, contribute some content, and get upvotes. I think downloading it from Kaggle with some credentials is the easier solution. I made an account with my EPFL address and added the token to the yml for the tests. I think the chance of abuse is very small.

What I did (should be a

ceriottm commented 1 year ago

Heya. Isn't this data taken from somewhere? Must be. If that's the case, we can also fetch it from the original source so we make it available in a more open format.

agoscinski commented 1 year ago

Yes, the specific dataset is taken from Kaggle: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who You need an account to access the dataset. This PR added credentials to the yml files so it can be downloaded with an account I made specifically for this purpose.

It has no license attached, so redistributing this dataset seems troublesome to me. But I think the dataset itself is a merge of different WHO datasets like this one: https://apps.who.int/gho/data/node.main.688 Maybe it's worth spending the time to merge the different datasets ourselves, since the WHO licence allows redistribution:

The CC BY-NC-SA 3.0 IGO licence allows users to freely copy, reproduce, reprint, distribute, translate and adapt the work for non-commercial purposes, provided WHO is acknowledged as the source using the following suggested citation: [...]

https://www.who.int/data/gho/data/themes/mortality-and-global-health-estimates/ghe-life-expectancy-and-healthy-life-expectancy EDIT: wrong link, meant https://www.who.int/about/policies/publishing/copyright

ceriottm commented 1 year ago

Uhm. I think that if they used those datasets, then under the terms of CC BY-NC-SA 3.0 IGO the Kaggle dataset should also be distributed under CC BY-NC-SA 3.0 IGO (otherwise they're in breach of the -SA provisions). I'd say that if it takes less than a couple of hours, it might be better to assemble a file from the original WHO data (and distribute it with a clear CC BY-NC-SA 3.0 IGO licence); otherwise I'd read CC BY-NC-SA 3.0 IGO carefully and check whether it implies we can reuse the Kaggle assembly, assuming it's also CC BY-NC-SA 3.0 IGO.

agoscinski commented 1 year ago

Ah right, this should fall under the ShareAlike constraint:

ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

https://creativecommons.org/licenses/by-nc-sa/3.0/igo/

Some licenses allow derived works to be made proprietary, so I was not sure until now. I will check whether all the data is available from WHO and whether it is much effort to merge it (I guess not). I agree it would be nicer to have it here as a dataset than to download it.

agoscinski commented 1 year ago

I compared the data from the WHO website with the data from Kaggle, and I can't find any interpretation that makes the two agree with each other (checked for Afghanistan): https://www.who.int/data/gho/data/themes/topics/indicator-groups/indicator-group-details/GHO/life-expectancy-and-healthy-life-expectancy

I looked into the dataset and it has serious issues. Look at the population of Russia:

Country             Year  Population
Russian Federation  2014  143819666.0
Russian Federation  2013   14356911.0
Russian Federation  2012   14321676.0
Russian Federation  2011   14296868.0
Russian Federation  2010  142849449.0
Russian Federation  2009  142785342.0
Russian Federation  2008   14274235.0
Russian Federation  2007    1428588.0
Russian Federation  2006   14349528.0
Russian Federation  2005  143518523.0
Russian Federation  2004    1446754.0
Russian Federation  2003  144648257.0
Russian Federation  2002    1453646.0
Russian Federation  2001   14597683.0

We also need to redo the analysis once we have made our own dataset.
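A minimal sketch of a sanity check that would flag this kind of problem automatically, using the Russian Federation population values from the table above. It assumes that a real population should not change by more than a factor of 5 between adjacent years; the helper name `find_magnitude_jumps` and the threshold are illustrative choices, not something taken from this PR.

```python
# Population of the Russian Federation by year, copied from the table above.
population = {
    2014: 143819666.0,
    2013: 14356911.0,
    2012: 14321676.0,
    2011: 14296868.0,
    2010: 142849449.0,
    2009: 142785342.0,
    2008: 14274235.0,
    2007: 1428588.0,
    2006: 14349528.0,
    2005: 143518523.0,
    2004: 1446754.0,
    2003: 144648257.0,
    2002: 1453646.0,
    2001: 14597683.0,
}


def find_magnitude_jumps(series, factor=5.0):
    """Return (year_a, year_b) pairs of adjacent years whose values differ by more than `factor`."""
    years = sorted(series)
    jumps = []
    for prev, curr in zip(years, years[1:]):
        low, high = sorted((series[prev], series[curr]))
        # A non-positive value is treated as suspicious as well.
        if low <= 0 or high / low > factor:
            jumps.append((prev, curr))
    return jumps


jumps = find_magnitude_jumps(population)
print(f"{len(jumps)} of {len(population) - 1} adjacent-year pairs jump by more than 5x")
```

On these values, 10 of the 13 adjacent-year pairs are flagged; only 2009-2010, 2011-2012 and 2012-2013 pass. The check does not say which values are the correct ones, only that the column is internally inconsistent, so the same test could be run per country on any rebuilt dataset.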

rosecers commented 1 year ago

@agoscinski I have updated the examples. During my rebase, I noticed a lot of documentation changes from your pushes -- do you want these in this PR?

agoscinski commented 1 year ago

The documentation changes should come from the last merged PR. We still need

agoscinski commented 1 year ago

Also, when I run the notebook, I get different results than in the paper. I just want to be sure everything is correct.

[Image: who-selection]