Closed rosecers closed 1 year ago
I think most test errors will go away once this branch is rebased.
I don't think we can upload the dataset to our repo. There is no license information on Kaggle that would allow redistribution: https://academia.stackexchange.com/a/63157
I will contact the person first, but if this does not work, I think we have to download the dataset from Kaggle within the notebook to be on the safe side. Since there is no direct download link available, I think we have to use the Kaggle API, make it a dependency, and use
kaggle datasets download -d kumarajarshi/life-expectancy-who
with some account information, which we would also need to include here. Not ideal; I would prefer that the owner just allows us to upload it.
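For reference, a minimal sketch of how the notebook or a CI step could wrap that command. It assumes the `kaggle` CLI is installed and that the credentials are passed through the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables; the function names and the `data` target directory are hypothetical, and the `-p` and `--unzip` flags are additions to the command quoted above. Requiring the environment variables is stricter than the CLI itself, which can also read a local token file.

```python
import os
import shutil
import subprocess

# Dataset slug taken from the command quoted above.
DATASET = "kumarajarshi/life-expectancy-who"


def build_download_command(dataset=DATASET, target_dir="data"):
    """Return the kaggle CLI call as an argument list (nothing is executed here)."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", target_dir, "--unzip"]


def download_dataset(env=None, target_dir="data"):
    """Run the download, failing early if credentials or the CLI are missing."""
    env = dict(os.environ if env is None else env)
    # Assumption: credentials come from the environment (e.g. CI secrets).
    missing = [key for key in ("KAGGLE_USERNAME", "KAGGLE_KEY") if not env.get(key)]
    if missing:
        raise RuntimeError("missing Kaggle credentials: " + ", ".join(missing))
    if shutil.which("kaggle") is None:
        raise RuntimeError("the kaggle CLI is not installed")
    subprocess.run(build_download_command(target_dir=target_dir), check=True, env=env)
```

Keeping the token in CI secrets rather than in the yml itself would avoid committing it to the repository.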
I think the notebooks are nice as they are. Only things
WhoDataset-PCovR.ipynb
a faster? The "## Train the Different Kernel DR Techniques" section takes way too long.
Okay, contacting people on Kaggle is only possible if you have a higher-tier account, and for that you need to fill out your profile, contribute some content, and get upvotes. I think downloading it from Kaggle with some credentials is the easier solution. I made an account with my EPFL address and added the token to the yml for the test. I think the chance of abuse is very low.
What i did (should be a
Heya. Isn't this data taken from somewhere? Must be. If that's the case, we can also fetch it from the original source so we make it available in a more open format.
Yes, the specific dataset is taken from Kaggle: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who You need an account to access the dataset. This PR added credentials to the yml files to download it with an account I made specifically for this purpose.
It has no license attached, so redistributing this dataset seems troublesome to me. But I think the dataset itself is a merge of different WHO datasets like this one: https://apps.who.int/gho/data/node.main.688 Maybe it's worth spending the time to merge the different datasets ourselves, since the WHO licence allows redistribution:
The CC BY-NC-SA 3.0 IGO licence allows users to freely copy, reproduce, reprint, distribute, translate and adapt the work for non-commercial purposes, provided WHO is acknowledged as the source using the following suggested citation: [...]
https://www.who.int/data/gho/data/themes/mortality-and-global-health-estimates/ghe-life-expectancy-and-healthy-life-expectancy EDIT: wrong link, meant https://www.who.int/about/policies/publishing/copyright
Uhm. I think that if they used those datasets, then based on the terms of CC BY-NC-SA 3.0 IGO, the Kaggle dataset should also be distributed under CC BY-NC-SA 3.0 IGO (otherwise they're in breach of the -SA provisions). I'd say that if it takes less than a couple of hours, it might be better to assemble a file from the original WHO data (and distribute it with a clear CC BY-NC-SA 3.0 IGO licence). Otherwise, I'd read CC BY-NC-SA 3.0 IGO carefully and check whether it implies we can reuse the Kaggle assembly, assuming it is also CC BY-NC-SA 3.0 IGO.
Ah right, this should be under the ShareAlike constraint:
ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
https://creativecommons.org/licenses/by-nc-sa/3.0/igo/
Some licenses can be made proprietary with inheritance, so I was not sure until now. I will check whether all the data is available from WHO, and whether it's much effort to merge it (I guess not). I agree that it would be nicer to have it here as a dataset than to download it.
I compared the data from the WHO website with the data from Kaggle, and I can't find any interpretation that makes the two agree with each other (checked for Afghanistan): https://www.who.int/data/gho/data/themes/topics/indicator-groups/indicator-group-details/GHO/life-expectancy-and-healthy-life-expectancy
I looked into the dataset and it has serious issues. Look at the population of Russia, which jumps by factors of 10 and 100 between years:
Country | Year | Population |
---|---|---|
Russian Federation | 2014 | 143819666.0 |
Russian Federation | 2013 | 14356911.0 |
Russian Federation | 2012 | 14321676.0 |
Russian Federation | 2011 | 14296868.0 |
Russian Federation | 2010 | 142849449.0 |
Russian Federation | 2009 | 142785342.0 |
Russian Federation | 2008 | 14274235.0 |
Russian Federation | 2007 | 1428588.0 |
Russian Federation | 2006 | 14349528.0 |
Russian Federation | 2005 | 143518523.0 |
Russian Federation | 2004 | 1446754.0 |
Russian Federation | 2003 | 144648257.0 |
Russian Federation | 2002 | 1453646.0 |
Russian Federation | 2001 | 14597683.0 |
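A quick way to surface this kind of problem is to look for implausible year-to-year jumps. The sketch below uses only the values from the table above; the function name and the `max_ratio` threshold of 2 are assumptions chosen for illustration, not part of the existing notebooks.

```python
# Population values for the Russian Federation, copied from the table above.
population = {
    2014: 143819666.0, 2013: 14356911.0, 2012: 14321676.0, 2011: 14296868.0,
    2010: 142849449.0, 2009: 142785342.0, 2008: 14274235.0, 2007: 1428588.0,
    2006: 14349528.0, 2005: 143518523.0, 2004: 1446754.0, 2003: 144648257.0,
    2002: 1453646.0, 2001: 14597683.0,
}


def implausible_jumps(series, max_ratio=2.0):
    """Return (previous_year, year) pairs whose values differ by more than max_ratio."""
    years = sorted(series)
    jumps = []
    for prev, curr in zip(years, years[1:]):
        ratio = series[curr] / series[prev]
        # Flag both sudden growth and sudden shrinkage.
        if ratio > max_ratio or ratio < 1.0 / max_ratio:
            jumps.append((prev, curr))
    return jumps


jumps = implausible_jumps(population)
print(f"{len(jumps)} of {len(population) - 1} consecutive year pairs look implausible")
for prev, curr in jumps:
    print(prev, "->", curr, population[prev], "->", population[curr])
```

The check does not say which values are correct; it only shows that consecutive years disagree by roughly an order of magnitude or more, which is not plausible for a national population.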
We also need to redo the analysis once we have made our own dataset.
@agoscinski I have updated the examples. During my rebase, I noticed a lot of documentation changes from your pushes -- do you want these in this PR?
The documentation changes should come from the last merged PR. We still need
who_dataset.csv
mortality.csv
), because it's a mix of the WHO and World Bank datasets.
Also, when I run the notebook, I get different results than in the paper. Just want to be sure everything is correct.
A publicly available version of the examples used in the paper text