thegraphnetwork / epigraphhub_py

Epigraphhub Python package
GNU General Public License v3.0
2 stars 9 forks source link

World bank data #151

Closed eduardocorrearaujo closed 2 years ago

eduardocorrearaujo commented 2 years ago

I created this PR to solve the issue #149.

I found a python package called wbgapi (https://blogs.worldbank.org/opendata/introducing-wbgapi-new-python-package-accessing-world-bank-data) that possibility the user to get the data from the world data bank.

Using the wbgapi package I created a function called get_pop_data to get the population data from the world data bank. I would like to know if that is the kind of function that you want to have in the epigraphhub_py package. If you have any other suggestions of datasets from the world data bank to implement a function let me know.

fccoelho commented 2 years ago

The module get_wbd_data.py should be renamed to worldbank.py

Always remember that names of variables, modules and packages need to be meaningful and readable. https://peps.python.org/pep-0008/#naming-conventions

dcpcamara commented 2 years ago

I liked the function, it is good to go in my opinion. My only concern is that there are 20k+ indicators in the WB database. Are you planning on making one function for each indicator? Maybe a more feasible idea would be download the dataset and leave the manipulation to the user. I tried to address this last solution in the R script (it isn't a function yet, just a demo of what can be done).

eduardocorrearaujo commented 2 years ago

Thank you for the feedback @dcpcamara. I understand your point.

In the wbgapi package, a function allows the user to search for keywords in the indicators and return the matched values using the function wb.series.info(). There is also a function that returns a data frame for a list of indicators and countries.

I can create a function that returns something similar, but I thought it would be interesting to have functions that allow us to get some useful indicators easily. These functions would be focused in the more valuable indicators for our projects. For this reason, the first indicator that I thought about was the population.

fccoelho commented 2 years ago

Regarding the large number of indicators, I agree with @eduardocorrearaujo 's approach to have some ready to use function for Commonly used indicators, but as @dcpcamara says we also need a generic implementation. in GHOclient I let the user search for the indicator by keyword, and then use a generic function thta returns the data corresponding to a specific indicator code.

eduardocorrearaujo commented 2 years ago

Regarding the large number of indicators, I agree with @eduardocorrearaujo 's approach to have some ready to use function for Commonly used indicators, but as @dcpcamara says we also need a generic implementation. in GHOclient I let the user search for the indicator by keyword, and then use a generic function thta returns the data corresponding to a specific indicator code.

Ok, I will make a generic implementation, and I'll share it with you.

eduardocorrearaujo commented 2 years ago

Hi, I added some functions to search indicators in the database and get the data to specific indicators. I also wrote an example of a documentation file for the functions that I created. I know that the documentation should be in another folder, but how Flavio created that folder after I open this PR, to have access to this folder I would need to open another PR. For this reason, to keep these similar files in the same PR, to be analyzed together, I preferred to put the documentation file here, for all take a look, and after approving the .py and the .md file, I will remove the .md file and I will make the merge and create a new PR to put the documentation in the right folder.

fccoelho commented 2 years ago

@eduardocorrearaujo if you are going to use examples from a Jupyter notebook, you can use the notebook file itself as the documentation, merging the text into the notebook as Markdown cells. You can also create hierarchies in the documentation like shown below index.rst:

Index
=====

.. toctree::

   data/index

data/index.rst

Data collection tools
======================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   World Bank<worldbank.ipynb>
   Google Trends<trends.ipynb>
fccoelho commented 2 years ago

the .ipyb_checkpoints folder should be added to the .gitignore

eduardocorrearaujo commented 2 years ago

Thank you so much for the tips, @fccoelho. Now the code is much better.

About the creation of the unit tests, I have some doubts. I should create a function that every time is run select a random indicator, database, function, and country and so confirm that the function is working?

fccoelho commented 2 years ago

Thank you so much for the tips, @fccoelho. Now the code is much better.

About the creation of the unit tests, I have some doubts. I should create a function that every time is run select a random indicator, database, function, and country and so confirm that the function is working?

The is a nice functionality of pytest that allows us to pass different parameter values to a single unit test that broadens its hability to catch errors. Here is an example in this project.

But often this is not needed if we can write a simple test case that represents a large variety of situations.

fccoelho commented 2 years ago

I think it can be merged, but given the CI fail, I think wbgapi needs to be added with poetry add wbgapi . @eduardocorrearaujo it would be nice to have some test coverage. Try to add at least a couple of tests and we can increase the coverage later.