Open cuducos opened 7 years ago
Hey,
Is someone working on this? If so, let me know!
I think this toolbox should aim to cover all API endpoints from the Senate and the Chamber of Deputies, plus other congress-related data. I know that seems like quite a big dream. But with that in mind, we should think about the structure of the project so it can easily accept new data entries in an organized way.
With a more organized way to add new datasets, other projects could build upon this parsing structure. With more data available, political scientists and journalists can perform better and quicker analyses. Also, more correlation ideas could flourish and turn into apps built on this easily available, structured data.
Following this big dream, I propose one more step to this enhancement proposal:
:)
Hey, @JoaoCarabetta! I'm working on it. I believe that to make this happen we need broad unit-test coverage, so we can quickly understand the impact of each change. To address that, I'm working on increasing the unit-test coverage, because the journey tests take too long to give us fast feedback.
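To illustrate the distinction, here is a hypothetical sketch (the function and its test are illustrative, not the toolbox's real code): a unit test exercises one small piece of logic in memory and runs in milliseconds, unlike a journey test that downloads and processes whole datasets:

```python
# Hypothetical translation helper: maps a few pt_BR column names to en_US.
def translate_to_en(record: dict) -> dict:
    mapping = {"nome": "name", "valor": "value"}
    return {mapping.get(key, key): value for key, value in record.items()}

# A fast, focused unit test: no network, no files, instant feedback.
# A test runner such as pytest would collect this automatically.
def test_translate_to_en():
    assert translate_to_en({"nome": "Ana", "valor": 10}) == {"name": "Ana", "value": 10}

test_translate_to_en()
```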
@lipemorais @JoaoCarabetta are you guys still working on it?!
@trmendes Hell yeah! This week I'm opening a PR to cover the Chamber of Deputies module in #124, plus some other improvements around tests like #134.
Would you like to help us with this?
@lipemorais I would like to help! I'm learning Python, and it's nice to have a project like this one to contribute to.
While working on #199 I stumbled upon this ticket, so I want to post an update and share some considerations.
Regarding "rewrite fetch, translate, clean into more atomic methods, with really simple logic and adding more methods if needed": I am facing the same problem. Those methods are a bit confusing, and each dataset (chamber and senate) has a different implementation. My intention is to make them a little more centralized (maybe with some common classes) so the datasets are downloaded and processed in a more unified way. In my implementation, I'm using asyncio and its sister libraries for parallel processing. See my branch for more information.
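As a rough sketch of the idea (the dataset names and fetch logic below are illustrative; the real implementation would perform HTTP downloads rather than a simulated delay), `asyncio.gather` lets the per-dataset fetches run concurrently instead of one by one:

```python
import asyncio

# Illustrative dataset names, not the toolbox's actual identifiers.
DATASETS = ["chamber_of_deputies", "federal_senate"]

async def fetch_dataset(name: str) -> str:
    # In the real code this would be an HTTP download;
    # asyncio.sleep stands in for the I/O latency.
    await asyncio.sleep(0.01)
    return f"{name}.csv"

async def fetch_all(names):
    # gather schedules all downloads concurrently and
    # returns their results in the original order.
    return await asyncio.gather(*(fetch_dataset(n) for n in names))

files = asyncio.run(fetch_all(DATASETS))
print(files)  # ['chamber_of_deputies.csv', 'federal_senate.csv']
```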
Any feedback is very welcome.
Great start — many thanks, @willianpaixao : ) I added minor comments to your WIP commit, hope they are helpful!
This issue is proposed as a roadmap to a big refactor of the public API. It might also work as a wishlist for those who use this toolbox and believe its API for generating the datasets could be improved. I'll suggest a to-do list in this opening post and try to keep it updated as the discussion below evolves.
The main problems with the current API have been discussed by @lipemorais and myself in several other issues and PRs. For example:

- the `fetch`, `translate` and `clean` methods, when what really happens in these three methods is: fetch data, translate data to en_US, clean data, convert data from `.csv` to `.xz`, and merge datasets by year into a single file (see #53 and #68)

Therefore what I propose here is to:
- rewrite `fetch`, `translate`, `clean` into more atomic methods (reduce side effects), with really simple logic, adding more methods if needed (e.g. `convert_to_xz`, `translate_to_en` etc.)
- create a main method (`generate`) to handle all internal tasks, from downloading the data from the original source to having a dataset ready for the Serenata pipeline (i.e. make all methods from the previous task internal methods used by this main one)

I think that this refactor will enhance our code quality and architecture, and can pave the way to more overarching changes such as:
- moving to `pytest`, or even using `tox`
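A minimal sketch of how the proposed design could look, assuming hypothetical names (`Dataset`, `_fetch`, `_translate_to_en`, `_clean`, `_convert_to_xz` are illustrative, not the toolbox's actual API): the atomic steps become internal methods, and `generate` is the single public entry point that orchestrates them:

```python
from pathlib import Path

class Dataset:
    def __init__(self, name: str):
        self.name = name
        self.steps = []  # records which steps ran, for demonstration only

    def _fetch(self) -> str:
        # would download the raw CSV from the official source
        self.steps.append("fetch")
        return f"{self.name}.csv"

    def _translate_to_en(self, path: str) -> str:
        # would rename pt_BR columns/values to en_US
        self.steps.append("translate_to_en")
        return path

    def _clean(self, path: str) -> str:
        # would normalize types, drop malformed rows, etc.
        self.steps.append("clean")
        return path

    def _convert_to_xz(self, path: str) -> str:
        # would compress the cleaned CSV with LZMA
        self.steps.append("convert_to_xz")
        return str(Path(path).with_suffix(".xz"))

    def generate(self) -> str:
        # the only public method: runs the whole pipeline in order
        path = self._fetch()
        path = self._translate_to_en(path)
        path = self._clean(path)
        return self._convert_to_xz(path)

dataset = Dataset("chamber_of_deputies")
print(dataset.generate())  # chamber_of_deputies.xz
```

Keeping each step small and free of hidden side effects also makes the unit-test coverage discussed above much easier to achieve, since every step can be tested in isolation.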