Open cuducos opened 7 years ago
Hey,
Is someone working on this? If so, let me know!
I think this toolbox should aim to cover all API endpoints from the Senate and the Chamber of Deputies, plus other congress-related data. I know that seems like quite a big dream. But with that in mind, we should think about the structure of the project so it can easily accept new data entries in an organized way.
With a more organized way to add new datasets, other projects could build upon this parsing structure. With more data available, political scientists and journalists can perform better and quicker analyses. Also, more correlation ideas could flourish and turn into apps built on this easily available, structured data.
Following this big dream, I propose one more step to this enhancement proposal:
:)
Hey, @JoaoCarabetta! I'm working on it. I believe that to make this happen we need broad unit-test coverage, so we can quickly understand the impact of each change. To address that, I'm working on increasing the unit-test coverage, because the journey tests take too long to give us fast feedback.
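To illustrate the distinction, here is a hypothetical sketch (the function and its test are illustrative, not the toolbox's real code): a unit test exercises one small piece of logic in memory and runs in milliseconds, unlike a journey test that downloads and processes whole datasets:

```python
# Hypothetical translation helper: maps a few pt_BR column names to en_US.
def translate_to_en(record: dict) -> dict:
    mapping = {"nome": "name", "valor": "value"}
    return {mapping.get(key, key): value for key, value in record.items()}

# A fast, focused unit test: no network, no files, instant feedback.
# A test runner such as pytest would collect this automatically.
def test_translate_to_en():
    assert translate_to_en({"nome": "Ana", "valor": 10}) == {"name": "Ana", "value": 10}

test_translate_to_en()
```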
@lipemorais @JoaoCarabetta are you guys still working on it?!
@trmendes Hell yeah! This week I'm opening a PR to cover the Chamber of Deputies module in #124, plus some other improvements around tests like #134.
Would you like to help us with this?
@lipemorais I would like to help! I'm learning Python, and it's nice to have a project like this one to contribute to.
While working on #199 I stumbled upon this ticket, so I want to post an update and share some considerations.
Regarding "rewrite fetch, translate, clean into more atomic methods, with really simple logic and adding more methods if needed": I am facing the same problem. Those methods are a bit confusing, and each dataset (chamber and senate) has a different implementation. My intention is to make them a little more centralized (maybe with some common classes) so the datasets are downloaded and processed in a more unified way. In my implementation, I'm using asyncio and its sister libraries for parallel processing. See my branch for more information.
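As a rough sketch of the idea (the dataset names and fetch logic below are illustrative; the real implementation would perform HTTP downloads rather than a simulated delay), `asyncio.gather` lets the per-dataset fetches run concurrently instead of one by one:

```python
import asyncio

# Illustrative dataset names, not the toolbox's actual identifiers.
DATASETS = ["chamber_of_deputies", "federal_senate"]

async def fetch_dataset(name: str) -> str:
    # In the real code this would be an HTTP download;
    # asyncio.sleep stands in for the I/O latency.
    await asyncio.sleep(0.01)
    return f"{name}.csv"

async def fetch_all(names):
    # gather schedules all downloads concurrently and
    # returns their results in the original order.
    return await asyncio.gather(*(fetch_dataset(n) for n in names))

files = asyncio.run(fetch_all(DATASETS))
print(files)  # ['chamber_of_deputies.csv', 'federal_senate.csv']
```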
Any feedback is very welcome.
Great start — many thanks, @willianpaixao : ) I added minor comments to your WIP commit, hope they are helpful!
This issue is proposed as a roadmap to a big refactor of the public API. It might also work as a wishlist for those who use this toolbox and believe its API for generating the datasets could be improved. I'll suggest a to-do list in this opening post and try to keep it updated as the discussion below evolves.
The main problems with the current API have been discussed by @lipemorais and myself in several other issues and PRs. For example:

- the `fetch`, `translate` and `clean` methods, when what really happens in these three methods is: fetch data, translate data to en_US, clean data, convert data from `.csv` to `.xz`, and merge datasets by year into a single file (see #53 and #68)

Therefore what I propose here is to:
- rewrite `fetch`, `translate`, `clean` into more atomic methods (reduce side effects), with really simple logic, adding more methods if needed (e.g. `convert_to_xz`, `translate_to_en` etc.)
- create a main method (`generate`) to handle all internal tasks, from downloading the data from the original source to having a dataset ready for the Serenata pipeline (i.e. make all methods from the previous task internal methods used by this main one)

I think that this refactor will enhance our code quality and architecture, and can pave the way to more overarching changes such as:
- moving to `pytest`, or even using `tox`
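A minimal sketch of how the proposed design could look, assuming hypothetical names (`Dataset`, `_fetch`, `_translate_to_en`, `_clean`, `_convert_to_xz` are illustrative, not the toolbox's actual API): the atomic steps become internal methods, and `generate` is the single public entry point that orchestrates them:

```python
from pathlib import Path

class Dataset:
    def __init__(self, name: str):
        self.name = name
        self.steps = []  # records which steps ran, for demonstration only

    def _fetch(self) -> str:
        # would download the raw CSV from the official source
        self.steps.append("fetch")
        return f"{self.name}.csv"

    def _translate_to_en(self, path: str) -> str:
        # would rename pt_BR columns/values to en_US
        self.steps.append("translate_to_en")
        return path

    def _clean(self, path: str) -> str:
        # would normalize types, drop malformed rows, etc.
        self.steps.append("clean")
        return path

    def _convert_to_xz(self, path: str) -> str:
        # would compress the cleaned CSV with LZMA
        self.steps.append("convert_to_xz")
        return str(Path(path).with_suffix(".xz"))

    def generate(self) -> str:
        # the only public method: runs the whole pipeline in order
        path = self._fetch()
        path = self._translate_to_en(path)
        path = self._clean(path)
        return self._convert_to_xz(path)

dataset = Dataset("chamber_of_deputies")
print(dataset.generate())  # chamber_of_deputies.xz
```

Keeping each step small and free of hidden side effects also makes the unit-test coverage discussed above much easier to achieve, since every step can be tested in isolation.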