okfn-brasil / serenata-toolbox

📦 pip module containing code shared across Serenata de Amor's projects | ** This repository does not receive frequent updates **
MIT License

Refactoring federal senate dataset #68

Closed · lipemorais closed 7 years ago

lipemorais commented 7 years ago

This is in progress; I created it early to get feedback sooner.

What is the purpose of this Pull Request?

The purpose of this PR is to make some improvements to the tests and code of federal_senate_dataset.

What was done to achieve this purpose?

I created a new kind of test, called a journey test, to exercise the real workflow even though it takes more time. I would like some feedback on whether it is a good fit for the project. The first time I executed it, it took more than 8 minutes; now it takes almost 5:30. See the output below:

(.toolbox-env) ➜  fork-toolbox git:(master) ✗ python -m unittest tests/journey/test_federal_senate_dataset.py
.
----------------------------------------------------------------------
Ran 1 test in 338.959s

OK
(.toolbox-env) ➜  fork-toolbox git:(master) ✗

I also added a mocked unit test to improve the feedback cycle.

How to test if it really works?

To run the journey test you can use the command below. _Do not forget to export the env var RUN_INTEGRATION_TESTS with value 1 so this test is not skipped._

python -m unittest tests/journey/test_federal_senate_dataset.py
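
For example, in a POSIX shell (assuming the environment variable name above):

RUN_INTEGRATION_TESTS=1 python -m unittest tests/journey/test_federal_senate_dataset.py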

To run the unit test use this:

python -m unittest tests/test_federal_senate_dataset.py

Who can help reviewing it?

@anaschwendler is the person with the most knowledge about this code, so her feedback will be very helpful. Other people that can help are @Irio, @jtemporal and @cuducos. If you are not one of them, feel free to give me feedback as well.

TODO

Nice To Have

jtemporal commented 7 years ago

This is in progress

WIP label added; we can take it out whenever, just ping us here =)

anaschwendler commented 7 years ago

Hi @lipemorais! Thank you very much for this refactor, I'll be watching it closely :)

In my opinion, the only thing we have to keep is the way things work and some structures, so that everything keeps working in the other places: rosie and serenata-de-amor.

So we need to keep the main methods, fetch(), translate() and clean(), with their current behavior. Everything besides that can be changed!
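
For reference, a rough sketch of that contract (signatures and docstrings here are illustrative, not the toolbox's exact ones):

class FederalSenateDataset:
    def __init__(self, path):
        self.path = path  # directory where datasets are read from and written to

    def fetch(self):
        """Download the original CSV files from the Federal Senate."""

    def translate(self):
        """Translate labels to English and save the translated files."""

    def clean(self):
        """Clean the data and merge everything into a single dataset."""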

🎉

cuducos commented 7 years ago

So we need to keep the main methods, fetch(), translate() and clean(), with their current behavior. Everything besides that can be changed!

As I've said elsewhere, this API is terrible: translate is translating labels and also changing file extensions, clean is cleaning data and also merging things. It's confusing, not semantically meaningful, hard to test and hard to maintain.

I think TDD can help us restructure this API, working with streams of data and reducing temporary intermediary files (like federal_senate-2008.xz), so we fetch the original files, do all the treatment needed and finally write the results. If memory is an issue, this is an awesome opportunity to bring Dask in and gain another advantage: a chance to get rid of Rosie's memory issues.
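
Something like this hypothetical Dask sketch (the file patterns and the rename mapping are made up for illustration):

import dask.dataframe as dd

# Read every original per-year CSV lazily, as one logical dataframe
reimbursements = dd.read_csv('federal-senate-*.csv')

# Apply the whole treatment without materializing intermediary files
reimbursements = reimbursements.rename(columns={'ANO': 'year'})

# Only now write the result; Dask streams it in chunks, so memory stays bounded
reimbursements.to_csv('federal-senate-reimbursements-*.csv')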

@anaschwendler can you list exactly what we're gonna break if we change the input and output of these methods?

anaschwendler commented 7 years ago

So, @cuducos, hm, it is OK to refactor the whole API, but we may need some time to reflect it in the other repositories; I can help with that, and we need to do it very carefully.

@anaschwendler can you list exactly what we're gonna break if we change the input and output of these methods?

For the Federal Senate we can work on it, because it is not in production, but for chamber_of_deputies it changes its whole structure.

and gain another advantage: a chance to get rid of Rosie's memory issues.

For me it looks perfect, and we can work together on it.

cuducos commented 7 years ago

Yay! Next week I can start working on that. If anyone is willing to study how embedded those methods are in the other repos' architecture, that would be awesome. Especially translate and clean — fetch is good so far IMHO.

coveralls commented 7 years ago

Coverage Status

Changes Unknown when pulling ae7ff4ba299157e4aa9aa1f617268bf75a7ba214 on lipemorais:master into on datasciencebr:master.

coveralls commented 7 years ago

Coverage Status

Changes Unknown when pulling 60800ad9c0560a1b70288268ba55940deee4d987 on lipemorais:master into on datasciencebr:master.

coveralls commented 7 years ago

Coverage Status

Changes Unknown when pulling fa881825285143e27e74ed91172f0365afa1f925 on lipemorais:master into on datasciencebr:master.

coveralls commented 7 years ago

Coverage Status

Changes Unknown when pulling c618acafd7b682098d2800272acafc727ea0d87d on lipemorais:master into on datasciencebr:master.

coveralls commented 7 years ago

Coverage Status

Changes Unknown when pulling 1a89abd4c4067c7ff9c2690b80b5f39911c0da35 on lipemorais:master into on datasciencebr:master.

coveralls commented 7 years ago

Coverage Status

Changes Unknown when pulling f182b3e15a311ca249843d663a9bbb6edea3a394 on lipemorais:master into on datasciencebr:master.

coveralls commented 7 years ago

Coverage Status

Changes Unknown when pulling a6e5b0797c1cd59abed063a96886d7f9ab0005f4 on lipemorais:master into on datasciencebr:master.

coveralls commented 7 years ago

Coverage Status

Changes Unknown when pulling c8f8a51e389decae3bea0f8a298719607181668d on lipemorais:master into on datasciencebr:master.

jtemporal commented 7 years ago

@lipemorais In light of the last commits, I wonder if it wouldn't be wiser to generate the .xz files from the .csv ones instead of having to commit them as fixtures?
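
Something like this, for instance (a sketch; the path follows this PR's fixtures layout):

import pandas as pd

# Build the compressed fixture on the fly instead of committing the .xz
csv_path = 'tests/fixtures/csv/federal-senate-2008.csv'
pd.read_csv(csv_path).to_csv(csv_path.replace('.csv', '.xz'), compression='xz', index=False)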

lipemorais commented 7 years ago

@jtemporal it would be good, but it would increase the coupling between the tests: something like clean would only work after translate has run. This would make the tests slower and not as unit-like as they should be.

In order to have fewer fixtures, I'm trying to make the tests use as few fixtures as possible; I will commit it soon.

coveralls commented 7 years ago

Coverage Status

Changes Unknown when pulling e4cac56457daa9253cb7e3e828229852dd98afc5 on lipemorais:master into on datasciencebr:master.

lipemorais commented 7 years ago

@anaschwendler @jtemporal @cuducos I need some help to understand why this test is failing in Travis CI. :( Failed build in Travis

This is my last test for this PR.

The next steps are more nice-to-haves.

coveralls commented 7 years ago

Coverage Status

Changes Unknown when pulling ba106a9844f5b79251ee76074cc76bf4c3699859 on lipemorais:master into on datasciencebr:master.

cuducos commented 7 years ago

No such file or directory: 'tests/fixtures/csv/federal-senate-2008.xz'

I think it fails because when you instantiate something like FederalSenateDataset('tests/fixtures/csv/', …) you won't be able to access any .xz (because they're in tests/fixtures/xz/).

lipemorais commented 7 years ago

@cuducos Actually, when you call translate it generates the file with the .xz extension. If you run it locally you will see it.

cuducos commented 7 years ago

@cuducos Actually, when you call translate it generates the file with the .xz extension. If you run it locally you will see it.

I know. But in which folder does it generate the .xz: tests/fixtures/csv/ or tests/fixtures/xz/? That's my point.

lipemorais commented 7 years ago

It generates the .xz file in the csv folder: tests/fixtures/csv/.

cuducos commented 7 years ago

It generates the .xz file in the csv folder: tests/fixtures/csv/.

Hum… confusing, but I got it. I have the impression we could lean heavily on mocks to avoid that mess in the file system: unit tests to test small atomic stuff, and one integration test that would do everything from scratch…

coveralls commented 7 years ago

Coverage Status

Changes Unknown when pulling a108c807abb54fb1575482bdf4501acb1a9fcee8 on lipemorais:master into on datasciencebr:master.

lipemorais commented 7 years ago

Hum… confusing, but I got it.

Yeah, it's confusing. For the first step of this refactoring I'm trying to isolate network tests; a next step would be to do the same with file-handling tests.

I'm worried about the "works on my machine" problem: I can't find a simple way to reproduce what is happening in Travis CI. It will be necessary sooner or later, so it would be good to be able to reproduce it.

I do like the idea of using mocks, it will make the tests better. Could you talk a little more about it?

cuducos commented 7 years ago

Could you talk a little more about it?

Some examples:

fetch: we don't need to actually run urlretrieve, we assume urlretrieve just works (Guido must have tested it hahaha) and must assert it was called x times with the proper arguments — so we mock it and check for these calls, not for files written in the file system

translate: we test __translate_file with a single unit test and then in the translate method what is relevant is to know if self.__translate_file was called x times with the proper arguments

__translate_file: we mock read_csv setting a pandas DataFrame we build in tests (in memory, not as a file) as a return_value and check if its keys/headers are actually translated

This way we don't need fixtures at all for unit tests. And for integration tests we actually download data and check that the journey succeeds from scratch.
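
For instance, the fetch idea could look like this (the module path and the years attribute are hypothetical names, only to illustrate the technique):

from unittest import TestCase
from unittest.mock import patch

class TestFetch(TestCase):
    @patch('serenata_toolbox.federal_senate_dataset.urlretrieve')  # hypothetical module path
    def test_fetch_calls_urlretrieve_once_per_year(self, urlretrieve):
        from serenata_toolbox.federal_senate_dataset import FederalSenateDataset
        dataset = FederalSenateDataset('/tmp/')
        dataset.fetch()
        # assert on the calls, not on files written to the file system
        self.assertEqual(len(dataset.years), urlretrieve.call_count)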

Does that make sense?

jtemporal commented 7 years ago

Also a minor comment on paths: try to use os.path.join to build path strings and avoid breaking things on Windows ;)
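
For example:

import os

# instead of 'tests/fixtures/csv/federal-senate-2008.csv', which hardcodes '/'
path = os.path.join('tests', 'fixtures', 'csv', 'federal-senate-2008.csv')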

coveralls commented 7 years ago

Coverage Status

Changes Unknown when pulling 63292b8188cf318eb603a1b2b3d8cf09502d3515 on lipemorais:master into on datasciencebr:master.

lipemorais commented 7 years ago

Does that make sense?

@cuducos I agree with everything you just said.

fetch: we don't need to actually run urlretrieve, we assume urlretrieve just works (Guido must have tested it hahaha) and must assert it was called x times with the proper arguments — so we mock it and check for these calls, not for files written in the file system

Actually this is done in tests/test_federal_senate_dataset.py:18

translate: we test __translate_file with a single unit test and then in the translate method what is relevant is to know if self.__translate_file was called x times with the proper arguments

This is a great idea, but now that the translate test is working I think this can wait a little.

__translate_file: we mock read_csv setting a pandas DataFrame we build in tests (in memory, not as a file) as a return_value and check if its keys/headers are actually translated

Agreed. We can create an issue just to refactor translate. What do you think?
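
A sketch of that read_csv mock (the header and translation table are made up):

import pandas as pd
from unittest.mock import patch

translations = {'ANO': 'year'}  # hypothetical translation table

with patch('pandas.read_csv') as read_csv:
    # in-memory DataFrame instead of a fixture file on disk
    read_csv.return_value = pd.DataFrame({'ANO': [2008]})
    dataframe = pd.read_csv('any/path.csv')  # returns the mock's DataFrame
    translated = dataframe.rename(columns=translations)

assert list(translated.columns) == ['year']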

This way we don't need fixtures at all for unit tests. And for integration tests we actually download data and check that the journey succeeds from scratch.

With fewer fixtures the tests will run even faster, but since it took 8:30 at the beginning and now runs in 6:32, I would like to merge this PR and leave further improvements for a next one.

@jtemporal Thank you for your feedback. Where could I apply os.path.join in this PR to keep this code Windows-friendly?

To finish the mystery: Travis was right. I made a mistake during a refactor when I changed a variable name. Here is the commit that makes it work again. Travis build: IT WORKS!!!

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.8%) to 25.818% when pulling e3b5e06519ceaeca26095f5a05d3f4d3e7d77290 on lipemorais:master into ae36f97f40c90588e4ea160ed004e5032f7bcde0 on datasciencebr:master.

jtemporal commented 7 years ago

If you run the tests on Windows, the ones that have hardcoded paths will fail =)

Where could I apply os.path.join in this PR to keep this code Windows-friendly?

so, wherever there is a path ;)

cuducos commented 7 years ago

os.path.join FTW ; )

lipemorais commented 7 years ago

@jtemporal Thank you for the feedback, I made the changes you suggested and would like to see if it's OK for you now.

gustavo-freitas commented 7 years ago

@lipemorais @cuducos @jtemporal hello everybody! I'm totally new here! @lipemorais can you point me to where exactly the journey test that takes around 6/7 minutes to run is?

lipemorais commented 7 years ago

Hey, @gustavo-freitas ! I'm glad to have you here.

This is the journey test

To run it, use the command below. Note: Python 3 is expected, and some dependencies may be needed to run this test.

python -m unittest tests/journey/test_federal_senate_dataset.py

lipemorais commented 7 years ago

@jtemporal I made the fixes you suggested and opened an issue to make it work for the whole project too. Thank you for your feedback.

cuducos commented 7 years ago

Ok… just one last comment from me: we don't have fixtures in the tests/fixtures directory, we actually have full datasets there. Due to the strategy adopted in this PR I understand we need these files there. But this is bizarre:

$ du -h -d 1
 15M    ./.git
 44K    ./docs
280K    ./htmlcov
256K    ./serenata_toolbox
 20K    ./serenata_toolbox.egg-info
 51M    ./tests
 67M    .

What I mean is: the whole thing excluding .git/ is 52 MB, and 98% of that is in the tests/fixtures/ directory:

$ du -h -d 1 tests/fixtures
 47M    tests/fixtures/csv
3.9M    tests/fixtures/xz
 51M    tests/fixtures

IMHO we can be more concise when it comes to fixtures. We can:

  1. edit fixtures to have 2 lines each, for example (or more only if we have a plausible reason to add more)
  2. edit test routines to need only data from 2 years (e.g. 2008 and 2009)

This seems enough for testing purposes… what do you think, @jtemporal?
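
For (1), a quick one-off to shrink a fixture to its header plus two rows could be (path illustrative):

import pandas as pd

path = 'tests/fixtures/csv/federal-senate-2008.csv'
pd.read_csv(path).head(2).to_csv(path, index=False)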

cuducos commented 7 years ago

Sorry… that wasn't my last comment. The test suite is leaving a residual file behind; we need to fix that before merging:

~/serenata-toolbox(lipemorais-master) » git status                                                                       
On branch lipemorais-master
nothing to commit, working tree clean

~/serenata-toolbox(lipemorais-master) » python -m unittest discover tests                                                
s..........................Could not find config.ini file.
You need Amazon section in it to interact with S3
(Check config.ini.example if you need a reference.)
.You need an Amazon section in config.ini to interact with S3 (Check config.ini.example if you need a reference.)
.Could not find config.ini file.
You need Amazon section in it to interact with S3
(Check config.ini.example if you need a reference.)
.Could not find config.ini file.
You need Amazon section in it to interact with S3
(Check config.ini.example if you need a reference.)
.....It looks like you have an old version of the config.ini file. We do not need anymore the service (s3) appended to the region (sa-east-1). Please update your config.ini replacing regions like `s3-sa-east-1` by `sa-east-1`.
.....While translating Seranata Toolbox not found file: tests/fixtures/csv/federal-senate-2007.csv
File b'tests/fixtures/csv/federal-senate-2007.csv' does not exist
..While fetching Seranata Toolbox not found file: /var/folders/lr/gd0yq9m90rv_gz682gqt7ylc0000gn/T/federal-senate-2007.csv
HTTP Error 404: Not Found
..Loading receipts.csv…
Dropping rows without document_value or reimbursement_number…
Grouping dataset by applicant_id, document_id and year…
Gathering all reimbursement numbers together…
Summing all net values together…
Summing all reimbursement values together…
Generating the new dataset…
.
----------------------------------------------------------------------
Ran 45 tests in 5.566s

OK (skipped=1)

~/serenata-toolbox(lipemorais-master*) » git status                                                                      
On branch lipemorais-master
Untracked files:
  (use "git add <file>..." to include in what will be committed)

    tests/fixtures/csv/federal-senate-2008.xz

nothing added to commit but untracked files present (use "git add" to track)

~/serenata-toolbox(lipemorais-master*) »
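
One way to avoid that residue is removing generated files in tearDown (a sketch; the test case name is illustrative, the path is the one reported above):

import os
from unittest import TestCase

class TestTranslate(TestCase):
    def tearDown(self):
        # remove the .xz that translate generates during the test
        path = os.path.join('tests', 'fixtures', 'csv', 'federal-senate-2008.xz')
        if os.path.exists(path):
            os.remove(path)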

cuducos commented 7 years ago

In sum:

cuducos commented 7 years ago

In the aforementioned branch I might have made things worse. Two residual changes after testing:

On branch cuducos-lipedemorais-master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   tests/fixtures/xz/federal-senate-reimbursements.xz

Untracked files:
  (use "git add <file>..." to include in what will be committed)

    tests/fixtures/csv/federal-senate-2008.xz

no changes added to commit (use "git add" and/or "git commit -a")

cuducos commented 7 years ago

In 704f675 you removed some fixtures… are you shrinking the remaining ones to a couple of lines? The ones from my branch have 2 reimbursements each, pretty compact — feel free to grab them ; )

jtemporal commented 7 years ago

edit fixtures to have 2 lines each, for example (or more only if we have a plausible reason to add more)

I agree with @cuducos on this. If you want to test all of the years, reducing the fixtures to the minimum number of lines in each would be the way to go IMHO.

lipemorais commented 7 years ago

@jtemporal @cuducos I completely agree. I removed some datasets that are not necessary anymore, but there is room for improvement. I'm planning to fix this soon.

lipemorais commented 7 years ago

What's really missing is to address the residual files… this is still an issue, ok?

OK, I'm working on it.

I addressed the naming convention here: 0dfb560cae880b836d15b926712e1640747f4390

In 704f675 you removed some fixtures… are you shrinking the remaining ones to a couple of lines? The ones from my branch have 2 reimbursements each, pretty compact — feel free to grab them ; )

I will grab just the idea, because the encoding is different. ;)

lipemorais commented 7 years ago

@cuducos and @jtemporal I tuned the fixtures as much as I could to make them very fast, eliminating generated files after the tests that needed them.

Do you see any room for improvement?

jtemporal commented 7 years ago

I believe now it is only missing a version bump as @anaschwendler kindly reminded me earlier =)

cuducos commented 7 years ago

Hell yeah! Many thanks @lipemorais for the improvements, and @jtemporal and @anaschwendler for the support and code reviews 🎉

cuducos commented 7 years ago

I believe now it is only missing a version bump as @anaschwendler kindly reminded me earlier =)

Done.