Closed lipemorais closed 7 years ago
This is in progress
WIP label added, we can take it out whenever just ping us here =)
Hi @lipemorais ! Thank you very much for this refactor, I'll be looking for that closely :)
In my opinion the only thing we have to keep is the way things work, plus some structures to keep everything working in the other places: rosie and serenata-de-amor.
So we need to keep the main methods: `fetch()`, `translate()` and `clean()` with their features. Everything besides that can be changed!
:tada:
> So we need to keep the main methods: `fetch()`, `translate()` and `clean()` with their features. Everything besides that can be changed!
As I've said elsewhere this API is terrible. `translate` is translating labels *and* changing file extensions, `clean` is cleaning data *and* merging things. I definitely think this is a terrible API: it's confusing, it's not semantically meaningful, it's hard to test and hard to maintain.
I think TDD can help us restructure this API, working with streams of data and reducing temporary intermediary files (like `federal_senate-2008.xz`), so we fetch the original files, do all the treatment needed and finally write the results. If memory is an issue this is an awesome opportunity to bring Dask in and get another leverage: a chance to get rid of Rosie's memory issues.
@anaschwendler can you list exactly what we're gonna break if we change the input and output of these methods?
So, @cuducos, hm, it is ok to refactor the whole API, but we may need some time to reflect it in the other repositories. I can help with that, and we need to do it very carefully.
> @anaschwendler can you list exactly what we're gonna break if we change the input and output of these methods?
For the Federal Senate we can work on it, because it is not in production, but for `chamber_of_deputies` it changes its whole structure.
> and get another leverage: a chance to get rid of Rosie's memory issues.
For me it looks perfect, and we can work together on it.
Yay! Next week I can start working on that. If anyone is willing to study how embedded those methods are in the other repos' architecture, that would be awesome. Especially `translate` and `clean`; `fetch` is good so far IMHO.
@lipemorais In light of the last commits, I wonder if it wouldn't be wiser to generate the `.xz` from the `.csv` instead of having to commit them as fixtures?
@jtemporal it would be good, but it would make the coupling between the tests bigger: something like `clean` would only work after `translate` has run. This would make the tests slower and not as unit-like as they should be.
In order to have fewer fixtures I'm trying to make the tests use as few fixtures as possible; I will commit it soon.
@anaschwendler @jtemporal @cuducos I need some help to understand why this test is failing in Travis CI. :( Failed build in Travis
This is my last test for this PR.
The next steps are more nice-to-haves.
```
No such file or directory: 'tests/fixtures/csv/federal-senate-2008.xz'
```
I think it fails because when you instantiate something like `FederalSenateDataset('tests/fixtures/csv/', …)` you won't be able to access any `.xz` (because they're in `tests/fixtures/xz/`).
@cuducos Actually when you call `translate` it generates the file with the `.xz` extension. If you run it locally you will see it.
> Actually when you call `translate` it generates the file with the `.xz` extension. If you run it locally you will see it.
I know. But in which folder does it generate the `.xz`: `tests/fixtures/csv/` or `tests/fixtures/xz/`? That's my point.
It generates the `.xz` file in the csv folder: `tests/fixtures/csv/`.
> It generates the `.xz` file in the csv folder: `tests/fixtures/csv/`.
Hum… confusing, but I got it. I have the impression we could lean heavily on mocks to avoid that mess in terms of file system: unit tests to test small atomic stuff, and one integration test that would do everything from scratch…
> Hum… confusing, but I got it.
Yeah, it's confusing. For the first step of this refactoring I'm trying to isolate the network tests; a next step would be to do the same with the file handling tests.
I'm worried about the "works on my machine" problem: I can't find a simple way to reproduce what is happening in Travis CI. It will be necessary sooner or later, so it would be good to be able to reproduce it.
I do like the idea of using mocks, it will make the tests better. Could you talk a little more about it?
> Could you talk a little more about it?
Some examples:

- `fetch`: we don't need to actually run `urlretrieve`; we assume `urlretrieve` just works (Guido must have tested it hahaha) and must assert it was called x times with the proper arguments. So we mock it and check for these calls, not for files written in the file system.
- `translate`: we test `__translate_file` with a single unit test, and then in the `translate` method what is relevant is to know if `self.__translate_file` was called x times with the proper arguments.
- `__translate_file`: we mock `read_csv`, setting a pandas DataFrame we build in tests (in memory, not as a file) as its `return_value`, and check if its keys/headers are actually translated.

This way we don't need fixtures at all for unit tests. And for integration tests we actually download data and check if the journey succeeds from scratch.
Does that make sense?
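To make the first bullet concrete, here is a minimal sketch of mocking `urlretrieve` in a unit test. The class below is a simplified stand-in, not the toolbox's real code: the URL, constructor signature and path handling are assumptions for illustration only.

```python
from unittest import mock


class FederalSenateDataset:
    # Hypothetical, simplified stand-in for the toolbox's class; the real
    # URL template and constructor signature may differ.
    URL = 'http://legis.senado.gov.br/dadosabertos/federal-senate-{}.csv'

    def __init__(self, path, first_year=2008, last_year=2010):
        self.path = path
        self.years = range(first_year, last_year + 1)

    def fetch(self):
        # imported at call time, so a patched urllib.request.urlretrieve
        # is picked up by the test below
        from urllib.request import urlretrieve
        for year in self.years:
            urlretrieve(self.URL.format(year),
                        '{}federal-senate-{}.csv'.format(self.path, year))


# The "unit test": patch urlretrieve so there is no network access and no
# file written to disk; we only assert it was called once per year with
# the proper arguments.
with mock.patch('urllib.request.urlretrieve') as urlretrieve:
    dataset = FederalSenateDataset('tests/fixtures/csv/')
    dataset.fetch()

assert urlretrieve.call_count == 3
assert urlretrieve.call_args_list[0][0][1] == 'tests/fixtures/csv/federal-senate-2008.csv'
```

The same pattern applies to `read_csv`: patch it, set a small in-memory DataFrame as its `return_value`, and assert on the translated headers.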
Also a minor comment on paths: try to use `os.path.join` to build path strings and avoid breaking things on Windows ;)
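jtemporal's tip in practice, using one of the fixture paths from this thread as the example:

```python
import os

# Build paths with os.path.join instead of concatenating strings with '/',
# so the correct separator is used on every OS (including '\\' on Windows).
fixtures = os.path.join('tests', 'fixtures', 'csv')
filename = os.path.join(fixtures, 'federal-senate-2008.csv')
```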
> Does that make sense?

@cuducos I do agree with everything you just said.
> fetch: we don't need to actually run urlretrieve, we assume urlretrieve just works (Guido must have tested it hahaha) and must assert it was called x times with the proper arguments — so we mocked it and check for these calls, not for files written in the file system
Actually this is done in `tests/test_federal_senate_dataset.py:18`
> translate: we test __translate_file with a single unit test and then in the translate method what is relevant is to know if self.__translate_file was called x times with the proper arguments
This is a great idea, but now that the translate test is working I think this can wait a little.
> __translate_file: we mock read_csv setting a pandas DataFrame we build in tests (in memory, not as a file) as a return_value and check if its keys/headers are actually translated
Agree with it. We can create an issue just to refactor translate. What do you think?
> This way we don't need fixtures at all for unit tests. And for integration tests we actually download data and check if the journey succeeds from scratch.
With fewer fixtures we will have tests running even faster, but since at the beginning it took 8:30 minutes and now we can run it in 6:32 seconds, I would like to merge this PR and leave more improvements for a next one.
@jtemporal Thank you for your feedback. Where could I apply `os.path.join` in this PR to keep this code Windows friendly?
To finish the mystery: Travis was right, I made a mistake during a refactor when I changed a variable name. Here is the commit that makes it work again. Travis build: IT WORKS!!!
`os.path.join` FTW ;)
@jtemporal Thank you for the feedback, I made the changes you suggested and would like to see if it's ok for you now.
@lipemorais @cuducos @jtemporal hello everybody! I'm totally new here! @lipemorais can you point out to me exactly where the journey test that takes around 6/7 minutes to run is?
Hey, @gustavo-freitas ! I'm glad to have you here.
This is the journey test
To make it run, use the command below. Obs.: Python 3 is expected and some dependencies may be needed to run this test.

```
python -m unittest tests/journey/test_federal_senate_dataset.py
```
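A minimal sketch of how a slow journey test like this can be kept opt-in with `unittest.skipUnless`, matching the `skipped=1` seen in the test output. The exact environment variable name (`RUN_INTEGRATION_TESTS` here) and class name are assumptions based on the PR description:

```python
import os
import unittest


# Skip the expensive journey test unless explicitly requested via an
# environment variable; this keeps the default `unittest discover` fast.
@unittest.skipUnless(os.environ.get('RUN_INTEGRATION_TESTS') == '1',
                     'set RUN_INTEGRATION_TESTS=1 to run the journey test')
class TestFederalSenateJourney(unittest.TestCase):

    def test_journey(self):
        # the real test would fetch, translate and clean actual datasets
        pass
```

Running the suite without the variable set reports this test as skipped rather than failing it.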
@jtemporal I made the fixes you suggested and opened an issue to make it work for the whole project too. Thank you for your feedback.
Ok… for me just one last comment: we don't have fixtures in the `tests/fixtures` directory. We actually have full datasets there. Due to the strategy adopted in this PR I understand we need these files there. But this is bizarre:
```
$ du -h -d 1
 15M    ./.git
 44K    ./docs
280K    ./htmlcov
256K    ./serenata_toolbox
 20K    ./serenata_toolbox.egg-info
 51M    ./tests
 67M    .
```
What I mean is: the whole thing excluding `.git/` has 52Mb. 98% of the repo is in the `tests/fixtures/` directory:
```
$ du -h -d 1 tests/fixtures
 47M    tests/fixtures/csv
3.9M    tests/fixtures/xz
 51M    tests/fixtures
```
IMHO we can be more concise when it comes to fixtures. We can:
This seems enough for testing purposes… what do you think @jtemporal?
Sorry… that wasn't my last comment. The test suite is leaving a residual file behind, we need to fix that before merging:
```
~/serenata-toolbox(lipemorais-master) » git status
On branch lipemorais-master
nothing to commit, working tree clean
~/serenata-toolbox(lipemorais-master) » python -m unittest discover tests
s..........................Could not find config.ini file.
You need Amazon section in it to interact with S3
(Check config.ini.example if you need a reference.)
.You need an Amazon section in config.ini to interact with S3 (Check config.ini.example if you need a reference.)
.Could not find config.ini file.
You need Amazon section in it to interact with S3
(Check config.ini.example if you need a reference.)
.Could not find config.ini file.
You need Amazon section in it to interact with S3
(Check config.ini.example if you need a reference.)
.....It looks like you have an old version of the config.ini file. We do not need anymore the service (s3) appended to the region (sa-east-1). Please update your config.ini replacing regions like `s3-sa-east-1` by `sa-east-1`.
.....While translating Seranata Toolbox not found file: tests/fixtures/csv/federal-senate-2007.csv
File b'tests/fixtures/csv/federal-senate-2007.csv' does not exist
..While fetching Seranata Toolbox not found file: /var/folders/lr/gd0yq9m90rv_gz682gqt7ylc0000gn/T/federal-senate-2007.csv
HTTP Error 404: Not Found
..Loading receipts.csv…
Dropping rows without document_value or reimbursement_number…
Grouping dataset by applicant_id, document_id and year…
Gathering all reimbursement numbers together…
Summing all net values together…
Summing all reimbursement values together…
Generating the new dataset…
.
----------------------------------------------------------------------
Ran 45 tests in 5.566s

OK (skipped=1)
~/serenata-toolbox(lipemorais-master*) » git status
On branch lipemorais-master
Untracked files:
  (use "git add <file>..." to include in what will be committed)

        tests/fixtures/csv/federal-senate-2008.xz

nothing added to commit but untracked files present (use "git add" to track)
~/serenata-toolbox(lipemorais-master*) »
```
In sum:
In the aforementioned branch I might have made things worse. Two residual changes after testing:
```
On branch cuducos-lipedemorais-master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   tests/fixtures/xz/federal-senate-reimbursements.xz

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        tests/fixtures/csv/federal-senate-2008.xz

no changes added to commit (use "git add" and/or "git commit -a")
```
In 704f675 you removed some fixtures… are you shrinking the remaining ones to a couple of lines? The ones from my branch have 2 reimbursements each, pretty compact — feel free to grab them ; )
> edit fixtures to have 2 lines each, for example (or more only if we have a plausible reason to add more)
I agree with @cuducos on this. If you want to test all of the years, reducing the fixtures to the minimum number of lines in each would be the way to go IMHO.
@jtemporal @cuducos I completely agree. I removed some datasets that are not necessary anymore, but there is still room for improvement. I'm planning to fix this soon.
What's really missing is to address the residual files… this is still an issue, ok?
OK. I'm working on it.
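One common way to address the residual-file issue above is to register generated files for removal with `unittest`'s `addCleanup`, which runs even when the test fails. This is a sketch, not the toolbox's actual fix: the write below merely stands in for whatever `translate()` generates, and a temp directory is used instead of `tests/fixtures/csv/`.

```python
import os
import tempfile
import unittest


class TranslateTest(unittest.TestCase):

    def test_translate_generates_xz(self):
        # filename mirrors the residual file reported in the thread
        generated = os.path.join(tempfile.gettempdir(), 'federal-senate-2008.xz')
        # register removal up front, so the file is deleted even on failure
        self.addCleanup(lambda: os.path.exists(generated) and os.remove(generated))

        with open(generated, 'wb') as handler:  # stand-in for translate()
            handler.write(b'fake .xz contents')

        self.assertTrue(os.path.exists(generated))
```

After the suite runs, `git status` stays clean because the cleanup removed the generated file.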
I addressed the naming convention here: 0dfb560cae880b836d15b926712e1640747f4390
> In 704f675 you removed some fixtures… are you shrinking the remaining ones to a couple of lines? The ones from my branch have 2 reimbursements each, pretty compact — feel free to grab them ;)
I will grab just the idea, because the encoding is different. ;)
@cuducos and @jtemporal I tuned the fixtures as much as I could to make them very fast, eliminating generated files after the tests that needed them.
Do you see any space for improvements?
I believe now it is only missing a version bump as @anaschwendler kindly reminded me earlier =)
Hell yeah! Many thanks @lipemorais for these improvements, and @jtemporal and @anaschwendler for the support and code reviews 🎉
> I believe now it is only missing a version bump as @anaschwendler kindly reminded me earlier =)
Done.
This is in progress, I just created it earlier to get early feedback.
What is the purpose of this Pull Request?
The purpose of this PR is to make some improvements to the tests and code of federal_senate_dataset.
What was done to achieve this purpose?
I created a new kind of test called a journey test to test the real workflow, even though it takes more time. I would like to receive some feedback to check if it's good for the project or not. The first time I executed it, it took more than 8 minutes; now it takes almost 5:30 minutes. See the output below:
I also added a mocked unit test to improve the feedback cycle.
How to test if it really works?
To run the journey test you can use the code below. _Do not forget to export the env var RUN_INTEGRATIONTESTS with value 1 so this test is not skipped._
To run the unit test use this:
Who can help reviewing it?
@anaschwendler is the person with the most knowledge about this code; her feedback will be very helpful. Other people that can help are @Irio, @jtemporal and @cuducos. If you are not one of them, feel free to provide me some feedback too.
TODO
Nice To Have
`fetch, translate, clean` to `fetch, translate, clean, convert, merge`