okfn-brasil / serenata-toolbox

📦 pip module containing code shared across Serenata de Amor's projects | ** This repository does not receive frequent updates **
MIT License

Changing reimbursement data collection to conform new version #36

Closed luizcavalcanti closed 7 years ago

luizcavalcanti commented 7 years ago

The Brazilian Congress made available a new version of the reimbursement dataset, which can now be downloaded directly in CSV format and separated by year. This somewhat simplifies the current fetching routine, but it also demands a lot of small changes in the way the dataset is acquired, translated, loaded, and merged.

In this rather large commit, the unit tests were also modified to conform to the new way of fetching, and the "convert to csv" test was removed, since this operation is no longer needed. The convert_to_csv() method from the CEAPDataset class was kept, though, so we don't break rosie until it removes the call to this method.

Another quite drastic change was made to the test cases' names. A number was added to them so we can make sure a proper order is followed during integration tests. I strongly believe there is a better way of doing this; I can work on it in a near-future commit.
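For context: unittest discovers test methods in alphabetical order, so numeric prefixes enforce a sequence. A minimal illustration of the trick (the class and method names here are hypothetical, not the actual suite):

```python
from unittest import TestCase

# unittest sorts test method names alphabetically, so the numeric
# prefixes force fetch -> translate -> merge to run in that order.
class TestDatasetPipeline(TestCase):

    def test_01_fetch(self):
        pass  # download step runs first

    def test_02_translate(self):
        pass  # then translation

    def test_03_merge(self):
        pass  # finally merging
```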

Signed-off-by: Luiz Carlos Cavalcanti cavalcanti.luiz@gmail.com

luizcavalcanti commented 7 years ago

This change is part of something I was discussing with @cuducos on the project's Telegram channel. The idea, as explained in the commit message, is to conform to the new data format for reimbursements. Besides the obvious XML->CSV change, there were a few small annoyances that had to be corrected to make it work. The data is not perfect: some entries have more columns than originally defined (especially in the 2009 data), and I have no idea if Congress's IT people will ever correct it, so a couple of pandas parameters were used to make the data loading process more lenient.

cuducos commented 7 years ago

Ok… I just took a look (haven't run any code yet, just read through it). Right now I have two main concerns:

Just drafting, but anyway…

```python
class TestFetchReimbursements(TestCase):

    def test_fetch_reimbursements(self):
        # let's create a temp dir for the sake of these tests
        …
        # now let's make sure (aka assert) the dir has no dataset
        …
        # now we download the datasets and assert they were properly saved
        …
        # assert column names are in pt-br
        # translate and assert they are in en-us
        …
        # let's group the reimbursements (assert it, and that files by year are gone)
        …
```
luizcavalcanti commented 7 years ago

Sure.

About the changes in data structures, only some column names had changed to camel case (some didn't). Since we were performing case-sensitive filtering on those, I updated them. I didn't notice any difference in categories (except that they corrected an extra blank space in one of them) or their IDs. The biggest weirdness I faced was some lines in the 2009 data that had 31 columns instead of 29 (an error_bad_lines=False was added to load_csv() to discard those lines).
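The lenient loading described above can be sketched like this (a minimal illustration, not the toolbox's actual load_csv(); note that recent pandas versions renamed error_bad_lines=False to on_bad_lines='skip'):

```python
import io

import pandas as pd

# A tiny CSV mimicking the 2009 glitch: the second data row has more
# fields than the header declares.
raw = "a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n"

# Skip rows whose field count doesn't match the header instead of
# raising a parser error (older pandas spelling: error_bad_lines=False).
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
```

Only the two well-formed rows survive; the 5-field row is silently dropped, which matches the "discard those lines" behavior described above.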

Regarding the tests, I completely agree it's a little wrong (:P) the way it is. I just didn't feel it was something to amend in this specific commit, but I'd say we should do it in this PR or in another one, soon enough.

jtemporal commented 7 years ago

As requested, WIP label removed ;)

luizcavalcanti commented 7 years ago

Can you guys please put it under the "Work in Progress" label again? Congress changed the data structure, apparently. Tests are breaking again, and several years have missing or changed categories (subquota description). I'll try to understand what's going on, whether it's temporary, and try to make it more robust to those glitches/changes.

cuducos commented 7 years ago

Can you guys please put it under the "Work in Progress" label again?

Done!

luizcavalcanti commented 7 years ago

There was a change in how subquota descriptions are translated. I don't know how Pythonic it is, but it's fast enough and, most importantly, it does not break if something changes in the backend (descriptions have changed and new subquotas have been added in the past, and probably will be again).
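A sketch of what such a lenient translation might look like (the mapping entries and function name here are hypothetical, not the toolbox's actual table):

```python
# Hypothetical lookup table from pt-br subquota descriptions to en-us.
SUBQUOTA_TRANSLATIONS = {
    'PASSAGENS AÉREAS': 'Flight tickets',
    'COMBUSTÍVEIS E LUBRIFICANTES.': 'Fuel and lubricants',
}

def translate_subquota(description):
    # Fall back to the original description when the backend introduces
    # a category we don't know yet, instead of raising an error.
    return SUBQUOTA_TRANSLATIONS.get(description, description)
```

The `.get()` fallback is what keeps the pipeline alive when Congress adds or renames a subquota: unknown descriptions pass through untranslated rather than breaking the run.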

luizcavalcanti commented 7 years ago

I didn't do a notebook on this, only compared with interactive Python and some bash utils, if that counts as analysis. In short, I found out that the data is the same between the API versions.

About testing, what kind of tests do you foresee in this case? Is it worth validating the translation itself? Or even making tests fail if there are new subquota categories?

cuducos commented 7 years ago

I didn't do a notebook on this, only compared with interactive Python and some bash utils, if that counts as analysis. In short, I found out that the data is the same between the API versions.

I encourage you to either share this analysis here (as a comment) or, even better, once this is merged, add a notebook to the serenata-de-amor repo with the comparison between the two latest versions of reimbursements.xz (yours and the current latest).

About testing, what kind of tests do you foresee in this case? Is it worth validating the translation itself? Or even making tests fail if there are new subquota categories?

I would use unittest.mock to avoid I/O, basically. I don't need to actually run, for example, urlretrieve; I need to be sure it was called with the proper arguments. This means we don't depend on the file system (mocking read/write methods) or the network (mocking requests), i.e. faster tests, and tests restricted to what actually needs to be tested.

These mocks are not trivial, so I don't push everyone to write them — unless anyone is up for pairing on that one of these days ; )
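A minimal sketch of that approach (fetch_year and its URL are hypothetical stand-ins, not the toolbox's real API):

```python
from unittest.mock import patch

def fetch_year(year, target_dir):
    # Hypothetical downloader standing in for the toolbox's fetch routine.
    from urllib.request import urlretrieve
    url = 'http://example.com/Ano-{}.csv.zip'.format(year)
    urlretrieve(url, '{}/Ano-{}.csv.zip'.format(target_dir, year))

# Patch urlretrieve so no network or file system is touched; we only
# check that it was called with the expected arguments.
with patch('urllib.request.urlretrieve') as urlretrieve:
    fetch_year(2009, '/tmp/data')
    urlretrieve.assert_called_once_with(
        'http://example.com/Ano-2009.csv.zip',
        '/tmp/data/Ano-2009.csv.zip')
```

Because the import happens inside fetch_year, patching `urllib.request.urlretrieve` at the module level is enough; the function picks up the mock at call time.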

luizcavalcanti commented 7 years ago

Looking into all that, @cuducos.

Just to make it public: I found out yesterday that Congress is not sending the expense issue dates in the 2.0 dataset. PR on hold until they fix it. A complaint was filed with them; I'll update you guys and gals once they do something about it.

luizcavalcanti commented 7 years ago

Don't mind me, just rebasing with master to keep things saner when/if we ever merge this :P

cuducos commented 7 years ago

Any news from them related to the missing issue-date?

luizcavalcanti commented 7 years ago

The complaint has been getting several internal updates from the Câmara staff over the last week, but what those updates are is invisible to me. I can only hope they are talking about it and expect some public update soon.

[screenshot: complaint tracker showing internal updates, 2017-05-08 18:36]

Irio commented 7 years ago

@luizcavalcanti Can you send me (privately is ok) the exact message you sent, with contact id? I can forward it directly to the team responsible for open data in the Chamber of Deputies.

luizcavalcanti commented 7 years ago

Protocol: 170424-000119

"I would like to report that in version 2.0 of the expense report files for the parliamentary activity quota (CEAP), available at http://www2.camara.leg.br/transparencia/cota-para-exercicio-da-atividade-parlamentar/dados-abertos-cota-parlamentar, the issue dates (datEmissao field) are coming in empty in all formats (xml, csv, json and xlsx).

These data are available in version 1.0 (AnoAtual.zip and AnosAnteriores.zip), but not in version 2.0. It is important that this be fixed so that we can use the new CEAP data access format, which is much more efficient than the previous one."

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.1%) to 24.957% when pulling 48b720d01f52cfb4c6bd620222a26d907710d771 on luizcavalcanti:luizcavalcanti-new-congress-data-format into 15aaa6e3fbeee1524a9375585cb9d1362cb16719 on datasciencebr:master.

luizcavalcanti commented 7 years ago

A Jupyter notebook was made to validate this PR. It has its own PR: https://github.com/datasciencebr/serenata-de-amor/pull/241

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.1%) to 24.975% when pulling 5581c6e5c748b7d287560296a0e6a2edcd13e7e3 on luizcavalcanti:luizcavalcanti-new-congress-data-format into 15aaa6e3fbeee1524a9375585cb9d1362cb16719 on datasciencebr:master.

jtemporal commented 7 years ago

just passing by to mention that a version bump is required here

cc @cuducos

cuducos commented 7 years ago

just passing by to mention that a version bump is required here

Thanks! I was so enthusiastic about it that I was already forgetting… @luizcavalcanti, in setup.py you can make a micro version bump (if #71 is merged soon with 10.0.2, yours could be 10.0.3, I guess).

Also if you could check merge conflicts it helps a lot ; ) Otherwise I'll sort that later ; )
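The micro bump would look like this in setup.py (an excerpt with hypothetical surrounding fields; the 10.0.3 number assumes #71 lands first as 10.0.2):

```python
# setup.py (excerpt)
setup(
    name='serenata-toolbox',
    version='10.0.3',  # micro bump from 10.0.2
)
```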

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.1%) to 24.975% when pulling 623bbf610f15811abee0befc03f5ea53e84ec00a on luizcavalcanti:luizcavalcanti-new-congress-data-format into 5861d258327931c979685683ae644c7e226b059a on datasciencebr:master.

luizcavalcanti commented 7 years ago

Version bumped and whole branch rebased to current master

cuducos commented 7 years ago

Yay… that's a great contribution, @luizcavalcanti! Many thanks