Closed luizcavalcanti closed 7 years ago
This change is part of something I was discussing with @cuducos on the project's Telegram channel. The idea, as explained in the commit message, is to conform to the new data format for reimbursements. Besides the obvious XML-to-CSV change, there were little annoyances that had to be corrected to make it work. The data is not perfect: some entries have more columns than originally defined (especially in the 2009 data), and I have no idea if the congress IT people will ever correct it, so a couple of pandas parameters were used to make the data-loading process more lenient.
Ok… I just took a look (haven't run any code yet, just read through it). Right now I have two main concerns:
Sure.
About the changes in data structures, only some column names changed to camel case (some didn't). Since we were performing case-sensitive filtering on those, I updated them. I didn't notice any difference in the categories (except that they corrected an extra blank space in one of them) or their IDs. The biggest weirdness I faced was some lines in the 2009 data that had 31 columns instead of 29 (an error_bad_lines=False was added to load_csv() to discard those lines).
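The lenient loading described above can be sketched like this. Note this is only an illustration: the PR used `error_bad_lines=False`, which recent pandas releases (2.0+) replaced with `on_bad_lines="skip"`, and the sample data below is made up.

```python
import io

import pandas as pd


def load_csv(source):
    # Rows with more fields than the header (e.g. the 31-column lines
    # in the 2009 data) are silently discarded instead of raising.
    # Older pandas spelled this flag error_bad_lines=False.
    return pd.read_csv(source, dtype=str, on_bad_lines="skip")


sample = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"
df = load_csv(io.StringIO(sample))
# The malformed 4-field row is dropped, leaving two data rows.
```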
Regarding the tests, I completely agree it's a little wrong (:P) the way it is. I just didn't feel this was something to amend in this specific commit, but I'd say we should do it in this PR, or in another one soon enough.
As requested WIP label removed ;)
Can you guys please put it again under the "Work in Progress" label? Congress changed the data structure, apparently. Tests are breaking again, and several years have missing or changed categories (subquota descriptions). I'll try to understand what's going on, whether it's temporary, and try to make the code more robust to those glitches/changes.
Can you guys please put it again under the "Work in Progress" label?
Done!
There was a change in how subquota descriptions are translated. I don't know how Pythonic it is, but it's fast enough and, most importantly, it does not break if something changes in the backend (descriptions have changed and new subquotas have been added in the past, and probably will be again).
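A minimal sketch of such a break-proof translation; the mapping entries and the function name here are hypothetical illustrations, not the project's real table. The key point is that unknown descriptions pass through unchanged, so a new subquota upstream cannot raise an error:

```python
# Hypothetical translation table; the real project maps many more
# subquota descriptions than these illustrative entries.
TRANSLATIONS = {
    "PASSAGENS AÉREAS": "Flight tickets",
    "TELEFONIA": "Telecommunication",
}


def translate_subquota(description):
    # Fall back to the original text for unknown descriptions, so
    # backend changes (new or renamed subquotas) don't break the load.
    return TRANSLATIONS.get(description, description)
```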
I did no notebook on this, only compared with interactive Python and some bash utils, if that counts as analysis. In short, I found that the data is the same between the API versions.
About testing, what kind of tests do you foresee in this case? Is it worth validating the translation itself? Or even making tests fail if there are new subquota categories?
I did no notebook on this, only compared with interactive Python and some bash utils, if that counts as analysis. In short, I found that the data is the same between the API versions.
I encourage you to either share this analysis here (as a comment) or, even better, once this is merged, add a notebook to the serenata-de-amor repo with a comparison between the two latest versions of reimbursements.xz (yours and the current latest).
About testing, what kind of tests do you foresee in this case? Is it worth validating the translation itself? Or even making tests fail if there are new subquota categories?
I would use unittest.mock to avoid I/O, basically. I don't need to actually run urlretrieve, for example; I need to be sure it was called with the proper arguments. This means we don't depend on the file system (mocking read/write methods) or the network (mocking requests), i.e. faster tests, and tests restricted to what actually needs to be tested.
These mocks are not trivial, so I don't push everyone to write them — unless anyone is up for pairing on that one of these days ; )
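To illustrate the idea, here is a small sketch of mocking urlretrieve the way described above. The function under test, its URL pattern, and file names are all made up for this example; only the mocking technique itself is the point:

```python
from unittest import TestCase
from unittest.mock import patch


# Hypothetical function under test: assume it fetches one year of
# reimbursement data via urllib.request.urlretrieve.
def download_year(year):
    from urllib.request import urlretrieve
    url = "http://example.com/ceap/Ano-{}.csv.zip".format(year)
    urlretrieve(url, "Ano-{}.csv.zip".format(year))


class TestDownloadYear(TestCase):
    @patch("urllib.request.urlretrieve")
    def test_calls_urlretrieve_with_expected_arguments(self, retrieve):
        # No network access happens: we only assert the call arguments.
        download_year(2009)
        retrieve.assert_called_once_with(
            "http://example.com/ceap/Ano-2009.csv.zip",
            "Ano-2009.csv.zip",
        )
```

The same pattern (mocking `open`, `os.remove`, etc.) keeps the file-system-dependent tests fast as well.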
Looking into all that, @cuducos.
Just to make it public: I found out yesterday that congress is not sending the expense issue dates in the 2.0 dataset. PR on hold until they fix it. A complaint was filed with them; I'll update you guys and gals once they do something about it.
Don't mind me, just rebasing with master to keep things saner when/if we ever merge this :P
Any news from them related to the missing issue date?
The complaint has received several internal updates from the Câmara staff over the last week, but what those updates are is invisible to me. I can only hope they are talking about it, and I expect some public update soon.
@luizcavalcanti Can you send me (privately is ok) the exact message you sent, with contact id? I can forward it directly to the team responsible for open data in the Chamber of Deputies.
Protocol: 170424-000119
"I would like to report that in version 2.0 of the parliamentary quota (CEAP) expense files, available at http://www2.camara.leg.br/transparencia/cota-para-exercicio-da-atividade-parlamentar/dados-abertos-cota-parlamentar, the issue dates (datEmissao field) are coming in empty in all formats (xml, csv, json and xlsx).
These data are available in version 1.0 (AnoAtual.zip and AnosAnteriores.zip), but not in version 2.0. It is important that this be fixed so that we can use the new CEAP data access format, which is much more efficient than the previous one."
A Jupyter notebook was made to validate this PR. It has its own PR: https://github.com/datasciencebr/serenata-de-amor/pull/241
Just passing by to mention that a version bump is required here.
cc @cuducos
Just passing by to mention that a version bump is required here.
Thanks! I was so enthusiastic about it I was already forgetting… @luizcavalcanti, in setup.py you can make a micro version bump (if #71 is merged soon with 10.0.2, yours could be 10.0.3, I guess).
Also, if you could check the merge conflicts it would help a lot ; ) Otherwise I'll sort that out later ; )
Version bumped and the whole branch rebased onto current master.
Yay… that's a great contribution, @luizcavalcanti! Many thanks
The Brazilian congress made available a new version of the reimbursement dataset, which can now be downloaded directly in CSV format, split by year. This somewhat simplifies the current fetching routine, but it also demands a lot of small changes in the way the dataset is acquired, translated, loaded, and merged.
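The per-year loading and merging could be sketched as below; the in-memory "files" and column names are illustrative assumptions, not the real CEAP schema:

```python
import io

import pandas as pd


def merge_years(sources):
    # Load one CSV per year and concatenate them into a single dataset.
    frames = [pd.read_csv(source, dtype=str) for source in sources]
    return pd.concat(frames, ignore_index=True)


# Illustrative stand-ins for the downloaded per-year CSV files.
year_2009 = io.StringIO("ideDocumento,txtDescricao\n1,FUEL\n")
year_2010 = io.StringIO("ideDocumento,txtDescricao\n2,FLIGHT\n")
merged = merge_years([year_2009, year_2010])
```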
In this rather large commit, the unit tests were also modified to conform to the new way of fetching, and the "convert to csv" test was removed, since this operation is no longer needed. The convert_to_csv() method of the CEAPDataset class was kept, though, so we don't break Rosie until it removes the call to this method.
Another quite drastic change was made to the test cases' names: a number was added to them, so we can "make sure" a proper order is followed during integration tests. I strongly believe there is a better way of doing this; I can work on it in a near-future commit.
Signed-off-by: Luiz Carlos Cavalcanti cavalcanti.luiz@gmail.com