Closed rodolfo-viana closed 5 years ago
Very good and important point, @rodolfo-viana! I'm almost sure this detail was unnoticed until now… Surely we need to take that into account.
although dropping Flight ticket issue from our dataset is not a bug, shouldn't we reconsider having it back?
I believe it was never our intention to cut out an entire sub quota.
this detail was unnoticed until now
And I agree with this.
In this category, although congresspersons do not have to pay first and get the value reimbursed later, there is public money being spent.
You are completely right! I believe the way to go here is to find a way to cut out receipts that weren't reimbursed and still keep the 999 sub quota expenses in our dataset.
@cuducos any ideas?
I believe this subquota was cut out (not by the Serenata team, of course) while the Chamber was setting up the second version of its open data. I say that because I had read a notebook that covered Flight ticket issue: https://github.com/datasciencebr/serenata-de-amor/blob/master/develop/2016-08-13-irio-descriptive-analysis.ipynb
I guess that when they changed their dataset something went wrong, although another analysis had led to a positive result: https://github.com/datasciencebr/serenata-de-amor/blob/master/develop/2017-05-21-luizcavalcanti-chamber-ceap-api-version-comparison.ipynb
Anyway, if I can help somehow, just let me know.
@cuducos any ideas?
Is the data available in the new version of the API or only in the XML version?
If it is, I think we simply enhance the if statement that excludes reimbursements with zero values to keep them if subquota is 999. And surely mention that in the documentation because people will ask about that.
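A minimal sketch of that idea, assuming each reimbursement is a dict-like row. The field names `subquota_number` and `total_net_value` and the helper `should_keep` are hypothetical stand-ins, not the toolbox's actual names:

```python
# Subquota 999 is 'Flight ticket issue', per the discussion in this thread.
FLIGHT_TICKET_ISSUE = 999


def should_keep(row):
    """Return True if a row should stay in the dataset.

    Hypothetical helper: `subquota_number` and `total_net_value` are assumed
    field names used only for illustration.
    """
    if row["subquota_number"] == FLIGHT_TICKET_ISSUE:
        # Flight tickets are never reimbursed, so a zero value is expected.
        return True
    # Every other subquota keeps the existing rule: drop zero-valued rows.
    return row["total_net_value"] != 0


rows = [
    {"subquota_number": 999, "total_net_value": 0.0},
    {"subquota_number": 13, "total_net_value": 0.0},
    {"subquota_number": 13, "total_net_value": 42.5},
]
kept = [row for row in rows if should_keep(row)]
```

With the made-up rows above, the zero-valued 999 row survives while the zero-valued row from another subquota is still dropped.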
Is the data available in the new version of the API or only in the XML version?
Yes it is, I checked to find out if we were cutting it out or by any chance the chamber was. It is the filter we have that was dropping lines with 0 reimbursement value.
If it is, I think we simply enhance the if statement that excludes reimbursements with zero values to keep them if subquota is 999.
That was my initial idea!
I took the liberty to rename this issue since we agreed on an approach to have the 999 sub quota in the final dataset ;)
I checked the .csv files and compared 999 to other subquotas. It lacks three fields: batch_number (that we hardly use), reimbursement_number (ditto), and document_id.
I believe rows with no document_id are being dropped in reimbursements.py:
    def group(self, receipts):
        print('Dropping rows without document_value or reimbursement_number…')
        subset = ('document_value', 'reimbursement_number')
        receipts = receipts.dropna(subset=subset)
        groupby_keys = ('year', 'applicant_id', 'document_id')
        receipts = receipts.dropna(subset=subset + groupby_keys)
Is it the issue?
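To check that hypothesis locally, here is a small sketch with made-up data (the values are invented for illustration; only the column names come from the snippet above). It shows how `dropna` would discard a row that has neither `reimbursement_number` nor `document_id`:

```python
import pandas as pd

# Made-up rows: the second one mimics a 999 row without reimbursement_number
# and document_id.
receipts = pd.DataFrame({
    "year": [2017, 2017],
    "applicant_id": [1, 2],
    "document_id": [100.0, None],
    "document_value": [50.0, 300.0],
    "reimbursement_number": [5000.0, None],
})

subset = ["document_value", "reimbursement_number"]
groupby_keys = ["year", "applicant_id", "document_id"]

# Same two calls as in the snippet, with lists instead of tuples.
after_first_drop = receipts.dropna(subset=subset)
after_second_drop = after_first_drop.dropna(subset=subset + groupby_keys)

# Rows present in the input but absent from the final result.
dropped = receipts.loc[~receipts.index.isin(after_second_drop.index)]
```

In this toy frame the row is already lost at the first `dropna`, because `reimbursement_number` is missing too; the second call would catch it anyway through `document_id`.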
I wouldn't be so sure they are being dropped. Can anyone confirm that in the source the 999 subquota rows have document_ids?
The thing is that document_id is not documented anywhere in their material. We guess it is a kind of unique identifier for the reimbursement. As flight tickets are not actually reimbursed, maybe they never receive this identifier at all…
It seems this def drops rows lacking reimbursement_number and document_id, both nonexistent fields in 999.
Anyway, if I can help somehow, let me know. :)
It seems this def drops rows lacking reimbursement_number and document_id, both nonexistent fields in 999.
So this is the problem: they don't come with a document_id. That's awful. Anyway… pandas can catch that quite easily I guess. Is that right, @jtemporal?
We're gonna have to discuss this along with Jarbas's architecture too, since the whole API is based on the uniqueness of document_id.
Just to explain the process I went through to come to this idea: I downloaded the .csv file with this year's expenses, opened it in Excel, picked 999 and other subquotas, looked up which fields these other subquotas have and 999 does not, and found the three mentioned above.
I am not sure if it is different in .xml. I believe it is not.
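The same comparison can be scripted instead of done by hand in Excel. A rough sketch, assuming a CSV with a `subquota_number` column (the sample data and the helper `empty_fields_by_subquota` are made up for illustration):

```python
import csv
import io
from collections import defaultdict

# Made-up sample standing in for the downloaded .csv file.
SAMPLE = """subquota_number,batch_number,reimbursement_number,document_id,document_value
13,111,5000,100,50.0
13,112,5001,101,20.0
999,,,,300.0
999,,,,450.0
"""


def empty_fields_by_subquota(csv_file):
    """Map each subquota to the set of fields that are empty in all its rows."""
    filled = defaultdict(set)  # subquota -> fields with at least one value
    fields = []
    for row in csv.DictReader(csv_file):
        fields = list(row.keys())
        subquota = row["subquota_number"]
        filled[subquota].update(name for name, value in row.items() if value)
    return {sq: set(fields) - names for sq, names in filled.items()}


empty = empty_fields_by_subquota(io.StringIO(SAMPLE))
# Fields that are always empty for 999 but filled for subquota 13.
only_missing_in_999 = empty["999"] - empty["13"]
```

Pointing the function at the real file (opened with `open(path, newline="")`) instead of the in-memory sample would repeat the manual check for every subquota at once.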
Is that right, @jtemporal?
I guess so, but I'll test it today with @rodolfo-viana ;)
Yep! A talk about Jarbas's architecture is required. To have sub quota 999 (and also 10 and 11, for that matter) back in our dataset, we need to study the implications and maybe revisit Jarbas's whole structure. Generating a separate dataset for these quotas could be a way forward for now.
Jarbas used to have a composite unique ID made of year, applicant_id and document_id. No problem in recreating something like that. I think this is a heads-up, but the first thing is to generate the data, bring it in and see what crashes (on our local machines). The main question is not about Jarbas itself, it is about the data (what is the unique ID for each row? just the sequential index?). Even if there is none, we can work around it (e.g. no detail view, only list views).
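One way to picture that workaround, as a sketch only: build the composite key when document_id is present and fall back to the sequential index otherwise. The function `row_key` and the key format are hypothetical, not how Jarbas does it:

```python
def row_key(index, row):
    """Build an identifier for a row (hypothetical helper).

    Uses year, applicant_id and document_id when document_id exists;
    otherwise falls back to the row's sequential index.
    """
    document_id = row.get("document_id")
    if document_id is None:
        # Rows such as subquota 999 have no document_id to build on.
        return f"index-{index}"
    return f"{row['year']}-{row['applicant_id']}-{document_id}"


rows = [
    {"year": 2017, "applicant_id": 1, "document_id": 100},
    {"year": 2017, "applicant_id": 2, "document_id": None},
]
keys = [row_key(index, row) for index, row in enumerate(rows)]

# A key scheme is only usable if no two rows collide.
assert len(keys) == len(set(keys))
```

A sequential index is only stable while the dataset is generated in the same order every time, so this would still need the discussion mentioned above before anything relies on it.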
@jtemporal figured out why the dataset is missing subquota 999, 'Flight ticket issue' (see #106). According to her findings:
It is not a bug indeed. The reason: subquota 999, 'Flight ticket issue' does not generate reimbursement value. According to the Chamber of Deputies:
I understand the mission of this project regarding reimbursement and how this work flows around reimbursement values. But taking it strictly, we disregard expenses on which the congressperson does not have to get reimbursed; we disregard subquotas in which the congressperson has a monthly value to deduct from.
In this category, although congresspersons do not have to pay first and get the value reimbursed later, there is public money being spent. And a lot of it: over R$ 100 million during the current term, putting Flight ticket issue in second place among subquotas with most expenses.
As an example of the relevance of having this subquota in our dataset, a few years ago there was a public scandal called "Farra das passagens", about congresspersons using this specific subquota to issue tickets for their family members and friends.
So I ask you guys: although dropping Flight ticket issue from our dataset is not a bug, shouldn't we reconsider having it back?