okfn-brasil / serenata-de-amor

🕵 Artificial Intelligence for social control of public administration | **This repository does not receive frequent updates. Check out the README**
https://serenata.ai/en
MIT License

Collect data on campaign donations #275

Open cuducos opened 7 years ago

cuducos commented 7 years ago

Collect data about electoral campaign donors to allow analyses like that one (pt-br): politicians spending money at companies that are also campaign donors.
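
To make the goal concrete, here is a minimal sketch of the kind of cross-check this data would enable: finding company IDs that show up both as campaign donors and as suppliers paid by a politician. The column names `donor_cnpj_or_cpf` and `supplier_cnpj`, and the sample IDs, are assumptions made up for illustration; the real datasets may be shaped differently.

```python
import re


def normalize_id(raw):
    """Keep only digits so '12.345.678/0001-90' and '12345678000190' compare equal."""
    return re.sub(r"\D", "", raw or "")


def suppliers_that_donated(donations, reimbursements):
    """Return the IDs that appear both as a donor and as a supplier."""
    # Column names below are hypothetical, chosen only for this sketch.
    donors = {normalize_id(row["donor_cnpj_or_cpf"]) for row in donations}
    suppliers = {normalize_id(row["supplier_cnpj"]) for row in reimbursements}
    return (donors & suppliers) - {""}


# Made-up sample rows, not real data.
donations = [
    {"donor_cnpj_or_cpf": "12.345.678/0001-90"},
    {"donor_cnpj_or_cpf": "98765432000110"},
]
reimbursements = [
    {"supplier_cnpj": "12345678000190"},
    {"supplier_cnpj": "11111111000111"},
]
print(suppliers_that_donated(donations, reimbursements))
```

Normalizing the IDs to digits before comparing is a design choice here, since the same CNPJ may be written with or without punctuation in different sources.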

cuducos commented 7 years ago

Related to #76

disouzaleo commented 7 years ago

Any idea how often this collection should happen? At least every election? Is the idea here just to build the dataset?

cuducos commented 7 years ago

Hi @Lrodlima,

> Any idea how often this collection should happen?

Take a look at the article linked in the opening post; the author discusses the data source there. We also have some examples of data collection in research/src/ here and mainly in the toolbox.

> At least every election?

I think we could have a single database with year as a column. Maybe this article can help in structuring the dataset ; )
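
A rough sketch of the "single database with year as a column" idea, using only the standard library: each yearly CSV is read and every row is tagged with its election year before being merged. The `;` delimiter and the in-memory sample files are assumptions for illustration, not a description of the actual source files.

```python
import csv
import io


def merge_years(files_by_year):
    """Yield every row from every yearly CSV, tagged with its election year."""
    for year, handle in sorted(files_by_year.items()):
        # The ';' delimiter is an assumption about the source files.
        for row in csv.DictReader(handle, delimiter=";"):
            row["year"] = year
            yield row


# Made-up sample content standing in for the downloaded files.
files_by_year = {
    2010: io.StringIO("Nome do doador;Valor receita\nACME;100,00\n"),
    2014: io.StringIO("Nome do doador;Valor receita\nACME;250,00\nFoo;30,00\n"),
}
rows = list(merge_years(files_by_year))
print(len(rows), rows[0]["year"], rows[-1]["year"])
```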

> Is the idea here just to build the dataset?

Yep. The idea is to build a dataset, but the linked issue (#76) already proposes a possible analysis using this data. Basically, this issue is about the data collection and the other one about the analysis, but there is absolutely no problem in tackling both issues with one PR/contribution ; )

lacerdamarcelo commented 7 years ago

Hey guys,

This is my first time trying to contribute to this awesome project. As my first attempt, I decided to help solve this issue. As you can see in this link, which takes you to my fork of the project, I've created a script that takes all the CSVs from 2010 until the last election (available through the link in the description) and puts them together into one single database of donations for candidates, another one of donations for parties and a third one of donations for committees. The generated files can be temporarily found in this link. I did not open a PR because I have a few questions and still some work to do, which depends on the answers to my questions:

As far as I know, I need to not only provide the collected data, but also provide a script that makes the whole process from the beginning, from the download of the csvs to the process of saving the databases in .xz format. Did I get it right? Is there any example where I can see how should I do this?

I need to translate all columns to english, right? The problem is that there are technical words in many of them and I am not sure if I would be able to translate. Is it possible for anyone here to help with this translation?

Is it ok if I keep these 3 files separated? Even though they are about donations, these donations go to different types of destinations (candidates, parties and committees).

I'm sharing these links just so you can take a preliminary look at what I'm doing. I know there are a few steps to follow before sending a final PR for the data/script, regarding names, etc.

I think with this dataset we could start working on issue #76.

cuducos commented 7 years ago

Hi @lacerdamarcelo — many thanks for this contribution!

> As far as I know, I need to not only provide the collected data, but also provide a script that makes the whole process from the beginning, from the download of the csvs to the process of saving the databases in .xz format. Did I get it right? Is there any example where I can see how should I do this?

Yep, you got it right. A .py that does the data collection from the source is the way to go. We can generate the datasets on our side by running and testing your script, but you can provide a link as you did. Regarding examples, as I just told @Lrodlima in the comment above yours: we have some examples of data collection in research/src/ here and mainly in the toolbox.
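
For the last step of such a script, saving in .xz format, a minimal sketch with the standard library's `lzma` module could look like the following. The function names, the file name `donations.xz` and the sample row are hypothetical; this is not the project's actual helper.

```python
import csv
import lzma
import os
import tempfile


def save_as_xz(rows, fieldnames, path):
    """Write dict rows as a CSV compressed with LZMA (.xz)."""
    with lzma.open(path, mode="wt", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def load_from_xz(path):
    """Read the compressed CSV back into a list of dicts."""
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


# Made-up sample row, written to a temporary folder and read back.
rows = [{"year": 2014, "donor_name": "ACME"}]
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "donations.xz")
    save_as_xz(rows, ["year", "donor_name"], path)
    restored = load_from_xz(path)
print(restored)
```

Note that values come back as strings after the round trip, so numeric columns would need to be converted again when loading.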

> I need to translate all columns to english, right? The problem is that there are technical words in many of them and I am not sure if I would be able to translate. Is it possible for anyone here to help with this translation?

Sure thing, just share the words/expressions and we'll help you out.

> Is it ok if I keep these 3 files separated? Even though they are about donations, these donations go to different types of destinations (candidates, parties and committees).

I think it makes sense. Is the structure of the CSVs different (I mean, do they have different columns)?

lacerdamarcelo commented 7 years ago

Hey guys,

Thanks @cuducos for the reply. I will keep working on the script to make it the way it must be.

Regarding the column names, here they are:

Cargo
CNPJ Prestador Conta
Cod setor econômico do doador
Cód. Eleição
CPF do candidato
CPF do vice/suplente
CPF/CNPJ do doador
CPF/CNPJ do doador originário
Data da receita
Data e hora
Desc. Eleição
Descrição da receita
Entrega em conjunto?
Especie recurso
Fonte recurso
Municipio
Nome candidato
Nome da UE
Nome do doador
Nome do doador (Receita Federal)
Nome do doador originário
Nome do doador originário (Receita Federal)
Numero candidato
Número candidato doador
Numero do documento
Número partido doador
Numero Recibo Eleitoral
Numero UE
Sequencial Candidato
Setor econômico do doador
Setor econômico do doador originário
Sigla da UE
Sigla Partido
Sigla UE doador
Tipo doador originário
Tipo receita
UF
Valor receita

I would really appreciate it if you guys could translate them. Many of them are from a quite specific field and I do not feel confident enough to translate them by myself.

Regarding the files I linked, please do not use them. I just realized that there are a few duplicate columns (OK, you can use them, but you should be aware of this issue). Sorry about that. Anyway, I will release the script as soon as it is ready to be used and the generated data is correct.
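
A small sketch of how duplicate columns like these could be detected before the files are published. Treating names that differ only by case or surrounding whitespace as duplicates is an assumption; the comment does not say what caused the duplication. The sample header is made up.

```python
from collections import Counter


def duplicated_columns(header):
    """Return the column names that occur more than once in a CSV header."""
    # Assumption: names differing only by case or padding count as duplicates.
    counts = Counter(name.strip().lower() for name in header)
    return sorted(name for name, count in counts.items() if count > 1)


# Made-up header for illustration.
header = ["UF", "Sigla Partido", "uf ", "Valor receita", "Sigla Partido"]
print(duplicated_columns(header))
```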

gusrabbit commented 7 years ago

Hey guys, gave it my best shot. Counting on others to revise my work.

Cargo = position/office
CNPJ Prestador Conta = ? (CNPJ is a company's ID) I believe it's the party's ID in case the donation was received from another party (more on that in CPF/CNPJ do doador originário)
Cod setor econômico do doador = Donor's economic sector code (I believe it's a classification)
Cód. Eleição =
CPF do candidato = Candidate's CPF (equivalent to a social security number)
CPF do vice/suplente = deputy's/substitute's CPF
CPF/CNPJ do doador = Donor's CPF or CNPJ (CNPJ is a company's ID)
CPF/CNPJ do doador originário = Original donor's CPF or CNPJ (in case the donation was received from another party or candidate, it's necessary to disclose the original donor who donated that money to them)
Data da receita = Revenue date
Data e hora = Date and time
Desc. Eleição = Election description?
Descrição da receita = Revenue description
Entrega em conjunto? = ? Whether or not it was a joint delivery (not sure what it means)
Especie recurso = The kind of revenue
Fonte recurso = Source of the revenue
Municipio = Municipality
Nome candidato = Candidate's name
Nome da UE = Federation Unit's name (State's name)
Nome do doador = Donor's name
Nome do doador (Receita Federal) = ? Donor's name in the Brazilian IRS (not sure about this one)
Nome do doador originário = Original donor's name
Nome do doador originário (Receita Federal) = ? Original donor's name in the Brazilian IRS (not sure about this one)
Numero candidato = Candidate's number
Número candidato doador = Donating candidate's number (I believe it's for the case when the money was donated by another candidate)
Numero do documento = Document's number
Número partido doador = Donor party's number (I believe it's for the case when the money was donated by another party)
Numero Recibo Eleitoral = Electoral receipt number
Numero UE = ? (not sure) maybe Coalition number?? just a guess
Sequencial Candidato = (not sure) Candidate's sequential?
Setor econômico do doador = Donor's economic sector
Setor econômico do doador originário = Original donor's economic sector
Sigla da UE = maybe Coalition's acronym?? just a guess
Sigla Partido = Party's acronym
Sigla UE doador = maybe Donating coalition's acronym?? just a guess
Tipo doador originário = Original donor's type
Tipo receita = Type of revenue
UF = Federation Unit (State)
Valor receita = Revenue value

If anyone could review this, especially the parts I wasn't sure about, I'd appreciate it. Hope it helps.

cuducos commented 7 years ago

Many many thanks @lacerdamarcelo and @GustavoSFCoelho! Gonna work on Gustavo's awesome suggestions and format them in snake case, as we use in Pandas and CSVs:

| pt-BR | en-US |
|---|---|
| Cargo | post |
| CNPJ Prestador Conta | accountable_company_id |
| Cod setor econômico do doador | donor_economic_sector_id |
| Cód. Eleição | election_id |
| CPF do candidato | candidate_cpf |
| CPF do vice/suplente | substitute_cpf |
| CPF/CNPJ do doador | donor_cnpj_or_cpf |
| CPF/CNPJ do doador originário | original_donor_cnpj_or_cpf |
| Data da receita | revenue_date |
| Data e hora | date_and_time |
| Desc. Eleição | election_description |
| Descrição da receita | revenue_description |
| Entrega em conjunto? | batch |
| Especie recurso | type_of_revenue |
| Fonte recurso | source_of_revenue |
| Municipio | city |
| Nome candidato | candidate_name |
| Nome da UE | electoral_unit_name |
| Nome do doador | donor_name |
| Nome do doador (Receita Federal) | donor_name_for_federal_revenue |
| Nome do doador originário | original_donor_name |
| Nome do doador originário (Receita Federal) | original_donor_name_for_federal_revenue |
| Numero candidato | candidate_number |
| Número candidato doador | donor_candidate_number |
| Numero do documento | document_number |
| Número partido doador | donor_party_number |
| Numero Recibo Eleitoral | electoral_receipt_number |
| Numero UE | electoral_unit_number |
| Sequencial Candidato | candidate_sequence |
| Setor econômico do doador | donor_economic_sector |
| Setor econômico do doador originário | original_donor_economic_sector |
| Sigla da UE | electoral_unit_abbreviation |
| Sigla Partido | party_acronym |
| Sigla UE doador | donor_electoral_unit_abbreviation |
| Tipo doador originário | original_donor_type |
| Tipo receita | revenue_type |
| UF | state |
| Valor receita | revenue_value |
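
One possible way to apply the mapping above to a CSV header is a plain dictionary lookup, sketched below with a handful of entries from the list. The function name `translate_header` is hypothetical, and failing loudly on unknown columns is a design choice of this sketch so that untranslated names (like the ones that surface later in this thread) are noticed early.

```python
# A few entries copied from the list above; the full mapping would have all of them.
TRANSLATIONS = {
    "Cargo": "post",
    "CPF do candidato": "candidate_cpf",
    "Nome do doador": "donor_name",
    "Sigla Partido": "party_acronym",
    "UF": "state",
    "Valor receita": "revenue_value",
}


def translate_header(header):
    """Translate pt-BR column names to snake case, failing on unknown names."""
    missing = [name for name in header if name not in TRANSLATIONS]
    if missing:
        raise KeyError(f"No translation for: {', '.join(missing)}")
    return [TRANSLATIONS[name] for name in header]


print(translate_header(["UF", "Sigla Partido", "Valor receita"]))
```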

pedrommone commented 7 years ago

@cuducos you missed an underscore here 👉 original donor_name_for_federal_revenue

gusrabbit commented 7 years ago

@cuducos there is a 'c' missing in original_donor_cnpj_or_pf -> original_donor_cnpj_or_cpf

cuducos commented 7 years ago

Thanks @pedrommone and @GustavoSFCoelho — edited/fixed ; )

lacerdamarcelo commented 7 years ago

Wow! Thanks a lot guys! I'm still working on the script to make it fully automated. Probably by this weekend I'll finish and create the PR.

felipeeeantunes commented 7 years ago

Hi guys, I've been working on the donations. In particular, I have code here (one analysis using Benford's law and another using complex networks): https://github.com/felipeeeantunes/DataScienceBR (resources for two papers, one accepted in Physica A and the other already submitted).

An already munged data set can be found here: https://www.kaggle.com/felipeleiteantunes/electoral-donations-brazil2014

I'll follow this discussion and contribute. In particular, I developed Python classes to access the donations, which I will publish soon (next weekend).

lacerdamarcelo commented 7 years ago

Hey guys,

I'm almost done with the script. I would like to know where in the repository the code should live (is there any specific place for these scripts?) and where I should save the .xz files.

I would like to share again the current code here just for you guys to check if there are things that I should fix (to follow some standard, maybe).

Thanks!

cuducos commented 7 years ago

Hi @lacerdamarcelo : )

> I would like to know where in the repository should the code be (is there any specific place for these scripts?) and where should I save the .xz files.

The contribution guide has a section explaining the purpose of each directory. I think it's a good idea for you to review that in case of doubt: we explain why data collection scripts go in research/src/ and why we don't commit the datasets themselves.

> I would like to share again the current code here just for you guys to check if there are things that I should fix (to follow some standard, maybe).

I would encourage you to open a PR, even if you want to name it a WIP (work in progress) PR, since it's way easier to do code review and give feedback using the PR features here on GitHub ; )

lacerdamarcelo commented 7 years ago

Hey @cuducos

Here is the PR: https://github.com/datasciencebr/serenata-de-amor/pull/290

In the description I listed what I still need to do to finish the code. Waiting for feedback about the project's standards.

lacerdamarcelo commented 7 years ago

Guys, I've updated the code and tried to improve the readability as much as I could. More comments in the PR. Waiting for new feedback!

I would like to ask you guys for two new translations which I somehow skipped when I first asked for help:

Sequencial Diretorio
Tipo diretorio

P.S.: thanks for the latter feedback, I've learned new things =)

cuducos commented 7 years ago

> P.S.: thanks for the latter feedback, I've learned new things =)

❤️

> Sequencial Diretorio
> Tipo diretorio

Do you know what diretório means in this context? In some contexts it could be translated as union (Student Union is sometimes an acceptable translation for Diretório Acadêmico), or committee, or board. I'm not sure. A bit of context might clarify the issue ; )

lacerdamarcelo commented 7 years ago

Honestly, I have no clue =( Could anyone else clarify this issue?

I will try to find someone able to help us...

lacerdamarcelo commented 7 years ago

@cuducos, I found this link.

Maybe we could translate Diretório as Party Board, as I found evidence here that this could be the correct translation. What do you think?

cuducos commented 7 years ago

Yep, party board makes sense ; ) Many thanks!

lacerdamarcelo commented 7 years ago

Translation included =)

belisards commented 7 years ago

Hey,

I don't know if it's too late now, but we ( @rafapolo @belisards @turicas ) have been trying to standardize the campaign finance data. We created a bash script that downloads, cleans and uploads all the CSVs from TSE (our "Electoral Court") into a single SQL database. We have all the data from all campaigns since 2002, and it's possible to import "semi-official" data from 1994 and 1998 too.

There is still some work to do to optimize the data and queries, but maybe it could be useful for you: http://github.com/rafapolo/tribuna
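
As a rough illustration of the clean-and-load part of such a pipeline (this is not the tribuna script, which is written in bash), the sketch below loads CSV rows into a single SQLite table with Python's standard library. The table layout, the `;` delimiter, the Brazilian number format handling and the sample row are all assumptions for illustration.

```python
import csv
import io
import sqlite3


def parse_brl(value):
    """Convert a number written like '1.250,50' into a float."""
    return float(value.replace(".", "").replace(",", "."))


def load_into_sqlite(connection, handle, year):
    """Clean the rows of one CSV and insert them into a single table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS donations "
        "(year INTEGER, donor_name TEXT, revenue_value REAL)"
    )
    rows = [
        (year, row["Nome do doador"].strip(), parse_brl(row["Valor receita"]))
        for row in csv.DictReader(handle, delimiter=";")
    ]
    connection.executemany("INSERT INTO donations VALUES (?, ?, ?)", rows)
    return len(rows)


# Made-up sample content, loaded into an in-memory database.
connection = sqlite3.connect(":memory:")
sample = io.StringIO("Nome do doador;Valor receita\n ACME ;1.250,50\n")
inserted = load_into_sqlite(connection, sample, 2014)
total = connection.execute("SELECT SUM(revenue_value) FROM donations").fetchone()[0]
print(inserted, total)
```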