Open cuducos opened 7 years ago
Related to #76
Any idea of how often should this collection happen? At least each election? The idea here is just to build the dataset?
Hi @Lrodlima,
Any idea of how often should this collection happen?
Take a look at the article linked in the opening post; the author discusses the source of the data there. We also have some examples of data collection in research/src/ here, and mainly in the toolbox.
At least each election?
I think we could have a single database with the year as a column. Maybe this article can help in structuring the dataset ; )
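A minimal sketch of the "single database with year as a column" idea, using only the Python standard library. The sample rows, the `;` delimiter and the `add_year`/`combine` helpers are assumptions made up for illustration; the real source files have many more columns.

```python
import csv
import io

# Hypothetical sample data: one small CSV per election year.
SAMPLES = {
    2010: "Nome candidato;Valor receita\nFulano;100,00\n",
    2012: "Nome candidato;Valor receita\nBeltrano;250,50\n",
}


def add_year(csv_text, year, delimiter=";"):
    """Yield each row of a CSV as a dict with an extra `year` key."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        row["year"] = year
        yield row


def combine(samples):
    """Stack the rows of every year into one list, oldest year first."""
    rows = []
    for year in sorted(samples):
        rows.extend(add_year(samples[year], year))
    return rows


rows = combine(SAMPLES)
```

With the year stored on every row, a new election only appends rows instead of adding a new file to keep track of.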
The idea here is just to build the dataset?
Yep. The idea is to build a dataset, but the linked issue (#76) already proposes a possible analysis using this data. Basically this issue is about the data collection and the other one is about the analysis, but there is absolutely no problem in tackling both issues with one PR/contribution ; )
Hey guys,
This is my first time trying to contribute to this awesome project, so as a first attempt I decided to help solve this issue. As you can see in this link, which takes you to my fork of the project, I've created a script that takes all the CSVs from 2010 until the last election (available through the link in the description) and puts them together into three databases: one of donations to candidates, one of donations to parties and one of donations to committees. The generated files can be temporarily found in this link. I have not opened a PR yet because I have a few questions and still some work to do, which depends on the answers to my questions:
I sent these links just so you can take an early look at what I'm doing. I know there are a few steps to follow before sending a final PR for the data/script, regarding names, etc.
I think with this data set we could start working on issue #76 .
Hi @lacerdamarcelo — many thanks for this contribution!
As far as I know, I need to provide not only the collected data, but also a script that runs the whole process from the beginning, from downloading the CSVs to saving the databases in .xz format. Did I get it right? Is there any example where I can see how I should do this?
Yep, you got it right. A .py that does the data collection from the source is the way to go. We can generate the datasets on our side by running and testing your script, but you can also provide a link as you did. Regarding examples, as I just told @Lrodlima in the comment above yours: we have some examples of data collection in research/src/ here, and mainly in the toolbox.
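A rough sketch of the two ends of such a script, using only the standard library: fetching a source file and saving the result as an LZMA-compressed CSV (.xz). The function names and the sample row are hypothetical, and the URL is deliberately left to the caller; this is not taken from the scripts in research/src/.

```python
import csv
import lzma
import os
import tempfile
from urllib.request import urlretrieve


def download(url, target):
    """Fetch the source file; the URL must be supplied by the caller."""
    urlretrieve(url, target)
    return target


def save_xz(rows, fieldnames, path):
    """Write rows (dicts) as a CSV compressed with LZMA (.xz)."""
    with lzma.open(path, "wt", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def load_xz(path):
    """Read the compressed CSV back into a list of dicts."""
    with lzma.open(path, "rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


# Round trip with made-up data, skipping the download step.
rows = [{"candidate_name": "Fulano", "revenue_value": "100.00"}]
path = os.path.join(tempfile.mkdtemp(), "donations.xz")
save_xz(rows, ["candidate_name", "revenue_value"], path)
restored = load_xz(path)
```

Reading the file back right after writing it is a cheap way to check that the compressed output is still a valid CSV.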
I need to translate all columns to English, right? The problem is that there are technical words in many of them and I am not sure I would be able to translate them. Is it possible for anyone here to help with this translation?
Sure thing, just share the word/expressions and we help you out.
Is it ok if I keep these 3 files separated? Even though they are about donations, these donations go to different types of destinations (candidates, parties and committees).
I think it makes sense. Is the structure of the CSVs different (I mean, do they have different columns)?
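One way to answer that question is to compare the header line of each file. A small sketch, assuming `;` as the delimiter; the three headers below are invented for the example and do not describe the real files.

```python
import csv
import io


def header(csv_text, delimiter=";"):
    """Return the column names found in the first line of a CSV."""
    return next(csv.reader(io.StringIO(csv_text), delimiter=delimiter))


def compare_headers(headers):
    """Split columns into those shared by every file and those that are not."""
    sets = {name: set(cols) for name, cols in headers.items()}
    shared = set.intersection(*sets.values())
    extra = {name: sorted(cols - shared) for name, cols in sets.items()}
    return sorted(shared), extra


# Hypothetical headers, only to exercise the comparison.
headers = {
    "candidates": header("UF;Valor receita;CPF do candidato\n"),
    "parties": header("UF;Valor receita;Sigla Partido\n"),
    "committees": header("UF;Valor receita;Sigla Partido\n"),
}
shared, extra = compare_headers(headers)
```

If `extra` comes back empty for every file, the three datasets could share one schema; otherwise keeping them separate avoids mostly empty columns.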
Hey guys,
Thanks @cuducos for the reply. I will keep working on the script to make it the way it must be.
Regarding the column names, here they are:
Cargo
CNPJ Prestador Conta
Cod setor econômico do doador
Cód. Eleição
CPF do candidato
CPF do vice/suplente
CPF/CNPJ do doador
CPF/CNPJ do doador originário
Data da receita
Data e hora
Desc. Eleição
Descrição da receita
Entrega em conjunto?
Especie recurso
Fonte recurso
Municipio
Nome candidato
Nome da UE
Nome do doador
Nome do doador (Receita Federal)
Nome do doador originário
Nome do doador originário (Receita Federal)
Numero candidato
Número candidato doador
Numero do documento
Número partido doador
Numero Recibo Eleitoral
Numero UE
Sequencial Candidato
Setor econômico do doador
Setor econômico do doador originário
Sigla da UE
Sigla Partido
Sigla UE doador
Tipo doador originário
Tipo receita
UF
Valor receita
I would really appreciate it if you guys could translate them. Many of them are from a quite specific field and I do not feel confident enough to translate them by myself.
Regarding the files I linked, please do not use them. I just realized that there are a few duplicate columns (ok, you can use them, but you should be aware of this issue). Sorry about that. Anyway, I will release the script as soon as it is ready to be used and the generated data is correct.
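A small sketch of how duplicate columns could be spotted and dropped before saving, using only the standard library. The helper names and the sample header are made up, and keeping the first occurrence is an assumption that only holds if the repeated columns carry the same values.

```python
from collections import Counter


def duplicated_columns(header):
    """Return the column names that appear more than once, in first-seen order."""
    counts = Counter(header)
    seen = []
    for name in header:
        if counts[name] > 1 and name not in seen:
            seen.append(name)
    return seen


def drop_duplicated(header, row):
    """Keep only the first occurrence of each column in a header/row pair."""
    kept = {}
    for name, value in zip(header, row):
        kept.setdefault(name, value)
    return kept


# Made-up header and row with one repeated column.
header = ["UF", "Valor receita", "UF"]
row = ["PE", "100,00", "PE"]
dupes = duplicated_columns(header)
clean = drop_duplicated(header, row)
```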
Hey guys, gave it my best shot. Counting on others to revise my work.
Cargo = position/office
CNPJ Prestador Conta = ? (CNPJ is a company's ID) I believe it's the party's ID in case the donation was received from another party (more on that in CPF/CNPJ do doador originário)
Cod setor econômico do doador = Donor's economic sector code (I believe it's a classification)
Cód. Eleição =
CPF do candidato = Candidate's CPF (equivalent to a social security number)
CPF do vice/suplente = deputy's/substitute's CPF
CPF/CNPJ do doador = Donor's CPF or CNPJ (CNPJ is a company's ID)
CPF/CNPJ do doador originário = Original donor's CPF or CNPJ (in case the donation was received from another party or candidate, it's necessary to disclose the original donor who donated that money to them)
Data da receita = Revenue date
Data e hora = date and time
Desc. Eleição = Election description?
Descrição da receita = Revenue description
Entrega em conjunto? = ? Whether or not it was a joint delivery (not sure what it means)
Especie recurso = The kind of revenue
Fonte recurso = Source of the revenue
Municipio = Municipality
Nome candidato = Candidate's name
Nome da UE = Federation Unit's name (State's name)
Nome do doador = Donor's name
Nome do doador (Receita Federal) = ? Donor's name in the Brazilian IRS (not sure about this one)
Nome do doador originário = Original donor's name
Nome do doador originário (Receita Federal) = ? Original donor's name in the Brazilian IRS (not sure about this one)
Numero candidato = Candidate's number
Número candidato doador = Donating candidate's number (I believe it's for the case when the money was donated by another candidate)
Numero do documento = Document's number
Número partido doador = Donor party's number (I believe it's for the case when the money was donated by another party)
Numero Recibo Eleitoral = Electoral receipt number
Numero UE = ? (not sure) maybe Coalition number?? just a guess
Sequencial Candidato = (not sure) Candidate's sequential?
Setor econômico do doador = Donor's economic sector
Setor econômico do doador originário = Original donor's economic sector
Sigla da UE = maybe Coalition's acronym?? just a guess
Sigla Partido = Party's acronym
Sigla UE doador = maybe Donating coalition's acronym?? just a guess
Tipo doador originário = Original donor's type
Tipo receita = Type of revenue
UF = Federation Unit (State)
Valor receita = Revenue value
If anyone can review this, especially the parts where I had doubts, I'd appreciate it. Hope it helps.
Many many thanks @lacerdamarcelo and @GustavoSFCoelho! I'm going to build on Gustavo's awesome suggestions and format them in snake case, as we use in Pandas and CSVs:
pt-BR | en-US |
---|---|
Cargo | post |
CNPJ Prestador Conta | accountable_company_id |
Cod setor econômico do doador | donor_economic_sector_id |
Cód. Eleição | election_id |
CPF do candidato | candidate_cpf |
CPF do vice/suplente | substitute_cpf |
CPF/CNPJ do doador | donor_cnpj_or_cpf |
CPF/CNPJ do doador originário | original_donor_cnpj_or_cpf |
Data da receita | revenue_date |
Data e hora | date_and_time |
Desc. Eleição | election_description |
Descrição da receita | revenue_description |
Entrega em conjunto? | batch |
Especie recurso | type_of_revenue |
Fonte recurso | source_of_revenue |
Municipio | city |
Nome candidato | candidate_name |
Nome da UE | electoral_unit_name |
Nome do doador | donor_name |
Nome do doador (Receita Federal) | donor_name_for_federal_revenue |
Nome do doador originário | original_donor_name |
Nome do doador originário (Receita Federal) | original_donor_name_for_federal_revenue |
Numero candidato | candidate_number |
Número candidato doador | donor_candidate_number |
Numero do documento | document_number |
Número partido doador | donor_party_number |
Numero Recibo Eleitoral | electoral_receipt_number |
Numero UE | electoral_unit_number |
Sequencial Candidato | candidate_sequence |
Setor econômico do doador | donor_economic_sector |
Setor econômico do doador originário | original_donor_economic_sector |
Sigla da UE | electoral_unit_abbreviation |
Sigla Partido | party_acronym |
Sigla UE doador | donor_electoral_unit_abbreviation |
Tipo doador originário | original_donor_type |
Tipo receita | revenue_type |
UF | state |
Valor receita | revenue_value |
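A minimal sketch of how the table above could be applied in a script. Only a handful of the pairs are copied into `COLUMNS`, the sample row is made up, and the `translate` helper is hypothetical.

```python
# A few pt-BR -> en-US pairs copied from the table above.
COLUMNS = {
    "Cargo": "post",
    "CPF do candidato": "candidate_cpf",
    "Nome candidato": "candidate_name",
    "Sigla Partido": "party_acronym",
    "UF": "state",
    "Valor receita": "revenue_value",
}


def translate(row, columns=COLUMNS):
    """Rename the keys of a row; unknown columns keep their original name."""
    return {columns.get(name, name): value for name, value in row.items()}


# Made-up row using the original pt-BR column names.
row = {"Nome candidato": "Fulano", "UF": "PE", "Valor receita": "100,00"}
translated = translate(row)
```

With Pandas, the same dictionary could be passed to `DataFrame.rename(columns=COLUMNS)`.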
@cuducos you missed an underscore here 👉 original donor_name_for_federal_revenue
@cuducos there is a 'c' missing in original_donor_cnpj_or_pf -> original_donor_cnpj_or_cpf
Thanks @pedrommone and @GustavoSFCoelho — edited/fixed ; )
Wow! Thanks a lot guys! I'm still working on the script to make it fully automated. Probably by this weekend I'll finish and create the PR.
Hi guys, I've been working on the donations; in particular, I have some code here (one piece using Benford's law and another using complex networks): https://github.com/felipeeeantunes/DataScienceBR (resources for two papers, one accepted in Physica A and the other already submitted).
An already munged data set can be found here: https://www.kaggle.com/felipeleiteantunes/electoral-donations-brazil2014
I'll follow this discussion and contribute. In particular, I developed Python classes to access the donations, that I will publish soon (next weekend).
Hey guys,
I'm almost done with the script. I would like to know where in the repository should the code be (is there any specific place for these scripts?) and where should I save the .xz files.
I would like to share again the current code here just for you guys to check if there are things that I should fix (to follow some standard, maybe).
Thanks!
Hi @lacerdamarcelo : )
I would like to know where in the repository should the code be (is there any specific place for these scripts?) and where should I save the .xz files.
The contribution guide has a section explaining the purpose of each directory. I think it's a good idea for you to review it in case of doubt: there we explain why data collection scripts go in research/src/ and why we don't commit the datasets themselves.
I would like to share again the current code here just for you guys to check if there are things that I should fix (to follow some standard, maybe).
I would encourage you to open a PR, even if you want to name it a WIP (work in progress) PR. It's way easier to review code and give feedback using the PR features here on GitHub ; )
Hey @cuducos
Here is the PR: https://github.com/datasciencebr/serenata-de-amor/pull/290
In the description I listed what I still need to do to finish the code. Waiting for feedback about the project's standards.
Guys, I've updated the code and tried to improve the readability as much as I could. More comments in the PR. Waiting for new feedback!
I would like to ask you guys for two new translations, which I somehow skipped when I first asked for help:
Sequencial Diretorio
Tipo diretorio
P.S.: thanks for the latter feedback, I've learned new things =)
P.S.: thanks for the latter feedback, I've learned new things =)
❤️
Sequencial Diretorio
Tipo diretorio
Do you know what diretório means in this context? In some contexts it could be translated as union (Student Union is sometimes an acceptable translation for Diretório Acadêmico), or committee, or board. I'm not sure. A bit of context might clarify the issue ; )
Honestly, I have no clue =( Could anyone else clarify this issue?
I will try to find someone able to help us...
Yep, party board makes sense ; ) Many thanks!
Translation included =)
Hey,
I don't know if it's too late now, but we (@rafapolo, @belisards, @turicas) have been trying to standardize the campaign finance data. We created a bash script that downloads, cleans and uploads all the CSVs from TSE (our "Electoral Court") into a single SQL database. We have all the data from all campaigns since 2002, and it's possible to import "semi-official" data from 1994 and 1998 too.
There is still some work to do to optimize the data and queries, but maybe it could be useful for you: http://github.com/rafapolo/tribuna
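For readers who want to try the "single SQL database" approach from Python, here is a rough sketch. It is not the tribuna script (which is bash); using SQLite, the `donations` table layout and the sample rows are all assumptions made for illustration.

```python
import sqlite3


def load(rows):
    """Create an in-memory database and insert donation rows into it."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE donations (year INTEGER, donor_name TEXT, revenue_value REAL)"
    )
    connection.executemany(
        "INSERT INTO donations VALUES (:year, :donor_name, :revenue_value)", rows
    )
    connection.commit()
    return connection


# Made-up rows, only to exercise the query below.
rows = [
    {"year": 2010, "donor_name": "ACME", "revenue_value": 100.0},
    {"year": 2014, "donor_name": "ACME", "revenue_value": 50.0},
    {"year": 2014, "donor_name": "Foo Ltda", "revenue_value": 25.0},
]
connection = load(rows)
totals = connection.execute(
    "SELECT donor_name, SUM(revenue_value) FROM donations "
    "GROUP BY donor_name ORDER BY donor_name"
).fetchall()
```

Once every election lives in one table, questions such as "how much did each donor give across all years" become a single query.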
Collect data about electoral campaign donors in order to allow analyses like this one (pt-br): politicians spending money at companies that are campaign donors.