okfn-brasil / serenata-de-amor

🕵 Artificial Intelligence for social control of public administration | **This repository does not receive frequent updates. Check out the README**
https://serenata.ai/en
MIT License

Documents metadata #229

Open cuducos opened 7 years ago

cuducos commented 7 years ago

Following the lead of a Brazilian podcast, it looks like some lobbyists are directly writing laws in Brazil: there are traces in document metadata suggesting that the Chamber receives some files from lobbyists and simply puts them forward.

Not sure how much this can help with civic control of the national treasury, but a dataset of the people listed in metadata could be an interesting lead on misuse of the public sphere.

rodolfo-viana commented 7 years ago

According to The Intercept Brasil, they discovered who was behind some amendments by checking the files' metadata:

To arrive at the 292 amendments drafted by business associations, The Intercept Brasil examined all those filed up to the end of March, that is, before Rogério Marinho's report was presented. Inside the PDF files containing each amendment's text and its technical justification, there is metadata indicating the file's original “author”, identifying the owner of the computer on which it was written.

I emailed one of the reporters a week ago to get more information about this process, but have had no response so far.

Irio commented 7 years ago

I also sent an email to Breno Costa after listening to the podcast. 🥇

Not sure if this can help us directly with the receipts, but it may help with other analyses that leverage files generated by congresspeople and their advisors. Just in case, I ran exiftool on a few receipts I randomly downloaded. At first glance, I don't see anything useful in this metadata.

irio@Irios-MacBook-Pro ~> exiftool -a -G1 ~/Downloads/6204370.pdf
[ExifTool]      ExifTool Version Number         : 10.50
[System]        File Name                       : 6204370.pdf
[System]        Directory                       : /Users/irio/Downloads
[System]        File Size                       : 156 kB
[System]        File Modification Date/Time     : 2017:05:09 10:34:57+02:00
[System]        File Access Date/Time           : 2017:05:09 10:35:04+02:00
[System]        File Inode Change Date/Time     : 2017:05:09 10:34:57+02:00
[System]        File Permissions                : rw-r--r--
[File]          File Type                       : PDF
[File]          File Type Extension             : pdf
[File]          MIME Type                       : application/pdf
[PDF]           PDF Version                     : 1.3
[PDF]           Linearized                      : No
[PDF]           Creator                         : XnView
[PDF]           Title                           : Alimentacao_VilladaGastronomia_091283.pdf
[PDF]           Create Date                     : 2017:02:09 15:00:46
[PDF]           Modify Date                     : 2017:02:09 15:00:46
[PDF]           Producer                        : XnView, http://www.xnview.com
[PDF]           Page Count                      : 1
irio@Irios-MacBook-Pro ~> exiftool -a -G1 ~/Downloads/6250711.pdf
[ExifTool]      ExifTool Version Number         : 10.50
[ExifTool]      Warning                         : Kids object (7 0 obj) not found at 58569
[System]        File Name                       : 6250711.pdf
[System]        Directory                       : /Users/irio/Downloads
[System]        File Size                       : 58 kB
[System]        File Modification Date/Time     : 2017:05:09 10:38:32+02:00
[System]        File Access Date/Time           : 2017:05:09 10:38:32+02:00
[System]        File Inode Change Date/Time     : 2017:05:09 10:38:33+02:00
[System]        File Permissions                : rw-r--r--
[File]          File Type                       : PDF
[File]          File Type Extension             : pdf
[File]          MIME Type                       : application/pdf
[PDF]           PDF Version                     : 1.7
[PDF]           Linearized                      : No
[PDF]           Page Count                      : 1
[PDF]           Title                           :
[PDF]           Author                          :
[PDF]           Subject                         :
[PDF]           Creator                         : Scan Assistant
[PDF]           Create Date                     : 2017:04:05 12:05:58Z
[PDF]           Modify Date                     : 2017:04:05 12:05:58Z
[PDF]           Producer                        : secPdfProducer
irio@Irios-MacBook-Pro ~> exiftool -a -G1 ~/Downloads/6204592.pdf
[ExifTool]      ExifTool Version Number         : 10.50
[System]        File Name                       : 6204592.pdf
[System]        Directory                       : /Users/irio/Downloads
[System]        File Size                       : 168 kB
[System]        File Modification Date/Time     : 2017:05:09 10:39:23+02:00
[System]        File Access Date/Time           : 2017:05:09 10:39:23+02:00
[System]        File Inode Change Date/Time     : 2017:05:09 10:39:23+02:00
[System]        File Permissions                : rw-r--r--
[File]          File Type                       : PDF
[File]          File Type Extension             : pdf
[File]          MIME Type                       : application/pdf
[PDF]           PDF Version                     : 1.3
[PDF]           Linearized                      : No
[PDF]           Creator                         : XnView
[PDF]           Title                           : Alimentacao_Senac_2424.pdf
[PDF]           Create Date                     : 2017:02:09 20:03:29
[PDF]           Modify Date                     : 2017:02:09 20:03:29
[PDF]           Producer                        : XnView, http://www.xnview.com
[PDF]           Page Count                      : 2
irio@Irios-MacBook-Pro ~>

fcevado commented 7 years ago

@Irio afaik, they used the Author metadata (if the file was created with a licensed application, the author is usually the owner of the license). Whether this info is present depends on how the file was created.
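A minimal sketch of that idea in Python, assuming the metadata has already been dumped with exiftool in its default plain-text format (with or without the `[Group]` prefix that `-G1` adds). The function names `parse_exiftool_output` and `authorship_fields` are hypothetical, and the sample text uses placeholder values rather than a real document.

```python
import re

# Matches the optional "[Group]" prefix printed when exiftool runs with -G1.
_GROUP_PREFIX = re.compile(r"^\[[^\]]+\]\s*")


def parse_exiftool_output(text):
    """Parse exiftool's plain-text "Tag : Value" lines into a dict.

    If a tag appears more than once, the last occurrence wins.
    """
    tags = {}
    for line in text.splitlines():
        line = _GROUP_PREFIX.sub("", line)
        # Split on the first colon only: values such as dates contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Tag : Value" line
        tags[key.strip()] = value.strip()
    return tags


def authorship_fields(tags):
    """Keep only the non-empty fields that hint at who produced the file."""
    wanted = ("Author", "Creator", "Producer")
    return {name: tags[name] for name in wanted if tags.get(name)}


# Placeholder sample, shaped like exiftool output.
sample = """\
File Name                       : example.pdf
[PDF]           Author                          : Jane Doe
[PDF]           Creator                         : Some Word Processor
[PDF]           Title                           :
Create Date                     : 2016:12:07 13:08:53-02:00
"""

print(authorship_fields(parse_exiftool_output(sample)))
```

Files produced by scanners or image viewers typically leave Author empty, so the sketch simply drops empty fields instead of treating them as evidence.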

fcevado commented 7 years ago

just an example:

ExifTool Version Number         : 10.51
File Name                       : PEC 287-2016.pdf
Directory                       : .
File Size                       : 618 kB
File Modification Date/Time     : 2017:05:09 20:10:24-03:00
File Access Date/Time           : 2017:05:09 20:10:24-03:00
File Inode Change Date/Time     : 2017:05:09 20:10:31-03:00
File Permissions                : rw-r--r--
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.5
Linearized                      : No
Page Count                      : 27
Language                        : pt-BR
Tagged PDF                      : Yes
Author                          : Felipe Memolo Portela
Creator                         : Microsoft® Word 2010
Create Date                     : 2016:12:07 13:08:53-02:00
Modify Date                     : 2016:12:07 13:08:53-02:00
Producer                        : Microsoft® Word 2010

cuducos commented 7 years ago

Yep, I think the usefulness comes from picking the right examples. I hadn't thought of this as something useful for the receipts (as @Irio tried), but for laws, amendments, etc. (as @fcevado tried).

fcevado commented 7 years ago

Scraping the files is very easy: the page that lists the files and the person who submitted them uses incremental ids. After listening to the same podcast, I was thinking of doing something like that for all valid ids as a personal project. The biggest problem is tracking down info about the people who authored the files. LinkedIn only lets you query its API when logged in as a user. I didn't research how open other social networking APIs are.
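A rough sketch of such an incremental-id crawl, for illustration only. The URL template, the `fetch` callable and the function name `crawl_ids` are assumptions, not the Chamber's actual endpoints; the pause between requests is there to keep the crawl slow and polite.

```python
import time

# Hypothetical URL template; the real page address is not given in this thread.
URL_TEMPLATE = "https://example.org/documents?id={doc_id}"


def crawl_ids(first_id, last_id, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Yield (doc_id, content) for every id in the range that `fetch` can retrieve.

    `fetch` takes a URL and returns the content, or None when the id is not valid.
    """
    for doc_id in range(first_id, last_id + 1):
        content = fetch(URL_TEMPLATE.format(doc_id=doc_id))
        if content is not None:
            yield doc_id, content
        sleep(delay_seconds)  # throttle between requests


# Stand-in fetcher so the sketch runs without network access.
def fake_fetch(url):
    return None if url.endswith("id=2") else f"contents of {url}"


results = list(crawl_ids(1, 3, fake_fetch, delay_seconds=0, sleep=lambda s: None))
print([doc_id for doc_id, _ in results])
```

Passing `fetch` in as a parameter keeps the id loop separate from the HTTP details, so the same loop can be tried offline before pointing it at a real site.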

cuducos commented 7 years ago

The biggest problem is tracking down info about the people who authored the files. LinkedIn only lets you query its API when logged in as a user. I didn't research how open other social networking APIs are.

This kind of scraping is indeed difficult (LinkedIn, Facebook, etc.), but I wouldn't let this discourage data collection. We can take other routes when it comes to analysis (e.g. graph databases to see a network of people).
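To make the network idea concrete, here is a small in-memory sketch rather than a real graph database. The record shape (pairs of submitter and metadata author) and the names `build_author_network` and `shared_authors` are assumptions made for illustration, and the data is made up.

```python
from collections import defaultdict


def build_author_network(records):
    """Map each metadata author to the set of people who submitted their files.

    `records` is an iterable of (submitter, metadata_author) pairs.
    """
    network = defaultdict(set)
    for submitter, author in records:
        # Skip files with no Author field and files authored by the submitter.
        if author and author != submitter:
            network[author].add(submitter)
    return network


def shared_authors(network, min_submitters=2):
    """Return authors whose files were submitted by at least `min_submitters` people."""
    return sorted(a for a, subs in network.items() if len(subs) >= min_submitters)


# Made-up records, for illustration only.
records = [
    ("Deputy A", "Person X"),
    ("Deputy B", "Person X"),
    ("Deputy B", "Person Y"),
    ("Deputy C", ""),  # no Author field in the metadata
]

network = build_author_network(records)
print(shared_authors(network))
```

An author name that shows up behind files submitted by several different people is the kind of lead that can be followed up without any social network scraping.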

fcevado commented 7 years ago

I was thinking of creating a fake profile on LinkedIn and searching for the needed info with it. I also thought about using Lattes as a source (it usually has info about where people worked), along with the list of advisors.

cuducos commented 7 years ago

I was thinking of creating a fake profile on LinkedIn

They'll block in a second, she says…

rodolfo-viana commented 7 years ago

They'll block in a second, she says…

I've got a friend working there. Perhaps he may help.

fcevado commented 7 years ago

They'll block in a second, she says…

That is one of the problems I was expecting. I think even the congress site will block an IP making this many requests... Well, until we figure out how to relate the authors' data to other sources, I can create a CSV with the useful metadata.
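A sketch of what that CSV step might look like, assuming each document's metadata is already available as a dict keyed by tag name. The column list in `FIELDS` and the function name `write_metadata_csv` are hypothetical choices, and the row is placeholder data.

```python
import csv
import io

# Hypothetical choice of columns; adjust to whatever tags turn out to be useful.
FIELDS = ["document_id", "Author", "Creator", "Producer", "Create Date"]


def write_metadata_csv(rows, out):
    """Write one CSV row per document, keeping only the columns in FIELDS.

    Tags outside FIELDS are ignored; missing tags are written as empty cells.
    """
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore", restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Placeholder row, for illustration only.
rows = [
    {
        "document_id": 1,
        "Author": "Jane Doe",
        "Creator": "Some Word Processor",
        "File Size": "618 kB",  # not in FIELDS, so it is dropped
    },
]

buffer = io.StringIO()
write_metadata_csv(rows, buffer)
print(buffer.getvalue())
```

Writing to any file-like object means the same function works with `open("metadata.csv", "w", newline="")` as well as with the in-memory buffer used here.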

fcevado commented 7 years ago

Just to document it here, a useful link: http://www2.camara.leg.br/transparencia/dados-abertos/dados-abertos-legislativo/webservices/proposicoes-1/obterproposicaoporid

silviodc commented 7 years ago

After listening to the audio, I noticed that some deputies propose an amendment and then vote against it, which is very suspicious and contradictory.

So, what do you think about producing some data around this point?

Could we find some interesting insights using it?
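As a starting point, the check could be sketched like this, assuming we had one table of amendment authors and one of roll-call votes. Both data shapes, the vote labels and the function name `votes_against_own_amendments` are hypothetical, and the data is made up.

```python
def votes_against_own_amendments(amendment_authors, votes):
    """Return sorted (deputy, amendment_id) pairs where the author voted "no".

    amendment_authors: dict mapping amendment_id -> deputy who proposed it
    votes: iterable of (deputy, amendment_id, vote) tuples
    """
    flagged = set()
    for deputy, amendment_id, vote in votes:
        if vote == "no" and amendment_authors.get(amendment_id) == deputy:
            flagged.add((deputy, amendment_id))
    return sorted(flagged)


# Made-up data, for illustration only.
amendment_authors = {101: "Deputy A", 102: "Deputy B"}
votes = [
    ("Deputy A", 101, "no"),   # voted against own amendment
    ("Deputy B", 101, "no"),   # voted against someone else's amendment
    ("Deputy B", 102, "yes"),  # voted for own amendment
]

print(votes_against_own_amendments(amendment_authors, votes))
```

A flagged pair is only a lead to look into: the sketch says nothing about why the vote was cast.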