Closed: @Irio closed this issue 7 years ago
Hi @anapaulagomes, the short answer is we just need to be creative hahaha…
The long answer is that we have talked about some possibilities: some info is available in the Federal Revenue (search for a CNPJ, then click on “Consulta QSA / Capital Social” or something like “Certidão de Baixa de Inscrição” if the company is inactive).
Unfortunately, this is behind a CAPTCHA. We have been in touch with people trying to code a workaround, with a ~10% success rate, if that helps. The juntas comerciais also have this information, but they are state-level bodies, so their APIs might differ considerably.
We can try to scrape LinkedIn or Facebook for some data, but that might be difficult (no CNPJ to match, different names, different job titles, outdated and unofficial info, etc.).
And there are also alternative sites to look up CNPJ info, but I'm not sure whether they offer any info on the partners.
Sure @anapaulagomes.
For main info about companies, we've been using ReceitaWS; it's fairly reliable, but as you can see in this example, it does not include the partner list:
http://receitaws.com.br/v1/cnpj/02703510000150
{
"atividade_principal": [{
"text": "Restaurantes e similares",
"code": "56.11-2-01"
}],
"data_situacao": "14/12/2002",
"tipo": "MATRIZ",
"nome": "FRANCISCO RESTAURANTE LTDA - EPP",
"telefone": "(61) 3226-2626",
"situacao": "ATIVA",
"bairro": "ASA SUL",
"logradouro": "Q SHC/SUL CL QUADRA 402 BLOCO B LOJA 05, 09, 15",
"numero": "S/N",
"cep": "70.237-500",
"municipio": "BRASILIA",
"uf": "DF",
"abertura": "27/06/1988",
"natureza_juridica": "206-2 - SOCIEDADE EMPRESARIA LIMITADA",
"cnpj": "02.703.510/0001-50",
"ultima_atualizacao": "2016-08-24T16:58:50.057Z",
"status": "OK",
"fantasia": "",
"complemento": "",
"email": "",
"efr": "",
"motivo_situacao": "",
"situacao_especial": "",
"data_situacao_especial": "",
"atividades_secundarias": [{
"code": "00.00-0-00",
"text": "Não informada"
}]
}
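For reference, a minimal sketch of querying that endpoint in Python (the helper names here are illustrative, not part of the project's scripts):

```python
import json
import re
from urllib.request import urlopen

RECEITAWS_URL = "http://receitaws.com.br/v1/cnpj/{}"

def normalize_cnpj(cnpj):
    """Strip punctuation from a CNPJ, e.g. '02.703.510/0001-50' -> '02703510000150'."""
    return re.sub(r"\D", "", cnpj)

def fetch_company(cnpj):
    """Query ReceitaWS for a company's registration info and return the parsed JSON."""
    with urlopen(RECEITAWS_URL.format(normalize_cnpj(cnpj))) as response:
        return json.load(response)
```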
Here's a step-by-step to get them (as a user) from the Federal Revenue's website, probably the best of the official sources:
As mentioned by @cuducos, we know of people breaking it using Tesseract (OCR), but they frequently get blocked by the Federal Revenue's servers, given the low accuracy of ~10%. Another way I can think of to break it is using machine learning; computer vision is one of the most researched areas in deep learning nowadays, e.g. https://deepmlblog.wordpress.com/2016/01/03/how-to-break-a-captcha-system/ (there's a paper at the end)
Hi folks, first of all, I still don't get why we are talking in English...
The main idea is to find the biggest amounts: they will be from print bureaus, video producers, and advertising companies. Restaurants or brothels would yield just tips and cents, and we would not discover anything interesting.
So my suggestion is first to discover who the big suppliers are and which activities they perform. After that, try to discover the partners.
An interesting buzz post: https://www.facebook.com/teofb/posts/1186854511378646
He did it by hand, without any programming ;)
If we use voice recognition on the audio CAPTCHA instead of OCR on the image, wouldn't it be easier and more accurate to recognize?
It's just an idea, I don't know which is better.
Hi @josircg,
Welcome to Serenata de Amor. I'll try to address all your points, but let me know if I forget any of them, ok?
Hi folks, first of all, I still don't get why we are talking in English...
It's at the bottom of our README.md, the homepage of the project here at GitHub: "A conversa sobre o projeto acontece em um grupo do Telegram — tudo em inglês, já que temos contribuidores de outros países e queremos contribuir com outros países também." (The conversation about the project happens in a Telegram group, all in English, since we have contributors from other countries and want to contribute to other countries too.)
Does that make sense?
The main idea is to discover the biggest amounts […]
To keep it short: the main idea is to use computing power to find more cases than humans doing it manually would be able to find. That's the purpose of the project. Surely big cases are eye-catching, but we start from the assumption that corruption starts small — so there is important value in focusing on the small cases too.
He did it by hand, without any programming ;)
This post is amazing, and so is OPS: doing it all manually, they denounced and succeeded in cases summing to more than R$5 million. We do not compete with or replace these examples. We are inspired by them and try to expand their investigative power ; )
@Lrcezimbra's idea looks amazing! Is there any project/script using this strategy to break CAPTCHA?
Good inputs! I'd like to work on it (I can't assign it to myself). I was looking for different sources and found this interesting tool called Câmara Transparente, developed by FGV. I'll take a look at other sources and keep you updated.
Not sure if we can get the data we want for this, but I just found the site http://www.consultasocio.com/ which lists the companies in which someone is a partner, as well as the other partners in those companies, for example http://www.consultasocio.com/q/sa/abel-salvador-mesquita-junior.
We can try to scrape that data starting with the politicians' names, and create a database of which companies belong to each deputy, who else is a partner in those companies, and which companies belong to those partners.
If this would help, I can assign the issue to myself and write the scraper.
@urubatan I'm starting to work on this issue, but you can certainly help! :) Maybe you can write the scraper for this website and I'll work on the other one.
@Irio we have about 1 million companies with information about partners (QSA) in our database. Such data is not accessible via an API, but if you provide us with a list of names, we will be happy to run a search for it. All information was obtained from the Receita Federal website and is public. Please note that the QSA information does not include the CPF number, so searches are based on the name and are therefore subject to namesakes.
@urubatan That sounds very promising.
@tomascamargo Would it be possible to query 433 names? 👀 These are all the unique congresspeople listed since the creation of CEAP.
@Irio great, my Python is not great (I'm planning to help the project to learn Python and data science :P), but I'll write a scraper for that and send a pull request.
@Irio yes, we can try this. Can you please provide us the names?
Hey guys, I was able to get the information from the main site without needing the CAPTCHA; I'm finishing a PoC here and will send it to you.
@mtrovo have you published the results?
I published https://github.com/awerlang/cnpj-rfb, a tool to fetch a company's partner list (names only). There's a manual step requiring you to visit the RFB website, then the rest is automated. I found that some companies break the process, and then you'll need to repeat it. It would be best to filter these companies out ("baixada", "natureza jurídica inválida", some "S.A.", "filial"). I guess this is our best shot atm. If anyone has a list of all the companies we need to query, let's run it through this tool.
This is a list with 8,417 CNPJs and CPFs found in expenses from 2016 up to now. I'm also attaching ~80 CNPJs I was able to fetch with the tool I announced the other day. Currently I'm experiencing a connection reset every ~40 operations (ECONNRESET, and I'm not able to access the RFB web page for some time), which would mean about 200 runs, i.e. several days. And we would want to run it on the 2015 data too.
I also updated the tool to ignore CPFs, and CNPJ entries with certain conditions that would lead to errors.
Help is most appreciated! How can we unblock HTTP access in this situation?
https://gist.github.com/awerlang/3a8b3f286a0bcceb2ae367ad2e09af21
Summarizing this topic
I'm sorry if my statements look like I'm pointing at failures, but I want to highlight what the pain points really are.
@tomascamargo has a great private database. If we ask you for a query for a long list (60k) of CNPJs, would you generate a simple dump for us (cnpj, partner) — in case of more than one partner, one cnpj could be repeated in subsequent rows? If so, I'll export a list of CNPJs today.
@awerlang Your solution looks great — it's the best we have so far, but restarting it, plus latency due to probable server-side blocks, plus manually copying and pasting the session ID is still an issue. Maybe the barrier is as great as breaking the CAPTCHA (10% success rate and successive blocks). If @tomascamargo can bootstrap this dataset with a query in his system, we can use your solution to update the dataset when we get new data from the Lower House.
@urubatan Consulta Sócio is useful for getting companies registered in politicians' names, but it would also be very interesting to have the full list of partners (to get companies held by politicians' relatives, as in #107, for example). That's why I'm still writing in this issue ; )
I'm willing to put some effort into this issue to create this dataset. I'm glad for all the discussion, references and opportunities. Let's put the pieces together to make it work ; )
Fresh news about Consulta Sócio:
If we get this database, it should be obfuscated somehow, or it can be a trojan horse against the whole project.
@cuducos I see you guys are having some trouble breaking the CAPTCHA. There are services that use humans to solve the CAPTCHA. You can see some examples here: https://www.troyhunt.com/breaking-captcha-with-automated-humans/.
@pedrommone even if you break the CAPTCHA (which is a non-issue), you'll get blocked by RFB servers after a few requests.
@awerlang so the main problem is the request throttling? If it is, using Tor as a proxy makes some sense here.
Won't the proxy get blocked too?
@awerlang inside the Tor network, you can force every request to use a different IP. If you want more info, go here: https://deshmukhsuraj.wordpress.com/2015/03/08/anonymous-web-scraping-using-python-and-tor/ and here: http://dm295.blogspot.com.br/2016/02/tor-ip-changing-and-web-scraping.html
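As a sketch, routing Python HTTP traffic through Tor only takes a proxies mapping pointing at Tor's local SOCKS port (9050 is Tor's default; whether installing Tor fits the project's setup constraints is a separate question):

```python
# Assumes a local Tor daemon listening on its default SOCKS port (9050).
# The "socks5h" scheme makes DNS resolution also go through Tor.
TOR_SOCKS = "socks5h://127.0.0.1:9050"

def tor_proxies(socks_url=TOR_SOCKS):
    """Build a requests-style proxies mapping that tunnels both schemes through Tor."""
    return {"http": socks_url, "https": socks_url}

# Usage (requires the requests library and a running Tor daemon):
# requests.get("http://receitaws.com.br/v1/cnpj/...", proxies=tor_proxies())
```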
@pedrommone actually breaking the CAPTCHA is not the real issue (check #42). We're just missing a script that automatically uses one of the CAPTCHA-breaking solutions to collect all the info we need. The closest we got was @awerlang's solution (posted in this topic).
oh, thanks for the information, @cuducos.
@awerlang now, I guess, you can easily unblock HTTP requests with Tor :)
@pedrommone I won't be pursuing this route for the moment. Thanks for sharing this information though.
Guys, have you already checked ReceitaWS again?
Querying the same example that @Irio did above (http://receitaws.com.br/v1/cnpj/02703510000150), I can get the partner list:
{
"atividade_principal": [
{
"text": "Restaurantes e similares",
"code": "56.11-2-01"
}
],
"data_situacao": "14/12/2002",
"nome": "FRANCISCO RESTAURANTE LTDA - EPP",
"uf": "DF",
"telefone": "(61) 3226-2626",
"qsa": [
{
"qual": "22-Sócio",
"nome": "CINTHIA MAYUMI KITAHARA"
},
{
"qual": "49-Sócio-Administrador",
"nome": "GIULIANA ANSILIERO"
},
{
"qual": "22-Sócio",
"nome": "EDSON RICARDO MONTESCHIO NUNES"
},
{
"qual": "22-Sócio",
"nome": "BRUNA MONTESCHIO NUNES MARQUES"
}
],
"situacao": "ATIVA",
"bairro": "ASA SUL",
"logradouro": "Q SHC/SUL CL QUADRA 402 BLOCO B LOJA 05, 09, 15",
"numero": "S/N",
"cep": "70.237-500",
"municipio": "BRASILIA",
"abertura": "27/06/1988",
"natureza_juridica": "206-2 - Sociedade Empresária Limitada",
"cnpj": "02.703.510/0001-50",
"ultima_atualizacao": "2017-02-11T04:37:40.742Z",
"status": "OK",
"tipo": "MATRIZ",
"fantasia": "",
"complemento": "",
"email": "",
"efr": "",
"motivo_situacao": "",
"situacao_especial": "",
"data_situacao_especial": "",
"atividades_secundarias": [
{
"code": "00.00-0-00",
"text": "Não informada"
}
],
"capital_social": "0.00",
"extra": {}
}
Maybe they updated the service. =)
/cc @Irio @cuducos
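With the `qsa` field now present, pulling the partner list out of a response is straightforward; a sketch (`extract_partners` is an illustrative name, not a project function):

```python
def extract_partners(company):
    """Flatten the `qsa` field into (cnpj, partner_name, qualification) tuples."""
    cnpj = company.get("cnpj", "")
    return [(cnpj, partner["nome"], partner["qual"])
            for partner in company.get("qsa", [])]

# Trimmed-down version of the response above:
sample = {
    "cnpj": "02.703.510/0001-50",
    "qsa": [{"qual": "22-Sócio", "nome": "CINTHIA MAYUMI KITAHARA"},
            {"qual": "49-Sócio-Administrador", "nome": "GIULIANA ANSILIERO"}],
}
```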
That's awesome! And now this issue is about adapting the current scripts to save that info — which is way easier!
Thanks for the heads up, @marcusrehm!
Hey guys, is there anyone working on this issue?
AFAIK there isn't. @jtemporal and I took a look at the two scripts collecting data from ReceitaWS and felt that they could be refactored before collecting data again — but this is not our priority right now. Feel free to jump in, coding or discussing ; )
same thing here, AFAIK there isn't. Feel free to adopt it @marcusrehm ;) it would be much appreciated
Yes! I can get this one. Actually I was waiting for this one to be done so we could play with that data on neo4j... :smile:
Could you please point out which scripts are related to this issue? If you can, we can also discuss the refactoring, and then see what can be done along with the inclusion of the partner list.
/cc @jtemporal @cuducos
Two scripts basically: fetch_cnpj_info.py and clean_cnpj_info_dataset.py. My comments in favor of a massive refactor:
- The script is not so effective: it easily starts being blocked by ReceitaWS and we have to re-run it several times
- Just checked: try to better handle 429 Too Many Requests responses
- Maybe the cleaning process could be done in the parser logic, not in an external script
- And maybe move it to the toolbox
I think we can work on the data acquisition, coding the script to fetch the new data. After that we can work on the issue about requests; in the end, this one is needed in order to grab the data.
About the requests, do you know of or have you already tried using Tor? I was thinking about using it, and in case it isn't allowed on the client network, the script can fall back to batch processing, making requests in a specific time frame.
Then we can refactor the whole script, but we will already have the logic to get this running.
What do you think?
About the requests, do you know of or have you already tried using Tor?
TBH I've only used the Tor browser manually, that is to say, never integrated with a Python script. If this is possible and doesn't require a too-specific setup (so contributors can get started in the project easily), I have no concerns about it. Otherwise I think proper handling of HTTP 429 and maybe a semaphore controlling the amount and frequency of requests might be enough.
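A sketch of that 429 handling, assuming plain exponential backoff (the retry count and delays here are arbitrary choices, not project settings):

```python
import time
from urllib.error import HTTPError
from urllib.request import urlopen

def backoff_delays(retries, base=1.0):
    """Yield exponentially growing wait times: base, 2*base, 4*base, ..."""
    for attempt in range(retries):
        yield base * 2 ** attempt

def get_with_backoff(url, retries=5, base=1.0):
    """Fetch a URL, sleeping and retrying whenever the server answers HTTP 429."""
    for delay in backoff_delays(retries, base):
        try:
            with urlopen(url) as response:
                return response.read()
        except HTTPError as error:
            if error.code != 429:  # only throttling is retried
                raise
            time.sleep(delay)
    raise RuntimeError(f"still throttled after {retries} attempts: {url}")
```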
I think we can work on the data acquisition, coding the script to fetch the new data.
Sure thing. I think it's good to start with data collection, and once we can gather the information we're looking for, we can look at the script and figure out the best way to deal with request traffic ; )
Hey guys! I just got the script fetching and saving the partner list. It's available at my repo. It's the same script but saving the partner data; now I'm working on better handling the 429 error related to too many requests.
About the 429 error, I think Tor is not a good option, as people would need to install it apart from Serenata de Amor, and I think that's out of scope for the project (as @cuducos mentioned, it should be easy for a contributor to set up). So I started to work on handling the request errors, and I have some concerns about the script:
- It starts getting 429 too fast, and besides that I'm also getting [Errno 24] Too many open files after handling the 429.
- Instead of saving .pkl files and then putting them into a DataFrame, we could do it right after getting the data from ReceitaWS.

Do you have any concerns about the points above, or can I continue with this approach?
/cc @cuducos @jtemporal
Yaay, this is awesome! I believe you're about right with that approach. AFAIK the parallelism was used so we could generate the dataset faster; if we remove the parallelism for now and get it to work properly, we can think about parallelizing it again later. =)
Cool, @jtemporal! I'm gonna do it this way then.
And what about the file naming convention? Do you think that's ok too?
Good points, @marcusrehm — I agree with you and @jtemporal on everything you've said. About the naming convention: we could have just YYYY-MM-DD-companies.xz, I guess, addressing the partner names the same way we did with secondary_activity_XX[_code_]. Maybe the script wasn't versioning companies.xz, but we were doing it manually — it would be great to have it done automatically.
it would be great to have it done automatically
Automate all the things o/ 🎉
Maybe the script wasn't versioning companies.xz
That's about right! Versioning was happening when the dataset was being uploaded to S3.
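The dated naming convention (YYYY-MM-DD-companies.xz) is easy to automate; a minimal sketch (the helper name is made up):

```python
from datetime import date

def dataset_filename(name="companies", extension="xz", on=None):
    """Build a dated dataset name following the YYYY-MM-DD-companies.xz convention."""
    on = on or date.today()
    return f"{on:%Y-%m-%d}-{name}.{extension}"
```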
Hey folks!
Got a new version of the fetch_cnpj_info.py script here.
I brought the threading back, but with some improvements. Now we can pass as arguments how many threads to use and a list of HTTP proxies, so each request uses a randomly chosen proxy from the list, or none (using the local IP). The script still takes time, but fetches faster than the old version.
Also, for each batch of 100 requests it saves the cnpj-info.xz dataframe, so if we have some issue or it's interrupted abruptly, we don't lose the work and the script can restart from where it stopped.
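The checkpointing idea in its plainest form (file name and JSON format here are placeholders; the actual script saves an xz-compressed dataframe):

```python
import json
import os

def save_checkpoint(rows, path):
    """Persist the rows fetched so far, so an interrupted run can resume."""
    with open(path, "w") as fp:
        json.dump(rows, fp)

def load_checkpoint(path):
    """Return the rows from a previous run, or an empty list on a fresh start."""
    if not os.path.exists(path):
        return []
    with open(path) as fp:
        return json.load(fp)
```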
The script can be called the same way as it is now (so nothing will break because of that), or it can be called as python ./src/fetch_cnpj_info.py ./data/2016-11-19-current-year.xz -p 177.67.84.135:8080 177.67.82.80:8080 177.67.82.80:3128 179.185.54.114:8080 -t 20, where -p or --proxies receives the list of proxies and -t or --threads the number of threads.
For clean_cnpj_info_dataset.py, I applied the file naming convention when saving the companies.xz file and adjusted the partner list as @cuducos suggested here:
I guess, addressing the partner names the same way we did with secondary_activity_XX[code]. Maybe the script wasn't versioning companies.xz, but we were doing it manually — it would be great to have it done automatically.
My only concern is: wouldn't it be easier to analyze the data if we had it in a separate file? Having multiple columns with the "same" data can make filters or joins more difficult, don't you think?
I can create a PR so you guys can review and use it. Now I think we can work on bringing the clean logic into the fetch script and have just one script to handle the whole process, ok?
/cc @cuducos @jtemporal
Now we can pass as arguments how many threads to use and a list of HTTP proxies
That's awesome!
The script still takes time, but fetches faster than the old version.
🎉 Many thanks, @marcusrehm!
My only concern is: wouldn't it be easier to analyze the data if we had it in a separate file?
That's totally fine IMHO ; )
I can create a PR so you guys can review and use it.
Sure thing. Once the PR is opened I can offer a proper code review, but at a quick look it seems like a very good improvement — again, many thanks for that.
Once you open the PR, if you have a generated dataset you can add a link to it if you like ; )
Hey guys! It took more days than I thought, but PR #218 is there. When the dataset download finishes, I'll post the link here, ok?
Here's the link to download the dataset: https://we.tl/Zsw7zPhV6a
Does #218 (merged) close this issue? cc @jtemporal
Also, the dataset is not available on S3 yet, is it? Does anyone still have a copy so we can upload it? cc @jtemporal
I uploaded the file again https://we.tl/TTYzFTk5d8.
About closing the issue: if it was just about the acquisition of the partner list, then I think it's done, but I haven't done any analysis with it (yet). :)
cc @cuducos @jtemporal
Hi @Irio! Could you please clarify which are the sources to collect the data from? Or do we just need to be creative? :) Thank you.