okfn-brasil / serenata-toolbox

📦 pip module containing code shared across Serenata de Amor's projects | **This repository does not receive frequent updates**

Add company dataset abstraction #218

Open cuducos opened 4 years ago

cuducos commented 4 years ago

What is the purpose of this Pull Request?

This initial commit adds an abstraction to build the company dataset. It doesn't include tests yet because of the exploratory way I approached the problem, but tests are listed in the TODO list below.

What was done to achieve this purpose?

How to test if it really works?

$ serenata-toolbox --modules companies

This will look for datasets in data/ with a cnpj_cpf column, take the unique cnpj_cpf values found there, and generate a YYYY-MM-DD-companies.csv.xz.
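For context, a minimal sketch of that flow using pandas; the file patterns and the fetch_company helper below are hypothetical placeholders, not the code in this PR:

```python
from datetime import date
from pathlib import Path

import pandas as pd

DATA_DIR = Path("data")


def fetch_company(cnpj):
    """Hypothetical placeholder for the per-CNPJ fetch done by the new module."""
    return {"cnpj": cnpj}


# Collect the unique 14-digit cnpj_cpf values from every dataset in data/
cnpjs = set()
for dataset in DATA_DIR.glob("*.csv.xz"):
    df = pd.read_csv(dataset, dtype={"cnpj_cpf": str}, low_memory=False)
    if "cnpj_cpf" not in df.columns:
        continue
    cleaned = df["cnpj_cpf"].dropna().str.replace(r"\D", "", regex=True)
    cnpjs.update(value for value in cleaned if len(value) == 14)

# Build and save the consolidated dataset as YYYY-MM-DD-companies.csv.xz
companies = pd.DataFrame(fetch_company(cnpj) for cnpj in sorted(cnpjs))
output = DATA_DIR / f"{date.today():%Y-%m-%d}-companies.csv.xz"
companies.to_csv(output, index=False, compression="xz")
```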

Who can help reviewing it?

@sergiomario @g4brielvs @turicas

TODO

Note on performance:

My initial tests gave sub-optimal performance:

2019-11-15 20:06:23,908 - root - INFO - 200 companies fetched (4.12 companies/s)
2019-11-15 20:07:12,282 - root - INFO - 300 companies fetched (6.20 companies/s)
2019-11-15 20:07:58,779 - root - INFO - 400 companies fetched (8.60 companies/s)

However, nowadays we have ~95k distinct 14-digit cnpj_cpf values in Jarbas's reimbursements. At a pace of 6 companies per second, we can cover that dataset in roughly 4h.
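Back-of-the-envelope check of that estimate:

```python
# ~95k distinct CNPJs at ~6 companies/s (figures from the log above)
print(95_000 / 6 / 3600)  # ≈ 4.4 hours
```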

cuducos commented 4 years ago

Just generated a new companies file with 125k different companies (the one currently in use has 60k). This will not work on Jarbas just yet, but I'll open a PR over there. Meanwhile, can you upload this file to our DigitalOcean Spaces? I don't have access anymore ; ) cc @sergiomario

sergiomario commented 4 years ago

@cuducos I can't access the file through the link you suggested. I'm being directed to a branch comparison page on GitHub.

cuducos commented 4 years ago

Sorry, my bad! Fixed the link over there, but just in case: https://www.dropbox.com/s/8kxt3szujr3tksb/2019-11-19-companies.csv.xz?dl=0

sergiomario commented 4 years ago

Uploaded to the project storage at DigitalOcean Spaces: https://serenata-de-amor-data.nyc3.digitaloceanspaces.com/2019-11-19-companies.csv.xz

luizfzs commented 4 years ago

@cuducos are there any drawbacks to 'converting' the SQLite DB into the 'companies' files? I see that the approach uses the DB as a way to query the CNPJs found in the reimbursement files, right? Maybe it could 'export' the database into a 'companies' file, so that Rosie could reuse it.

cuducos commented 4 years ago

I opted for setting up a DB and querying it in order to end up with a smaller file.

Postgres is already a bottleneck for Jarbas's performance. Updating Jarbas's database is another bottleneck.

Thus I don't think using a 30+ GB database dump (which could be filtered down to a few MB) would be the better choice… and I'm not sure we have enough disk space for that in production.

Also, Rosie already has a memory bottleneck, and loading the full dump would make it even tighter.

Does that make sense, or am I over-engineering this?

luizfzs commented 4 years ago

It does make sense. I felt I was missing the bigger picture and your answer clarified that.

cuducos commented 4 years ago

@luizfzs do you think you can handle the PR on Jarbas? I can help with that.

Basically what is needed is to adapt the Company model to this new, updated CSV. I was wondering: maybe we can use JSONField for activities and partners and simplify the architecture. What do you think?
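To make the idea concrete, a rough sketch of what the Jarbas model could look like with JSON columns (field names are illustrative, not the actual schema; models.JSONField assumes Django 3.1+, otherwise the Postgres-specific JSONField would be needed):

```python
from django.db import models


class Company(models.Model):
    cnpj = models.CharField(max_length=14, unique=True)
    name = models.CharField(max_length=255)
    # Nested data from the CSV kept as JSON instead of separate related tables
    main_activity = models.JSONField(default=dict)
    secondary_activities = models.JSONField(default=list)
    partners = models.JSONField(default=list)
```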

luizfzs commented 4 years ago

@cuducos well, I'd like to give it a try. Is there an issue created that I can refer to?

cuducos commented 4 years ago

I just suggested a road map at https://github.com/okfn-brasil/serenata-de-amor/issues/509#issuecomment-574730299 ; )

cuducos commented 3 years ago

I couldn't finish this contribution, but I believe it doesn't make sense to finish it anymore — at least not in the direction I started.

Now we can use the minhareceita.org API and maybe keep only the OpenStreetMap Nominatim API drafted here.
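A minimal sketch of that direction (the minhareceita.org endpoint, its response fields, and the example CNPJ are assumptions on my part; Nominatim asks for a descriptive User-Agent and a low request rate):

```python
import requests

cnpj = "00000000000191"  # example CNPJ, digits only

# Company data on demand from the Minha Receita API, no local dump required
company = requests.get(f"https://minhareceita.org/{cnpj}", timeout=30).json()

# Geocoding via OpenStreetMap's Nominatim
address = f"{company.get('logradouro', '')}, {company.get('municipio', '')}, Brasil"
places = requests.get(
    "https://nominatim.openstreetmap.org/search",
    params={"q": address, "format": "json", "limit": 1},
    headers={"User-Agent": "serenata-toolbox-sketch"},
    timeout=30,
).json()
latitude, longitude = (places[0]["lat"], places[0]["lon"]) if places else (None, None)
```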

Gonna try to change this PR substantially, and I'll --force push to avoid keeping code that never hit the main branch, ok?