okfn-brasil / serenata-toolbox

📦 pip module containing code shared across Serenata de Amor's projects | **This repository does not receive frequent updates**

Add company dataset abstraction #218

Open cuducos opened 4 years ago

cuducos commented 4 years ago

What is the purpose of this Pull Request?

This initial commit adds an abstraction to build the company dataset. It doesn't include tests yet because of the exploratory way I approached the problem, but tests are listed in the TODO list below.

What was done to achieve this purpose?

How to test if it really works?

$ serenata-toolbox --modules companies

This will look for datasets in data/ with a cnpj_cpf column, take the unique cnpj_cpf values found there, and generate a YYYY-MM-DD-companies.csv.xz.
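For context, a minimal sketch of that flow using pandas; the file patterns and the fetch_company helper below are hypothetical placeholders, not the code in this PR:

```python
from datetime import date
from pathlib import Path

import pandas as pd

DATA_DIR = Path("data")


def fetch_company(cnpj):
    """Hypothetical placeholder for the per-CNPJ fetch done by the new module."""
    return {"cnpj": cnpj}


# Collect the unique 14-digit cnpj_cpf values from every dataset in data/
cnpjs = set()
for dataset in DATA_DIR.glob("*.csv.xz"):
    df = pd.read_csv(dataset, dtype={"cnpj_cpf": str}, low_memory=False)
    if "cnpj_cpf" not in df.columns:
        continue
    cleaned = df["cnpj_cpf"].dropna().str.replace(r"\D", "", regex=True)
    cnpjs.update(value for value in cleaned if len(value) == 14)

# Build and save the consolidated dataset as YYYY-MM-DD-companies.csv.xz
companies = pd.DataFrame(fetch_company(cnpj) for cnpj in sorted(cnpjs))
output = DATA_DIR / f"{date.today():%Y-%m-%d}-companies.csv.xz"
companies.to_csv(output, index=False, compression="xz")
```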

Who can help reviewing it?

@sergiomario @g4brielvs @turicas

TODO

Note on performance:

My initial tests gave sub-optimal performance:

2019-11-15 20:06:23,908 - root - INFO - 200 companies fetched (4.12 companies/s)
2019-11-15 20:07:12,282 - root - INFO - 300 companies fetched (6.20 companies/s)
2019-11-15 20:07:58,779 - root - INFO - 400 companies fetched (8.60 companies/s)

However, nowadays we have ~95k distinct 14-digit cnpj_cpf values in Jarbas's reimbursements. At a pace of 6 companies per second, we can cover that dataset in roughly 4h.
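Back-of-the-envelope check of that estimate:

```python
# ~95k distinct CNPJs at ~6 companies/s (figures from the log above)
print(95_000 / 6 / 3600)  # ≈ 4.4 hours
```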

cuducos commented 4 years ago

Just generated a new companies file with 125k different companies (the one currently in use has 60k). This will not work on Jarbas just yet, but I'll open a PR over there. Meanwhile, can you upload this file to our DigitalOcean Spaces? I don't have access anymore ; ) cc @sergiomario

sergiomario commented 4 years ago

@cuducos I can't access the file through the link you suggested. I'm being directed to a branch comparison page on GitHub.

cuducos commented 4 years ago

Sorry, my bad! Fixed the link over there, but just in case: https://www.dropbox.com/s/8kxt3szujr3tksb/2019-11-19-companies.csv.xz?dl=0

sergiomario commented 4 years ago

Uploaded to the project storage at DigitalOcean Spaces: https://serenata-de-amor-data.nyc3.digitaloceanspaces.com/2019-11-19-companies.csv.xz

luizfzs commented 4 years ago

@cuducos are there any drawbacks to 'converting' the SQLite DB into the 'companies' files? I see that the approach uses the DB as a way to query the CNPJs found in the reimbursement files, right? Maybe it could 'export' the database into a 'companies' file, so that Rosie could reuse it.

cuducos commented 4 years ago

I opted for setting up a DB and querying it in order to end up with a smaller file.

Postgres is already a bottleneck for Jarbas's performance. Updating Jarbas's database is another bottleneck.

Thus I don't think using a 30+ GB database dump (which could be filtered down to a few MB) would be the better choice… and I'm not sure we have enough disk space for that in production.

Also, Rosie already has a memory bottleneck, and loading the full dump would make it even tighter.

Does that make sense, or am I over-engineering this?

luizfzs commented 4 years ago

It does make sense. I felt I was missing the bigger picture and your answer clarified that.

cuducos commented 4 years ago

@luizfzs do you think you can handle the PR on Jarbas? I can help with that.

Basically what is needed is to adapt the Company model to this new, updated CSV. I was wondering: maybe we can use JSONField for activities and partners and simplify the architecture. What do you think?
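To make the idea concrete, a rough sketch of what the Jarbas model could look like with JSON columns (field names are illustrative, not the actual schema; models.JSONField assumes Django 3.1+, otherwise the Postgres-specific JSONField would be needed):

```python
from django.db import models


class Company(models.Model):
    cnpj = models.CharField(max_length=14, unique=True)
    name = models.CharField(max_length=255)
    # Nested data from the CSV kept as JSON instead of separate related tables
    main_activity = models.JSONField(default=dict)
    secondary_activities = models.JSONField(default=list)
    partners = models.JSONField(default=list)
```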

luizfzs commented 4 years ago

@cuducos well, I'd like to give it a try. Is there an issue created that I can refer to?

cuducos commented 4 years ago

I just suggested a road map at https://github.com/okfn-brasil/serenata-de-amor/issues/509#issuecomment-574730299 ; )

cuducos commented 3 years ago

I couldn't finish this contribution, but I believe it doesn't make sense to finish it anymore — at least not in the direction I started.

Now we can use the minhareceita.org API and maybe keep only the OpenStreetMap Nominatim API drafted here.
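A minimal sketch of that direction (the minhareceita.org endpoint, its response fields, and the example CNPJ are assumptions on my part; Nominatim asks for a descriptive User-Agent and a low request rate):

```python
import requests

cnpj = "00000000000191"  # example CNPJ, digits only

# Company data on demand from the Minha Receita API, no local dump required
company = requests.get(f"https://minhareceita.org/{cnpj}", timeout=30).json()

# Geocoding via OpenStreetMap's Nominatim
address = f"{company.get('logradouro', '')}, {company.get('municipio', '')}, Brasil"
places = requests.get(
    "https://nominatim.openstreetmap.org/search",
    params={"q": address, "format": "json", "limit": 1},
    headers={"User-Agent": "serenata-toolbox-sketch"},
    timeout=30,
).json()
latitude, longitude = (places[0]["lat"], places[0]["lon"]) if places else (None, None)
```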

Gonna try to change this PR substantially, and I'll --force push to avoid keeping code that never hit the main branch, ok?