okfn-brasil / jarbas

🎩 API for information and suspicions about reimbursements by Brazilian congresspeople
https://jarbas.serenata.ai/
296 stars 61 forks source link

Use rows to serialize reimbursements #242

Closed cuducos closed 7 years ago

cuducos commented 7 years ago

This PR is part of the inclusion of rows as suggested in #211, starting with the process of importing reimbursements to Postgres.

Not sure if it's the best implementation so labeled it as an work in progress and I'd like to invite @turicas for a code review. In general my doubts are:

Is there a cleaver way to force conversion of data such as '2011.0' to int?

https://github.com/datasciencebr/jarbas/blob/7d030103f5c67dea214454173f22f16688a28497/jarbas/core/fields.py#L7-L16

Is there a cleaver way to force conversion of data such as '2014-02-16T00:00:00' to date?

https://github.com/datasciencebr/jarbas/blob/7d030103f5c67dea214454173f22f16688a28497/jarbas/core/fields.py#L19-L29

Can rows rename columns? Actually I'm converting each row (namedtuple) to a OrderedDict, then to a dict and them using this method:

https://github.com/datasciencebr/jarbas/blob/7d030103f5c67dea214454173f22f16688a28497/jarbas/core/management/commands/reimbursements.py#L53-L63


Thanks @priscilaportela for pairing!

turicas commented 7 years ago

@cuducos, answering the questions:

1- If this is supported right now, then the type identification algorithm will not know when it's an integer or float, leading to some problems. So, by now your way is the best one but I'd replace value = float(value) with value = int(float(value)) since it's expected an int object from this deserialize method.

2- Yes! You can force the INPUT_FORMAT attribute or even create a new class (I prefer the second option):

import rows
# First method
old_format = rows.fields.DateField.INPUT_FORMAT
rows.fields.DateField.INPUT_FORMAT = '%Y-%m-%dT%H:%M:%S'
value = rows.fields.DateField.deserialize('2014-02-16T00:00:00')
rows.fields.DateField.INPUT_FORMAT = old_format

#Second method
class MyDateField(rows.fields.DateField):
    INPUT_FORMAT = '%Y-%m-%dT%H:%M:%S'
value = MyDateField.deserialize('2014-02-16T00:00:00')

3- The current way of doing this is using rows.transform (you can see an example here) but I'm not sure this is the best solution for these cases (still need to find a better API). The use case of renaming/removing fields is very common, so I'm thinking in a more straightforward way of implementing this on the library. This question is also related to this issue: https://github.com/turicas/rows/issues/30 (it may be a better way of having the conversion you want, but it's not "officially" supported by the library).

cuducos commented 7 years ago

This weekend I gave this PR another try and the results were quite good when combined with Celery. The process of serializing data is basically done by rows, which is pretty fast compared to what we had before (30% faster in my local tests, cc @decko).

The PR includes two new libraries when compared to master, so in order to test it one have to temporary change the docker-compose.yml to use the Dockerfile:

  tasks:
-    image: datasciencebr/jarbas-backend
+    build:
+      context: .
    environment:

After this change just docker-compose build and then proceeds to the regular tests (in this case, importing reimbursements). If this is a merge, just revert the temporary changes to docker-compose.yml ; )

anaschwendler commented 7 years ago

Hi @cuducos @turicas !

Nice that we will have rows in our project, great project, and the collaboration is amazing. I'll test that PR my way, but already 👍 !

  1. Clone the project:

    $ git clone git@github.com:datasciencebr/jarbas.git
  2. Change to Jarbas folder:

    $ cd jarbas
  3. Checked out to @cuducos PR:

    $ git checkout -b cuducos-rows origin/cuducos-rows
  4. Merge with the master content:

    $ git merge master
  5. Run the project:

    $ docker-compose up -d

Change the docker-compose.yml file as @cuducos says in his last message

  1. Run docker-compose build again:
    $ docker-compose build

I think that the best way to test if it is working is by running the tests.

  1. Run the tests:
    $ docker-compose run --rm django python manage.py check
    $ docker-compose run --rm django python manage.py test

Result:

➜  jarbas git:(cuducos-rows) ✗ docker-compose run --rm django python manage.py test
Starting jarbas_queue_1 ... 
Starting jarbas_postgres_1 ... done
Starting jarbas_queue_1 ... done
Starting jarbas_tasks_1 ... done
Creating test database for alias 'default'...
System check identified no issues (0 silenced).
.........................................................................................................................................................................
----------------------------------------------------------------------
Ran 169 tests in 5.083s

OK
Destroying test database for alias 'default'...

Ok, got it, works fine, merging it!