okfn-brasil / jarbas

🎩 API for information and suspicions about reimbursements by Brazilian congresspeople
https://jarbas.serenata.ai/
296 stars 61 forks source link

Delegate reimbursement import to Celery #240

Closed cuducos closed 7 years ago

cuducos commented 7 years ago

This is a work in progress, I just created it earlier to have feedbacks. The aim, when complete, is to address part of #217.

What is the purpose of this Pull Request? The purpose of this PR is make the process of importing reimbursement quicker and without manual steps as described in #217.

What was done to achieve this purpose? I introduced Celery to perform some task asynchronously: that way the heavy work of the import is handled in the background.

How to test if it really works?

  1. Install RabbitMQ if not using Docker — brew install rabbitmq or apt-get install rabbitmq-server should do the job for macOS and any Debian based Linux such as Ubuntu (Docker environment is ready, no need to worry about it)
  2. Install new packages: $ python -m pip install -U -r requirements.txt
  3. Run tests $ python manage.py test
  4. Get Celery started: $ celery worker --app jarbas
  5. Run reimbursement import: $ python manage.py reimbursements <path to reimbursements.xz>

Who can help reviewing it? @jtemporal @lipemorais @davinirjr @augustogoulart @adrianomargarin

What is out of scope In #217 I mentioned creating a API endpoint so Rosie can send POSTs with reimbursement data. I think this is an issue for a new PR when this one is finished, reviewed, stable an merged.

Next steps in this PR

Next steps

decko commented 7 years ago

Testing with sample data seems Ok. Ready to Merge.

cuducos commented 7 years ago

Testing with sample data seems Ok. Ready to Merge.

In PVT @decko told me that actually it wasn't working with the Celery (worker) running in a container. So we still need to fix that before considering merging it. Is that right?

cuducos commented 7 years ago

Ok… I'm back to this PR.

I merged the novelties from master branch and studied this PR a bit more to advance it further.

The master branch imported the sample in 34s (counting the time to start Docker containers) and this branch did it in 24s (idem). We were saving 30% of the time, which is great.

Then I moved the serialize part to the async task (42e512a). The importing command took 15s — we're saving more than 50% of time now, yay!

From that I think the Celery part is indeed working and this PR is good for code review and, if that's the case, merging… cc @anaschwendler

Many thanks @decko for the tests and the ideas ; )


UPDATE

How to test this PR:

  1. First I'd run python manage.py reimbursements /mnt/data/reimbursements_sample.xz in the master branch, and time it (basically you can run time python manage.py reimbursements /mnt/data/reimbursements_sample.xz)
  2. Then I'd move to cuducos-celery branch, delete all reimbursements (python manage.py shell_plus and then Reimbursements.objects.all().delete() might do the job), and time the import again.

If at the end of the process the a) the importing was quicker in this branch when compared to master, and b) reimbursements were actually saved in the database (python manage.py shell_plus and then Reimbursements.objects.count() might do the job), then it's working as predicted ; )

anaschwendler commented 7 years ago

Hello @cuducos and @decko!

I'll start testing this PR little by little and how (so it will be edited along the way):

  1. Clone the repository:

    $ git clone git@github.com:datasciencebr/jarbas.git
  2. Open the repo folder:

    $ cd jarbas

I'll try to run with Docker, even if its showed without, I'll try :)

  1. Copy the .env file:

    $ cp contrib/.env.sample env
  2. Build and start services:

    $ docker-compose up -d
  3. Create the database and apply the migrations:

    $ docker-compose run --rm django python manage.py migrate

I'll start testing as @cuducos suggest in the last comment:

  1. Run python manage.py reimbursements /mnt/data/reimbursements_sample.xz in the master branch, and time it:
    $ time docker-compose run --rm django python manage.py reimbursements /mnt/data/reimbursements_sample.xz

And the result:

Starting jarbas_postgres_1 ... 
Starting jarbas_postgres_1 ... done
Starting with 0 reimbursements
Processed: 1000 lines                                                                                                                        Updated: 0 reimbursements                                                                                                                    Created: 1000 reimbursements                                                                                                                 Skip: 0 reimbursements                                                                                                                       
docker-compose run --rm django python manage.py reimbursements   1.13s user 0.34s system 11% cpu 13.221 total

Ok, now let's test with @cuducos PR:

  1. Checkout to @cuducos branch:

    $ git checkout -b cuducos-celery origin/cuducos-celery
  2. Update the branch:

    $ git merge master
  3. Delete all reimbursements:

    $ docker-compose run --rm django python manage.py shell_plus
    > Reimbursement.objects.all().delete()

The result:

>>> Reimbursement.objects.all().delete()
(1000, {'core.Tweet': 0, 'core.Reimbursement': 1000})
  1. Import again:
    $ time docker-compose run --rm django python manage.py reimbursements /mnt/data/reimbursements_sample.xz

The result:

Starting jarbas_queue_1 ... 
Starting jarbas_queue_1 ... done
Starting jarbas_postgres_1 ... done
Starting jarbas_tasks_1 ... done
1000 reimbursements scheduled to creation/update                       
docker-compose run --rm django python manage.py reimbursements   1.11s user 0.26s system 13% cpu 9.829 total

It gave me a diference of 2secs by testing 3 times and getting the average time 🎉

Is it ok @cuducos ? Can I merge?

davinirjr commented 7 years ago

Nice! Good work, people, congrats!