okfn-brasil / jarbas

🎩 API for information and suspicions about reimbursements by Brazilian congresspeople
https://jarbas.serenata.ai/

Delegate serialization to Celery on data import #282

Closed cuducos closed 6 years ago

cuducos commented 6 years ago

What is the purpose of this Pull Request? When trying to import data on the production server, it looks like rows is unable to load and convert all data types for the whole 1.6-million-line source CSV: the process just freezes after uncompressing the .xz file. This PR tries to fix that issue.
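For context, the eager path that froze looks roughly like this sketch (not the command's literal code; it assumes the .xz was already decompressed to a plain CSV):

    import rows

    # rows parses and type-converts all ~1.6 million rows up front,
    # before anything can be queued or saved
    table = rows.import_from_csv('reimbursements.csv')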

What was done to achieve this purpose? The huge CSV is now loaded line by line with the native csv.DictReader, and all fields are kept as raw strings when passed to Celery. The conversion of data types then happens row by row in the async/background process, instead of for the whole CSV at once (rows' default behavior).
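To make that concrete, here is a minimal sketch of the approach, not the PR's actual code: the names create_reimbursement and import_reimbursements, the model path jarbas.core.models.Reimbursement, and the single float conversion are illustrative assumptions standing in for the real per-field type mapping.

    import csv
    import lzma

    from celery import shared_task

    from jarbas.core.models import Reimbursement  # assumed model path


    @shared_task
    def create_reimbursement(row):
        """Convert one row's raw string fields and persist it.

        Runs in the Celery worker, so the type conversion that used to
        freeze the import now happens row by row, in the background.
        """
        # illustrative conversion; the real code maps every field to
        # its proper type (int, Decimal, date, etc.)
        row['total_net_value'] = float(row['total_net_value'])
        document_id = row.pop('document_id')
        Reimbursement.objects.update_or_create(
            document_id=document_id, defaults=row
        )


    def import_reimbursements(path):
        # stream the .xz-compressed CSV instead of loading it whole;
        # csv.DictReader yields each row as a dict of plain strings,
        # which is JSON-serializable and cheap to send to the broker
        with lzma.open(path, mode='rt', encoding='utf-8') as handle:
            for row in csv.DictReader(handle):
                create_reimbursement.delay(row)

The key point is that only plain strings cross the process boundary, so enqueueing stays cheap and the memory-heavy conversion is spread across the background workers.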

How to test if it really works? Locally, just run the command to import reimbursements and check the database or the UI to make sure it worked. To get an idea of whether this will work on the production server, one might try running it in a virtual machine with 4 GB of RAM (personally I haven't tested it, but once it's merged I can run the command in production and check it).
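For reference, the import command is the same one used in the test steps below:

    $ docker-compose run --rm django python manage.py reimbursements /mnt/data/reimbursements_sample.xz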

Who can help reviewing it? @anaschwendler @irio

anaschwendler commented 6 years ago

Hi @cuducos, what I plan to do to test this PR:

Locally, just run the command to import reimbursements and check the database or the UI to make sure it worked.

And I'll leave this part for you to check whether it works:

(personally I haven't tested it, but once it's merged I can run the command in production and check it)

What I did to test this PR:

  1. Clone the repository:

    $ git clone git@github.com:datasciencebr/jarbas.git
  2. Open the repo folder:

    $ cd jarbas
  3. Checkout to @cuducos branch:

    $ git checkout -b cuducos-lazy-rows origin/cuducos-lazy-rows
  4. Update the branch:

    $ git merge master
  5. Copy the .env file:

    $ cp contrib/.env.sample .env
  6. Build and start services:

    $ docker-compose up -d
  7. Create the database and apply the migrations:

    $ docker-compose run --rm django python manage.py migrate
    $ docker-compose run --rm django python manage.py searchvector
  8. Seed it with sample data:

    $ docker-compose run --rm django python manage.py reimbursements /mnt/data/reimbursements_sample.xz
    $ docker-compose run --rm django python manage.py companies /mnt/data/companies_sample.xz
    $ docker-compose run --rm django python manage.py suspicions /mnt/data/suspicions_sample.xz
    $ docker-compose run --rm django python manage.py tweets
  9. Check the database or the UI to make sure it worked. To do that, I'll access localhost:8000/dashboard/: [dashboard screenshot]
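Besides the dashboard, a quick sanity check from the Django shell would look like this (a sketch; the model path jarbas.core.models is an assumption):

    $ docker-compose run --rm django python manage.py shell
    >>> from jarbas.core.models import Reimbursement  # assumed model path
    >>> Reimbursement.objects.count()  # grows as the Celery workers drain the queue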

And it seems to be working; it looks OK to me :)