okfn-brasil / serenata-de-amor

🕵 Artificial Intelligence for social control of public administration | **This repository does not receive frequent updates. Check out the README**
https://serenata.ai/en
MIT License

Rosie stops mid-classification due to MemoryError on a 32 GB RAM machine #560


ogecece commented 2 years ago

Rosie stopped tweeting a while back, and the MemoryError below was the reason.

Over the last few weeks, @andreformento diagnosed this locally and we tested it on the production infrastructure.

Here's the full traceback from running `python3 rosie.py run chamber_of_deputies` on a standard DigitalOcean Droplet with 8 vCPUs and 32 GB of RAM:

2021-08-23 22:32:46,878 - rosie.chamber_of_deputies.adapter - INFO - Updating companies
Downloading 2016-09-03-companies.xz: 100%|████████████████████████████████████████████████████████████████████████████| 4.84M/4.84M [00:00<00:00, 34.5Mb/s]
2021-08-23 22:32:47,051 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2009
2021-08-23 22:33:05,802 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2010
2021-08-23 22:33:27,758 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2011
2021-08-23 22:33:52,820 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2012
2021-08-23 22:34:14,875 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2013
2021-08-23 22:34:39,627 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2014
2021-08-23 22:35:00,156 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2015
2021-08-23 22:35:24,343 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2016
2021-08-23 22:35:47,603 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2017
2021-08-23 22:36:10,159 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2018
2021-08-23 22:36:29,338 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2019
2021-08-23 22:36:47,928 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2020
2021-08-23 22:36:58,705 - rosie.chamber_of_deputies.adapter - INFO - Updating reimbursements from 2021
2021-08-23 22:37:07,120 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2018.csv
2021-08-23 22:37:08,965 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2014.csv
2021-08-23 22:37:11,514 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2010.csv
2021-08-23 22:37:14,283 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2020.csv
2021-08-23 22:37:16,251 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2012.csv
2021-08-23 22:37:19,527 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2013.csv
2021-08-23 22:37:23,982 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2011.csv
2021-08-23 22:37:29,628 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2009.csv
2021-08-23 22:37:33,911 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2015.csv
2021-08-23 22:37:39,087 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2021.csv
2021-08-23 22:37:43,265 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2017.csv
2021-08-23 22:37:50,403 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2019.csv
2021-08-23 22:37:57,065 - rosie.chamber_of_deputies.adapter - INFO - Loading reimbursements from /tmp/serenata-data/reimbursements-2016.csv
2021-08-23 22:38:03,934 - rosie.chamber_of_deputies.adapter - INFO - Loading companies
2021-08-23 22:38:22,833 - rosie.chamber_of_deputies.adapter - INFO - Categorizing reimbursements
2021-08-23 22:38:24,119 - rosie.chamber_of_deputies.adapter - INFO - Coercing issue_date column to date data type
2021-08-23 22:38:25,018 - rosie.chamber_of_deputies.adapter - INFO - Coercing situation_date column to date data type
2021-08-23 22:38:39,961 - rosie.chamber_of_deputies.adapter - INFO - Renaming columns to Serenata de Amor standard
2021-08-23 22:38:39,962 - rosie.chamber_of_deputies.adapter - INFO - Dataset ready! Rosie starts her analysis now :)
2021-08-23 22:39:10,942 - rosie.core - INFO - Running classifier 1 of 6: meal_price_outlier
2021-08-23 22:40:08,740 - rosie.core - INFO - Running classifier 2 of 6: over_monthly_subquota_limit
2021-08-23 22:44:21,321 - rosie.core - INFO - Running classifier 3 of 6: suspicious_traveled_speed_day
Traceback (most recent call last):
  File "rosie.py", line 64, in <module>
    main()
  File "rosie.py", line 60, in main
    run(module, arguments['--output'])
  File "rosie.py", line 34, in run
    module.main(directory)
  File "/opt/serenata-de-amor/rosie/rosie/chamber_of_deputies/__init__.py", line 9, in main
    core()
  File "/opt/serenata-de-amor/rosie/rosie/core/__init__.py", line 45, in __call__
    self.predict(model, name)
  File "/opt/serenata-de-amor/rosie/rosie/core/__init__.py", line 73, in predict
    prediction = model.predict(self.dataset)
  File "/opt/serenata-de-amor/rosie/rosie/chamber_of_deputies/classifiers/traveled_speeds_classifier.py", line 70, in predict
    is_outlier = self.__applicable_rows(_X) & \
  File "/opt/serenata-de-amor/rosie/rosie/chamber_of_deputies/classifiers/traveled_speeds_classifier.py", line 100, in __applicable_rows
    X[['latitude', 'longitude']].notnull().all(axis=1)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 2918, in __getitem__
    data = self._take_with_is_copy(indexer, axis=1)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 3363, in _take_with_is_copy
    result = self.take(indices=indices, axis=axis)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 3348, in take
    self._consolidate_inplace()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 5216, in _consolidate_inplace
    self._protect_consolidate(f)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 5205, in _protect_consolidate
    result = f()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 5214, in f
    self._mgr = self._mgr.consolidate()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 983, in consolidate
    bm._consolidate_inplace()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 988, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1909, in _consolidate
    list(group_blocks), dtype=dtype, can_consolidate=_can_consolidate
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1934, in _merge_blocks
    new_values = new_values[argsort]
MemoryError

Because of this, the pipeline stops before it completes and Jarbas isn't updated, so no new data was being registered to be tweeted.

A PR (#561) has been opened as a temporary workaround, but any help on how to reduce memory consumption would be appreciated.
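One possible direction, offered as a sketch rather than a confirmed fix: the traceback shows the error being raised while pandas consolidates its internal blocks, and that consolidation is reached through the two-column selection `X[['latitude', 'longitude']]` in `__applicable_rows`. Checking each column as a separate Series gives the same boolean mask. Whether it actually avoids the consolidation copy on the pandas version in production is an assumption that would need to be measured on the real dataset. The function names and sample values below are hypothetical.

```python
import numpy as np
import pandas as pd


def applicable_rows_two_column(X: pd.DataFrame) -> pd.Series:
    """Expression shown in the traceback: selects both columns at once."""
    return X[['latitude', 'longitude']].notnull().all(axis=1)


def applicable_rows_column_wise(X: pd.DataFrame) -> pd.Series:
    """Hypothetical alternative: check each column as a Series, then combine."""
    return X['latitude'].notnull() & X['longitude'].notnull()


# Tiny illustrative frame with made-up values (not project data)
sample = pd.DataFrame({
    'latitude': [-23.55, np.nan, -15.78, np.nan],
    'longitude': [-46.63, -43.17, np.nan, np.nan],
    'total_net_value': [10.0, 20.0, 30.0, 40.0],
})

mask_two_column = applicable_rows_two_column(sample)
mask_column_wise = applicable_rows_column_wise(sample)

# Both approaches must flag exactly the same rows
assert mask_two_column.equals(mask_column_wise)
```

Only rows where both coordinates are present are kept in either version, so swapping one for the other should not change Rosie's results; the open question is purely the memory behavior.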

Kudos @andreformento!

andreformento commented 2 years ago

I created this PR https://github.com/okfn-brasil/serenata-de-amor/pull/562 to make it possible to run Rosie using only the most recent years :eyes: I know it's not an optimization, but it makes it possible to run with fewer resources.
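To illustrate the idea of running on recent years only, here is a minimal sketch. It is not the code from the PR. It assumes the `reimbursements-<year>.csv` naming seen in the log above; the helper `latest_years` and its `keep` parameter are hypothetical.

```python
import re
from pathlib import Path
from typing import Iterable, List

# File naming taken from the log above: reimbursements-<year>.csv
YEAR_PATTERN = re.compile(r'reimbursements-(\d{4})\.csv$')


def latest_years(paths: Iterable[Path], keep: int) -> List[Path]:
    """Return the paths for the `keep` most recent years, oldest first."""
    dated = []
    for path in paths:
        match = YEAR_PATTERN.search(path.name)
        if match:  # skip files that do not follow the naming pattern
            dated.append((int(match.group(1)), path))
    dated.sort(key=lambda item: item[0])
    # Guard keep <= 0 explicitly, since dated[-0:] would return everything
    return [path for _, path in dated[-keep:]] if keep > 0 else []


# The thirteen yearly files listed in the log (2009 to 2021)
all_files = [Path('/tmp/serenata-data') / f'reimbursements-{year}.csv'
             for year in range(2009, 2022)]

recent_files = latest_years(all_files, keep=3)
recent_names = [path.name for path in recent_files]
print(recent_names)
```

Loading fewer yearly files shrinks the dataset before any classifier runs, which trades historical coverage for a lower peak memory footprint.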