raft-tech / TANF-app

Repo for development of a new TANF Data Reporting System

Reparse Memory Management #3172

Closed elipe17 closed 2 months ago

elipe17 commented 2 months ago

Summary of Changes

How to Test

List the steps to test the PR. These steps are generic; please adjust as necessary.

cd tdrs-frontend && docker-compose up
cd tdrs-backend && docker-compose up
  1. Open http://localhost:3000/ and sign in.
  2. Submit ADS.E2J.NDM1.TS53_fake.txt four or five times so the DB holds one million or more records. (This goes faster if you add more celery workers to gunicorn_start.sh.)
  3. Exec into the web container and run python manage.py clean_and_reparse -y 2023. Do not use the -a argument: with -a the queryset is not passed to Elastic DSL, and passing the queryset is what pulls the records into memory, so the issue will not reproduce.
  4. Monitor the backend memory consumption while records are being deleted. It should increase, but only by around 50-100MB.
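
The memory behaviour these steps check for can be illustrated with a minimal sketch of batched deletion. This is not the code in this PR: `iter_batches`, `delete_in_batches`, and the `delete_batch` callback are hypothetical names, and the default batch size is an arbitrary assumption. The sketch only shows the general idea of handing records to a delete routine one bounded page at a time instead of materializing the whole set.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_batches(ids: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield lists of at most batch_size ids, pulling lazily from the source."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def delete_in_batches(
    ids: Iterable[int],
    delete_batch: Callable[[List[int]], None],
    batch_size: int = 10_000,  # arbitrary value chosen for illustration
) -> int:
    """Call delete_batch once per page and return the number of ids processed.

    Only one page of ids is held in memory at a time, so peak memory is
    bounded by batch_size rather than by the total number of records.
    """
    deleted = 0
    for batch in iter_batches(ids, batch_size):
        delete_batch(batch)  # hypothetical hook, e.g. a DB or search-index delete
        deleted += len(batch)
    return deleted


if __name__ == "__main__":
    pages: List[List[int]] = []
    total = delete_in_batches(range(25), pages.append, batch_size=10)
    print(total, [len(page) for page in pages])
```

In a real command the callback would wrap whatever delete call the backend uses; the sketch leaves that part out on purpose.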

Deliverables

More details on how deliverables herein are assessed included here.

Deliverable 1: Accepted Features

Checklist of ACs:

Deliverable 2: Tested Code

Deliverable 3: Properly Styled Code

Deliverable 4: Accessible

Deliverable 5: Deployed

Deliverable 6: Documented

Deliverable 7: Secure

Deliverable 8: User Research

Research product(s) clearly articulate(s):

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.66%. Comparing base (95fc24b) to head (a981311). Report is 49 commits behind head on develop.

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##           develop    #3172   +/-   ##
========================================
  Coverage    92.66%   92.66%
========================================
  Files           47       47
  Lines         1009     1009
  Branches       169      169
========================================
  Hits           935      935
  Misses          42       42
  Partials        32       32
```

| Flag | Coverage Δ | |
|---|---|---|
| dev-frontend | `92.66% <ø> (ø)` | |

Flags with carried forward coverage won't be shown.

[Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/raft-tech/TANF-app/pull/3172?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=raft-tech).

> **Legend**
> `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`
> Last update 95fc24b...a981311.

ADPennington commented 2 months ago

i will revisit monday but results of first attempt below @elipe17 :

vcap@:~$ python manage.py clean_and_reparse -y 2023 -q Q1

You have selected to reparse datafiles for FY 2023 and Q1. The reparsed files will NOT be stored in new indices and the old indices
These options will delete and reparse (90) datafiles.
Continue [y/n]? y

INSIDE FILE COUNTS MATCH:
66, 66, 0

INSIDE FILE COUNTS MATCH:
66, 66, 0

2024-09-20 21:39:21,390 INFO clean_and_reparse.py::_backup:L48 :  Beginning reparse DB Backup.
Beginning reparse DB Backup.
2024-09-20 21:39:21,394 INFO db_backup.py::get_system_values:L54 :  Using postgres client at: /home/vcap/deps/0/apt/usr/lib/postgresql/15/bin/
Using postgres client at: /home/vcap/deps/0/apt/usr/lib/postgresql/15/bin/
2024-09-20 21:39:21,396 INFO db_backup.py::backup_database:L89 :  Executing backup command: /home/vcap/deps/0/apt/usr/lib/postgresql/15/bin/pg_dump -Fc --no-acl -f /tmp/reparsing_backup_FY_2023_Q1_rpv8.pg -d postgres://.rds.amazonaws.com:5432/tdp_db_qasp
Executing backup command: /home/vcap/deps/0/apt/usr/lib/postgresql/15/bin/pg_dump -Fc --no-acl -f /tmp/reparsing_backup_FY_2023_Q1_rpv8.pg -d postgres://.rds.amazonaws.com:5432/tdp_db_qasp
2024-09-20 21:40:07,398 INFO db_backup.py::backup_database:L94 :  Successfully executed backup. Wrote pg dumpfile to /tmp/reparsing_backup_FY_2023_Q1_rpv8.pg
Successfully executed backup. Wrote pg dumpfile to /tmp/reparsing_backup_FY_2023_Q1_rpv8.pg
2024-09-20 21:40:07,577 INFO db_backup.py::backup_database:L104 :  Pg dumpfile size in bytes: 283788764.
Pg dumpfile size in bytes: 283788764.
2024-09-20 21:40:07,577 INFO db_backup.py::upload_file:L176 :  Uploading /tmp/reparsing_backup_FY_2023_Q1_rpv8.pg to S3.
Uploading /tmp/reparsing_backup_FY_2023_Q1_rpv8.pg to S3.
2024-09-20 21:40:10,198 INFO db_backup.py::upload_file:L189 :  Successfully uploaded /tmp/reparsing_backup_FY_2023_Q1_rpv8.pg to s3:///backup/tmp/reparsing_backup_FY_2023_Q1_rpv8.pg.
Successfully uploaded /tmp/reparsing_backup_FY_2023_Q1_rpv8.pg to s3:///backup/tmp/reparsing_backup_FY_2023_Q1_rpv8.pg.
2024-09-20 21:40:10,201 INFO db_backup.py::main:L329 :  Deleting /tmp/reparsing_backup_FY_2023_Q1_rpv8.pg from local storage.
Deleting /tmp/reparsing_backup_FY_2023_Q1_rpv8.pg from local storage.
2024-09-20 21:40:10,223 INFO backup_db.py::handle:L36 :  Cloud backup/restore job complete.
Cloud backup/restore job complete.
2024-09-20 21:40:10,224 INFO clean_and_reparse.py::_backup:L50 :  Backup complete! Commencing clean and reparse.
Backup complete! Commencing clean and reparse.
2024-09-20 21:40:12,140 INFO clean_and_reparse.py::_delete_summaries:L88 :  Deleting 3 datafile summary objects.
Deleting 3 datafile summary objects.
2024-09-20 21:40:12,158 INFO clean_and_reparse.py::_delete_summaries:L90 :  Successfully deleted datafile summary objects.
Successfully deleted datafile summary objects.
2024-09-20 21:40:13,120 INFO clean_and_reparse.py::_delete_errors:L148 :  Deleting 33 parser errors.
Deleting 33 parser errors.
2024-09-20 21:40:21,911 INFO clean_and_reparse.py::_delete_errors:L150 :  Successfully deleted parser errors.
Successfully deleted parser errors.
2024-09-20 21:40:22,210 INFO clean_and_reparse.py::_delete_records:L113 :  Deleting 863643 records of type: <class 'tdpservice.search_indexes.models.tanf.TANF_T1'>.
Deleting 863643 records of type: <class 'tdpservice.search_indexes.models.tanf.TANF_T1'>.
Elastic document delete failed for type <class 'tdpservice.search_indexes.models.tanf.TANF_T1'>. The database and Elastic are INCONSISTENT! Restore the DB from the backup as soon as possible!
Traceback (most recent call last):
  File "/home/vcap/app/manage.py", line 31, in <module>
    main()
  File "/home/vcap/app/manage.py", line 27, in main
    execute_from_command_line(sys.argv)
  File "/home/vcap/deps/1/python/lib/python3.10/site-packages/django/core/management/__init__.py", line 419, in execute_from_command_line
    utility.execute()
  File "/home/vcap/deps/1/python/lib/python3.10/site-packages/django/core/management/__init__.py", line 413, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/vcap/deps/1/python/lib/python3.10/site-packages/django/core/management/base.py", line 354, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/vcap/deps/1/python/lib/python3.10/site-packages/django/core/management/base.py", line 398, in execute
    output = self.handle(*args, **options)
  File "/home/vcap/app/tdpservice/search_indexes/management/commands/clean_and_reparse.py", line 352, in handle
    self._delete_associated_models(meta_model, file_ids, new_indices, log_context)
  File "/home/vcap/app/tdpservice/search_indexes/management/commands/clean_and_reparse.py", line 168, in _delete_associated_models
    num_deleted = self._delete_records(file_ids, new_indices, log_context)
  File "/home/vcap/app/tdpservice/search_indexes/management/commands/clean_and_reparse.py", line 128, in _delete_records
    raise e
  File "/home/vcap/app/tdpservice/search_indexes/management/commands/clean_and_reparse.py", line 121, in _delete_records
    doc().update(page.object_list, refresh=True, action='delete')
  File "/home/vcap/deps/1/python/lib/python3.10/site-packages/django_elasticsearch_dsl/documents.py", line 238, in update
    return self._bulk(
  File "/home/vcap/deps/1/python/lib/python3.10/site-packages/django_elasticsearch_dsl/documents.py", line 215, in _bulk
    return self.bulk(*args, **kwargs)
  File "/home/vcap/deps/1/python/lib/python3.10/site-packages/django_elasticsearch_dsl/documents.py", line 164, in bulk
    response = bulk(client=self._get_connection(), actions=actions, **kwargs)
  File "/home/vcap/deps/1/python/lib/python3.10/site-packages/elasticsearch/helpers/actions.py", line 410, in bulk
    for ok, item in streaming_bulk(
  File "/home/vcap/deps/1/python/lib/python3.10/site-packages/elasticsearch/helpers/actions.py", line 329, in streaming_bulk
    for data, (ok, info) in zip(
  File "/home/vcap/deps/1/python/lib/python3.10/site-packages/elasticsearch/helpers/actions.py", line 256, in _process_bulk_chunk
    for item in gen:
  File "/home/vcap/deps/1/python/lib/python3.10/site-packages/elasticsearch/helpers/actions.py", line 187, in _process_bulk_chunk_success
    raise BulkIndexError("%i document(s) failed to index." % len(errors), errors)
elasticsearch.helpers.errors.BulkIndexError: ('4 document(s) failed to index.', [{'delete': {'_index': 'tdp-backend-qasp_tanf_t1_submissions_2024-08-22_13.00.26', '_type': '_doc', '_id': '00195471-f964-4e67-8aa2-2ad2fa4a15e0', '_version': 1, 'result': 'not_found', 'forced_refresh': True, '_shards': {'total': 1, 'successful': 1, 'failed': 0}, '_seq_no': 874432, '_primary_term': 1, 'status': 404}}, {'delete': {'_index': 'tdp-backend-qasp_tanf_t1_submissions_2024-08-22_13.00.26', '_type': '_doc', '_id': '00198e91-d333-42dd-8f2f-367ad7a87114', '_version': 1, 'result': 'not_found', 'forced_refresh': True, '_shards': {'total': 1, 'successful': 1, 'failed': 0}, '_seq_no': 874437, '_primary_term': 1, 'status': 404}}, {'delete': {'_index': 'tdp-backend-qasp_tanf_t1_submissions_2024-08-22_13.00.26', '_type': '_doc', '_id': '001f3921-13ed-427d-a95a-7534ef3c7b4f', '_version': 1, 'result': 'not_found', 'forced_refresh': True, '_shards': {'total': 1, 'successful': 1, 'failed': 0}, '_seq_no': 874516, '_primary_term': 1, 'status': 404}}, {'delete': {'_index': 'tdp-backend-qasp_tanf_t1_submissions_2024-08-22_13.00.26', '_type': '_doc', '_id': '00207bbc-4b67-4666-88ab-1f4dada0fb70', '_version': 1, 'result': 'not_found', 'forced_refresh': True, '_shards': {'total': 1, 'successful': 1, 'failed': 0}, '_seq_no': 874534, '_primary_term': 1, 'status': 404}}])
vcap@:~$
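
The error list in the traceback above consists entirely of delete actions that came back with status 404 and result `not_found`. As a minimal sketch, assuming the error items have the shape printed in that log, the helper below separates such "document already missing" results from any other failures. `split_delete_errors` is a hypothetical name, not part of this repo or of the Elasticsearch client, and whether a missing document can safely be ignored during a reparse is an assumption the team would need to confirm.

```python
from typing import Any, Dict, List, Tuple

BulkItem = Dict[str, Dict[str, Any]]


def split_delete_errors(errors: List[BulkItem]) -> Tuple[List[str], List[BulkItem]]:
    """Split bulk error items into already-missing document ids and other failures."""
    missing_ids: List[str] = []
    other_failures: List[BulkItem] = []
    for item in errors:
        result = item.get("delete")
        if (
            result is not None
            and result.get("status") == 404
            and result.get("result") == "not_found"
        ):
            # The document was not in the index when the delete was attempted.
            missing_ids.append(result["_id"])
        else:
            other_failures.append(item)
    return missing_ids, other_failures


if __name__ == "__main__":
    # Two items trimmed from the log above, plus one made-up failure for contrast.
    sample = [
        {"delete": {"_id": "00195471-f964-4e67-8aa2-2ad2fa4a15e0",
                    "result": "not_found", "status": 404}},
        {"delete": {"_id": "00198e91-d333-42dd-8f2f-367ad7a87114",
                    "result": "not_found", "status": 404}},
        {"delete": {"_id": "made-up-id", "status": 500}},
    ]
    missing, failures = split_delete_errors(sample)
    print(len(missing), len(failures))
```

A caller could log the missing ids and re-raise only when other failures remain.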
ADPennington commented 2 months ago

temporarily blocked

testing re-parsing command again for FY23Q1 files. started at 9:18a ET...

temporarily blocked on testing. the command froze after attempting to delete approximately 1.5 million TANF_T3 records, and the RDS instance is now down.

per @elipe17, and with the help of cloudgov support, the dev environment rds service instance was upgraded to medium-gp-psql with storage_type=gp3 to support qasp review. Will re-try the test today.

Screenshot 2024-09-26 160312

ADPennington commented 2 months ago

QASP review update:

this ticket is not blocked, but we are standing by for a response from the cloudgov support team on whether there are cost implications for us if we implement the following RDS service changes:

  • Dev: from micro-psql to medium-gp-psql
  • Staging: from micro-psql to medium-gp-psql
  • Prod: from medium-psql to medium-gp-psql or large-gp-psql

https://cloud.gov/docs/services/relational-database/

another consideration (if there are cost implications) is #3106 or #3108, which will give us more control over how many records are re-parsed at a time.

we received confirmation that, at least for now, there are no extra costs for the upgrade. @elipe17 should we proceed with making the terraform changes to staging and prod in this ticket?

elipe17 commented 2 months ago

> QASP review update: this ticket is not blocked, but we are standing by for a response from cloudgov support team re: if there are cost implications for us if we implement the following rds service changes:
>
> • Dev: from micro-psql to medium-gp-psql
> • Staging: from micro-psql to medium-gp-psql
> • Prod: from medium-psql to medium-gp-psql or large-gp-psql
>
> https://cloud.gov/docs/services/relational-database/ another consideration (if there are cost implications) is #3106 or #3108, which will give us more control over how many records are re-parsed at a time.
>
> we received confirmation that, at least for now, there are no extra costs for the upgrade. @elipe17 should we proceed with making the terraform changes to staging and prod in this ticket?

@ADPennington I think we should. If/when cost comes into the equation, we can change the TF again to mitigate that. Since we don't have to take any manual steps to change the service plan, this is easy to adjust as our needs change. I will go ahead and make the changes. Note that the tests are going to fail on this after I make the changes, until this PR merges.