Closed. elipe17 closed this pull request 3 months ago.
Attention: Patch coverage is 60.50955% with 62 lines in your changes missing coverage. Please review.
Project coverage is 91.07%. Comparing base (1166030) to head (2226a3e). Report is 1 commit behind head on develop.
I get a duplicate key error and lose all record data when trying to replicate the "double button click" (sequential execution) for which you removed the handling. Just want to highlight the risk there:
[2024-08-09 14:22:13,973: ERROR/ForkPoolWorker-12] Encountered Database exception in parser_task.py:
web-1 | duplicate key value violates unique constraint "parsers_datafilesummary_datafile_id_880a2f4d_uniq"
web-1 | DETAIL: Key (datafile_id)=(4) already exists.
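For reference, a minimal sketch of one defensive pattern that avoids this IntegrityError, assuming a DataFileSummary model whose datafile FK is unique (which the constraint name suggests); the import path is an assumption, not the PR's actual code:

from tdpservice.parsers.models import DataFileSummary  # assumed import path

def summary_for(datafile):
    # get_or_create fetches the existing row on a second (double-click)
    # invocation instead of issuing a second INSERT, so the unique
    # constraint on datafile_id is never violated.
    summary, _created = DataFileSummary.objects.get_or_create(datafile=datafile)
    return summary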
@elipe17 I am still not clear why we need a many-to-many relationship, and I would like to avoid them if possible. There are many things that can go wrong with them, and they can leave junk data in the DB. Maybe if you could elaborate more on why many-to-many is needed I can be convinced!
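For the sake of the discussion, here is roughly what the shape in question looks like in Django; everything except the reparse_meta_models field name (which appears in the notes further down) is an assumption, not the PR's actual models:

from django.db import models

class ReparseMeta(models.Model):
    created_at = models.DateTimeField(auto_now_add=True)
    timeout_at = models.DateTimeField(null=True, blank=True)

class DataFile(models.Model):
    # Many-to-many: one file can be swept up by many reparse runs and each
    # run covers many files. The hidden join table is where orphaned "junk"
    # rows can accumulate if either side is deleted carelessly.
    reparse_meta_models = models.ManyToManyField(
        ReparseMeta, related_name="datafiles", blank=True
    )

A plain ForeignKey would only work if a file could belong to at most one reparse run, which is presumably the trade-off being debated here.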
Both locally and on a11y, I am not getting a finished reparse run. I initially just used -a to gather the dozen or so files I had uploaded, then tried breaking it up by year, with the same results. I waited for logs to indicate no more parsing was happening before initiating the next run. Used task clean prior to building so it's completely fresh. a11y also has a fresh DB due to some issues. Will retry against the raft env.
2024-08-15 11:55:21 [2024-08-15 15:55:21,346: INFO/ForkPoolWorker-52] DataFile parsing started for file ADS.E2J.NDM1.TS01
2024-08-15 11:55:21 2024-08-15 15:55:21,426 DEBUG fields.py::parse_value:L47 : Field: 'tribe_code' at position: [14, 17) is empty.
2024-08-15 11:55:21 2024-08-15 15:55:21,426 DEBUG fields.py::parse_value:L47 : Field: 'tribe_code' at position: [14, 17) is empty.
2024-08-15 11:55:21 2024-08-15 15:55:21,426 DEBUG parse.py::parse_datafile:L46 : Datafile has encrypted fields: True.
2024-08-15 11:55:21 2024-08-15 15:55:21,426 DEBUG parse.py::parse_datafile:L47 : Datafile: {id: 21, filename: ADS.E2J.FTP1.TS06, STT: Alabama (01), S3 location: data_files/2023/Q1/1/Active Case Data/ADS.E2J.FTP1.TS06}, is Tribal: False.
2024-08-15 11:55:21 2024-08-15 15:55:21,426 DEBUG parse.py::parse_datafile:L51 : Program type: TAN, Section: A.
2024-08-15 11:55:21 2024-08-15 15:55:21,427 INFO parse.py::parse_datafile:L95 : Preparser Error -> Rpt Month Year is not valid: Submitted reporting year:2020, quarter:Q4 doesn't match file reporting year:2023, quarter:Q1.
2024-08-15 11:55:21 2024-08-15 15:55:21,427 DEBUG parse.py::bulk_create_errors:L155 : Bulk creating ParserErrors.
2024-08-15 11:55:21 2024-08-15 15:55:21,429 INFO parse.py::bulk_create_errors:L158 : Created 1/1 ParserErrors.
2024-08-15 11:55:21 2024-08-15 15:55:21,439 INFO parser_task.py::parse:L41 : Parsing finished for file -> {id: 21, filename: ADS.E2J.FTP1.TS06, STT: Alabama (01), S3 location: data_files/2023/Q1/1/Active Case Data/ADS.E2J.FTP1.TS06} with status Rejected and 1 errors.
2024-08-15 11:55:21 [2024-08-15 15:55:21,439: INFO/ForkPoolWorker-52] Parsing finished for file -> {id: 21, filename: ADS.E2J.FTP1.TS06, STT: Alabama (01), S3 location: data_files/2023/Q1/1/Active Case Data/ADS.E2J.FTP1.TS06} with status Rejected and 1 errors.
2024-08-15 11:55:21 [2024-08-15 15:55:21,441: INFO/ForkPoolWorker-52] Task tdpservice.scheduling.parser_task.parse[22642997-1f01-4f61-bc74-0cef780d0247] succeeded in 0.10376212500000292s: None
@andrew-jameson the code that handles tracking failed files (think S3 exception we don't catch) or files that exit parsing early due to cat1 errors is in the follow-on PR, since it is required for sequential execution and not general metadata tracking.
Usability change for sysadmins and developers: the Data Files page can now be filtered by a ReparseMeta model object.
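A sketch of how such a filter could be wired into the Django admin, assuming the reparse_meta_models field name from the notes below; the actual admin code in the PR may differ:

from django.contrib import admin
from tdpservice.data_files.models import DataFile  # assumed import path

@admin.register(DataFile)
class DataFileAdmin(admin.ModelAdmin):
    # Exposes a sidebar filter so admins can narrow the Data Files list to
    # the files touched by a single ReparseMeta (i.e. one reparse run).
    list_filter = ("reparse_meta_models",)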
for my own notes (run in a Django shell):
meta7 = ReparseMeta.objects.get(id=7)
datafiles = DataFile.objects.filter(reparse_meta_models=meta7)
# roughly equivalent via the reverse accessor, e.g. meta7.datafiles.all()
# (exact name depends on the M2M's related_name)
for d in datafiles:
    print("{}:{}".format(d.stt, d.fiscal_year))
# files associated with meta7:
Arkansas (05):2024 - Q1 (Oct - Dec)
Arkansas (05):2024 - Q1 (Oct - Dec)
Alabama (01):2024 - Q1 (Oct - Dec)
Chippewa-Cree Indians of the Rocky Boy's Reservation (043):2024 - Q1 (Oct - Dec)
Chippewa-Cree Indians of the Rocky Boy's Reservation (043):2024 - Q1 (Oct - Dec)
Chippewa-Cree Indians of the Rocky Boy's Reservation (043):2024 - Q1 (Oct - Dec)
Chippewa-Cree Indians of the Rocky Boy's Reservation (043):2024 - Q1 (Oct - Dec)
Florida (12):2024 - Q1 (Oct - Dec)
Florida (12):2024 - Q1 (Oct - Dec)
Alabama (01):2024 - Q1 (Oct - Dec)
Arkansas (05):2024 - Q1 (Oct - Dec)
Alabama (01):2024 - Q1 (Oct - Dec)
Per standup today, #3064 and #3065 work is reflected in this PR. I started testing this morning.
@elipe17 @andrew-jameson @jtimpe @raftmsohani I'm currently blocked on testing this PR in the qasp environment. I attempted to reparse this morning for FY2023 Q1 and the operation was killed after the backup was completed. Evidence below ⬇️
2024-08-24 13:54:44,539 INFO clean_and_reparse.py::__backup:L49 : Backup complete! Commencing clean and reparse.
Backup complete! Commencing clean and reparse.
Killed
I then tried another quarter, FY2023 Q2, and couldn't proceed:
vcap@fc474368-25ed-4bfd-51b7-c201:~$ python manage.py clean_and_reparse -y 2023 -q Q2
You have selected to reparse datafiles for FY 2023 and Q2. The reparsed files will NOT be stored in new indices and the old indices
These options will delete and reparse (20) datafiles.
Continue [y/n]? y
The latest ReparseMeta model's (ID: 2) timeout_at field is None. Cannot safely execute reparse, please fix manually.
Worth noting that FY23Q1 has a couple of large files that should generate a lot of errors, so I'd like to see how this operation performs before this runs in prod.
@ADPennington I updated the meta model in qasp so that you can continue testing. The Killed console output indicates to me that the process was killed for some reason. I can't go far enough back in the logs to determine exactly what happened.
@elipe17 latest test notes/questions below ⬇️ I didn't observe anything that needs to be addressed in this ticket; this is mostly for my SA.
- Is there a way to know which source file ID(s) are associated with the difference between the deleted/created counts? It looks like the record count is different after reparsing, which will sometimes be the case when validation is updated, but I imagine we'd also want to be able to investigate files to check whether something went wrong (see below):
- What's the difference between total # of records initial and # of records created? Is one capturing the number of records in the files vs. the number of records in the DB after reparsing?
- Are we replacing the records in the DB or adding new records? After reparsing FY23Q3, I see 6628 TANF T4s and 3314 "new" TANF T4s. I was expecting "new" TANF T4s == "all" TANF T4s for this fiscal period. I'm assuming this is because more than one version of the FY23Q3 file was subject to reparsing? (see below). If true, this is another good justification for why we want to control which versions get reparsed (i.e. most recent 😄)
- Mentioned this async too, so this is just for reference: it would be helpful for admins to know how to "fix manually" when they observe log entries like the following:
The latest ReparseMeta model's (ID: 2) timeout_at field is None. Cannot safely execute reparse, please fix manually.
@ADPennington, see my responses below :).
The record count is/can be different for files that have not been cat4 validated. Since records with cat4 errors don't get serialized to the DB, we can expect the "num records" fields to not always be a one-to-one match, since cat4 is relatively new. We can write a spike ticket to investigate the feasibility of tracking before and after record counts for files; this ticket might be a way for us to get that information. In the interim, I have also written this ticket, which adds some more useful fields. Specifically, tracking cat4 errors before and after the reparse will help illuminate whether it makes sense that the record counts have diverged.
The two fields Total num records initial and Total num records post indicate the total number of records in the DB before and after the reparse event. The Total num records deleted field indicates how many records this reparse event deleted from the DB, and Total num records created indicates how many records were re-created during the reparse event.
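Assuming field names that mirror those labels (an assumption on my part), the four counters should reconcile on a quiet system as follows:

# Illustrative sanity check only; field names are guessed from the labels
# above, and the identity assumes nothing else wrote records mid-reparse.
meta = ReparseMeta.objects.get(id=7)
assert meta.total_num_records_post == (
    meta.total_num_records_initial
    - meta.total_num_records_deleted
    + meta.total_num_records_created
)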
For reparsing, all records associated with the selected files are deleted from the DB and then recreated. No record duplication should be occurring.
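In sketch form, that flow is delete-then-recreate rather than upsert; the records related name is an assumption, though the tdpservice.scheduling.parser_task.parse task path does appear in the logs earlier in this thread:

from tdpservice.scheduling import parser_task  # per the task path in the logs

def clean_and_reparse(datafiles):
    for df in datafiles:
        # Drop every record previously parsed out of this file version...
        df.records.all().delete()
        # ...then re-run the Celery parse task against the stored file so
        # the rows are recreated from scratch, never duplicated.
        parser_task.parse.delay(df.pk)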
See the steps below to "fix manually":
from django.utils import timezone

latest = ReparseMeta.get_latest()
latest.timeout_at = timezone.now()
latest.save()
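These statements assume a Django shell session (python manage.py shell) with ReparseMeta already imported; setting timeout_at to the current time should let the sequential-execution check pass so a new reparse can proceed.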
Per an async conversation with @elipe17, #3096 is the ticket intended to capture more details about cat1 and cat4 errors in data file summaries. Linking just for reference to related ideas 😄
Summary of Changes
-a enforces new indices and that is the only time they are recreated.
How to Test
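For reference, the invocations exercised in this thread (flag semantics as described in the comments above):

python manage.py clean_and_reparse -a             # reparse all datafiles; -a enforces new indices
python manage.py clean_and_reparse -y 2023 -q Q2  # reparse one fiscal year/quarter into the existing indices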
Deliverables
More details on how deliverables herein are assessed are included here.
Deliverable 1: Accepted Features
Checklist of ACs:
- clean_and_reparse command
Deliverable 2: Tested Code
- Frontend coverage: (see CodeCov Report comment in PR)
- Backend coverage: (see CodeCov Report comment in PR)
Deliverable 3: Properly Styled Code
Deliverable 4: Accessible
- Did a review by iamjolly and ttran-hub using Accessibility Insights reveal any errors introduced in this PR?
Deliverable 5: Deployed
Deliverable 6: Documented
Deliverable 7: Secure
Deliverable 8: User Research
Research product(s) clearly articulate(s):