raft-tech / TANF-app

Repo for development of a new TANF Data Reporting System

TANF Data Parsing #1101

Closed andrew-jameson closed 6 months ago

andrew-jameson commented 3 years ago

Description: To support #86, #84, and #88, we will need a data parser for the TANF data being uploaded to TDP.

Acceptance Criteria:

Tasks:

Notes:

FILE SUBMISSION TYPES (and parsing engines needed): The following 4 data file types will each need to be parsed and have parsing errors associated with them (a hypothetical dispatch sketch follows the list).

Some states submit both TANF and SSP data, so we need a dropdown for users to select the data type they are submitting.

  1. Active Case Data – Micro data (case- and person-level)

    • TANF – all states and territories

    • Tribal TANF – Only 75 tribes as of FY20 (subject to change)

    • SSP – 16 states and territories as of FY20 (subject to change)

  2. Closed Case Data – Micro data (case- and person-level)

    • parsing rules the same for TANF and Tribal TANF
  3. Aggregate Data – Aggregate data by caseload composition and other characteristics (covers full caseload)

    • parsing rules the same for TANF and Tribal TANF
  4. Stratum Data – Aggregate data by strata. Represent the various groups and ways to use the sample data (covers full caseload).

    • Only sample states submit this (22 of them, as documented in the STT attributes sheet). Some states submit a sample of caseload data for Sections 1 and 2; these are called 'sample states'. If a state submits sample data, it is also required to submit stratum data.
    • States that submit universe data do NOT have to submit stratum data.
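
To make the per-section parsing engines concrete, here is a minimal dispatch sketch. It is illustrative only: the names (`ProgramType`, `Section`, `PARSERS`, `parse_active_case`) are hypothetical, not TDP's actual code.

```python
from enum import Enum
from typing import Callable, Iterable

class ProgramType(Enum):
    # the dropdown mentioned above would select one of these
    TANF = "TANF"
    TRIBAL_TANF = "Tribal TANF"
    SSP = "SSP"

class Section(Enum):
    ACTIVE_CASE = 1   # micro data (case- and person-level)
    CLOSED_CASE = 2   # micro data (case- and person-level)
    AGGREGATE = 3     # aggregate data covering the full caseload
    STRATUM = 4       # aggregate data by strata (sample states only)

def parse_active_case(lines: Iterable[str]) -> list[dict]:
    # placeholder: a real parser would split fixed-width records here
    return [{"raw": line} for line in lines]

# one parsing engine per section; TANF and Tribal TANF share
# parsing rules for sections 2-4 per the list above
PARSERS: dict[Section, Callable[[Iterable[str]], list[dict]]] = {
    Section.ACTIVE_CASE: parse_active_case,
    # Section.CLOSED_CASE: parse_closed_case, etc.
}

def parse_file(section: Section, lines: Iterable[str]) -> list[dict]:
    parser = PARSERS.get(section)
    if parser is None:
        raise ValueError(f"no parser registered for {section}")
    return parser(lines)
```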

TECHNICAL ERROR TYPES:

Pre-parsing errors -- These types of errors are considered to be violations of the expected record layout, so the records are not "eligible" to be parsed or validated. See existing reference material (in UX murals and parsing notebooks) about what these errors are and how to detect them.
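
As a concrete illustration of this category, below is a minimal sketch of a layout check, assuming (as the transmission file layouts suggest) that a file begins with a HEADER record and ends with a TRAILER record. The authoritative rules are in the UX murals and parsing notebooks referenced above.

```python
def preparse_errors(lines: list[str]) -> list[str]:
    """Return layout violations that make a file ineligible for parsing."""
    errors = []
    if not lines:
        return ["file is empty"]
    if not lines[0].startswith("HEADER"):
        errors.append("first record is not a HEADER record")
    if not lines[-1].startswith("TRAILER"):
        errors.append("last record is not a TRAILER record")
    return errors
```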

Post-parsing errors -- See descriptions below.

Supporting Documentation: Spec Overview page for Parsing

Logic for parsing and file formats, provided by OFA here:

Example Data for File upload formats:

Research and Design Files:

  • Parsing Error Blocker Analysis 06292021: https://app.mural.co/t/raft2792/m/raft2792/1624455269095/a03c7a41f537ee50530451e7cb2a26fa11d2c9e6?sender=u64c7132cff9878e9eb088109
  • Parsing Blocker Error Communication Iteration Workshop 07142021: https://app.mural.co/t/raft2792/m/raft2792/1625859924911/4c37f4b0377d433d509dd5555a8c04086c16d08f?sender=u64c7132cff9878e9eb088109

Open Questions:

amilash commented 3 years ago

@lfrohlich Questions for Product about parsing. Is our goal with parsing data in TDP to:

a. Allow users to submit files even if there are errors, accept the rows that don't have issues, and then have them come back to fix and resubmit only the error rows? or
b. Not allow a submission to happen or be complete until all errors are resolved?

If it's A, we could have a submission status such as 'submitted with errors' or 'submitted but incomplete'; the file only gets fully processed once the errors are fixed. If it's B, we need fewer submission statuses: we just reject the submission with a list of errors they need to fix.

Currently, in TDRS, my understanding is that we allow STTs to submit with errors, OFA staff manually message them with a list of errors, and they have 45 days to submit a data package that is error-free.

It seems to me that A is ideal from a user-experience standpoint. We allow them to submit -> the data is parsed and given a submission status -> we give them errors to fix and resubmit -> they either resubmit a full package or a partial one.

cc @dk-ui , @ADPennington
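
To make the two options concrete, a minimal status sketch under option A might look like the following. The status names and function are hypothetical, just to frame the decision, not a proposed implementation.

```python
from enum import Enum

class SubmissionStatus(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"                            # parsed, no errors
    SUBMITTED_WITH_ERRORS = "submitted with errors"  # option A: accept, flag rows
    REJECTED = "rejected"                            # option B: reject until clean

def status_for(file_errors: list[str], row_errors: list[str]) -> SubmissionStatus:
    # under option A, only file-level problems block the submission;
    # row-level errors are reported back for the STT to fix and resubmit
    if file_errors:
        return SubmissionStatus.REJECTED
    if row_errors:
        return SubmissionStatus.SUBMITTED_WITH_ERRORS
    return SubmissionStatus.ACCEPTED
```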

ADPennington commented 3 years ago

Some context + reflections:

  • Is there an existing exhaustive reference of errors?

We're working on it 😄. This is a bit tricky because our goal for TDP is to detect errors: within records, across related records within a section, across sections, and ultimately across records and sections over time. So the current reference material spans public and internal information stores, and the DIGIT team is focused on updating these and creating a comprehensive codebook of errors that draws from the references below:

Note that some of these errors are not as feasible to detect at the point of submission, and because these checks build on each other (and increase in complexity), we have developed a prioritization strategy for TDP implementation purposes. The priority list is included below, and a small validation sketch for categories 2 and 3 follows it. Also, notebooks that walk through the logic of detecting these errors will be made available to the team.

  1. Pre-parsing errors -- These types of errors are considered to be violations of the expected record layout, so the records are not "eligible" to be parsed or validated. See existing reference material (in UX murals and parsing notebooks) about what these errors are and how to detect them.
  2. Out-of-range value errors – These are based on the abovementioned instructions.
  3. Errors re: inconsistent values across data elements within a record – These are also based on the abovementioned instructions (e.g. If SSI recipient = yes, then SSI amount received > $0).
  4. Errors re: inconsistent values across related records within a section file – These errors are also based on the abovementioned instructions (e.g. for every family (T1) record for a given month, there is no evidence that at least one adult (T2) or child (T3) associated with the family's case (T1) is a TANF recipient).
  5. Errors re: inconsistent values across related sections of data – These errors are based on DIGIT-generated checks and some reference material included in the abovementioned feedback reports. Because sections of data can be submitted at different points in time, current thinking suggests that these errors would need to be checked against data from the database (e.g. total # of families reported in Section 1 > total # of families reported in Section 3).
  6. Errors re: inconsistent values across related records and/or sections over time -- Also based on DIGIT-generated checks; these would benefit from checks against data from the database (e.g. state did not submit enough case records to meet annual sample size requirements).
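
Here is the small sketch of categories 2 and 3 referenced above. The field names and codings (e.g. `ssi_recipient`, with 1 = yes / 2 = no) are illustrative assumptions, not the actual instruction-defined layouts.

```python
def validate_record(rec: dict) -> list[str]:
    errors = []
    # category 2: out-of-range value
    if rec.get("ssi_recipient") not in (1, 2):  # assumed coding: 1 = yes, 2 = no
        errors.append("ssi_recipient out of range")
    # category 3: inconsistent values across elements within a record
    if rec.get("ssi_recipient") == 1 and rec.get("ssi_amount", 0) <= 0:
        errors.append("SSI recipient = yes but SSI amount received is not > $0")
    return errors
```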

Yes. OFA analysts need access to all parsed data stored in the database -- including PII -- for routine business purposes. I believe the AWS RDS instances where parsed data will be stored are encrypted at rest by default.

  • Is our goal with parsing data in TDP to (a) allow users to submit files even if there are errors and have them fix and resubmit only the error rows, or (b) not allow a submission to be complete until all errors are resolved? ... Currently, in TDRS, my understanding is that we allow STTs to submit with errors, OFA staff manually message them with a list of errors, and they have 45 days to submit a data package that is error-free. ... It seems to me that A is ideal from a user-experience standpoint.

Correct. Sometimes errors do not get fixed, and OFA uses the available data reported. Note that some errors are more serious than others. There are very few scenarios where TDRS rejects entire files (these fall under the "pre-parsing blocker" category and are good candidates for option B). Ideally, STTs would correct all errors prior to submission, but historically, enforcing this led to a state where no data were coming in. This was addressed by distinguishing between more/less serious errors and rejecting only the records associated with more serious errors. The risk here is that those rejected records sometimes never get fixed. The DIGIT team is open to revisiting full file rejection (for pre-parsing + serious errors), but agrees that option A is consistent with current transmission parameters.
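
For concreteness, the policy described here could be sketched as the triage below, where only pre-parsing blockers reject the whole file and only records carrying serious errors are dropped. The `serious` flag and data shapes are hypothetical.

```python
def triage(preparse_errors: list[str], records: list[tuple[dict, list[dict]]]) -> dict:
    # pre-parsing blockers make the whole file ineligible (option B behavior)
    if preparse_errors:
        return {"rejected_file": True, "accepted": [], "rejected_records": []}
    accepted, rejected = [], []
    for record, errors in records:
        # keep records whose errors are all non-serious (option A behavior)
        if any(err.get("serious") for err in errors):
            rejected.append(record)
        else:
            accepted.append(record)
    return {"rejected_file": False, "accepted": accepted, "rejected_records": rejected}
```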

ADPennington commented 3 years ago

@abottoms-coder I made light revisions to the epic summary, mostly to add clarifications and to include links to SSP materials.


Wanted to capture here too: as discussed during the 8/3 backlog mtg, category 5 and 6 errors from the priority list above can be implemented after release 3. They are currently checked outside of TDRS, but we want to bring them into TDP to automate the generation of these reports and provide more timely feedback to STTs. cc: @reitermb

robgendron commented 6 months ago

All tickets associated with this epic are closed.