necaris / rsimapnotify


2024-01-26: Structured/Unstructured Pipeline Parallelization #3

Open grush-eiq opened 5 months ago

grush-eiq commented 5 months ago

tl;dr

Parallelize structured and unstructured data transformations within std-cdf-pipeline to improve runtime by removing the checkpoint bottleneck.

What problem are you trying to solve?

Currently, run_base_cdf is our longest-running op, followed by the derived notes ops. STD has strict SLAs with clients, and reducing data pipeline run time for batch processing will help us meet them.

Why should we solve it?

We should solve this because any improvement in pipeline run time helps us better meet our SLAs, especially with clients like Principal who cannot send us data until after their workday begins. This proposal is also relatively low effort and can be implemented for STD only, so it provides good value for the time spent. As seen in the screenshot below, running the notes in parallel allowed them to finish loading and complete the derived phase before the structured data finished.

[screenshot]

How do you propose to solve it?

I propose splitting cdf-pipeline into structured and unstructured paths. This will allow structured data and notes data to run in parallel, reducing the total run time of run_base_cdf. Most notes transformations do not rely on structured data transformations, and vice versa. Within the primitives and derived phases, the structured and unstructured data rarely overlap; only in the augment phase is the data combined.

The changes are fairly straightforward. We would split the current model folder for a client into structured and unstructured subfolders, add tags for each, split the run_base_cdf selector into run_structured_base_cdf and run_unstructured_base_cdf, and update downstream jobs to be children of the appropriate parent.

Below is the draft updated cdf-pipeline graph. After seeds and external sources are loaded, the pipeline splits into structured and unstructured before joining again downstream.

[screenshot: draft updated cdf-pipeline graph]
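To make the ordering concrete, here is a minimal sketch of the orchestration. The run_structured_base_cdf, run_unstructured_base_cdf, and run_augment entry points are hypothetical stand-ins for whatever the split selectors resolve to in our orchestrator, not our actual job definitions:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the split selectors; the real entry points depend
# on how run_base_cdf is defined in the orchestrator.
def run_structured_base_cdf(client: str) -> None:
    ...  # primitives -> derived for structured tables

def run_unstructured_base_cdf(client: str) -> None:
    ...  # primitives -> derived for notes tables

def run_augment(client: str) -> None:
    ...  # combines structured and unstructured outputs

def run_base_cdf(client: str) -> None:
    # Seeds and external sources are assumed to be loaded before this point.
    # The two branches share almost no models, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        structured = pool.submit(run_structured_base_cdf, client)
        unstructured = pool.submit(run_unstructured_base_cdf, client)
        # Surface any failure from either branch before augmenting.
        structured.result()
        unstructured.result()
    # Augment runs only once both branches have completed.
    run_augment(client)
```

The only synchronization point is the join: augment waits on both branches, so a failure in either branch blocks augmentation rather than producing a partially combined table.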

Comparison of new vs old process

The test run of the parallel pipeline took 14:12. I believe the final runtime would be slightly longer, as this run did not include a step to add back the claim id filtering on the note table.

[screenshot]

Compared to the old process:

[screenshot]

What other approaches did you consider?

We don't need to implement this proposal; there are no added risks to keeping the status quo. However, there is a proposal from the LTD team to split full_cdf into three jobs: primitives, derived, and augment. WC has also split into primitives and features.

What could go wrong?

The biggest challenge is handling the edge cases where the two types of data overlap. Most clients have filters on the note table, such as removing notes that don't have an associated claim or adding a note type based on the examiner table (reliance). An incomplete list of possible solutions:

  1. Add a step to augmentation to perform claim id filtering on the notes table and drop invalid rows from the note and note derived tables (sketched after this list).
  2. Create mapnote and map[derived_tables] in the parallelized phase, then do the filtering and create the standard tables.
  3. Add some structured tables to the unstructured mapping as required, on a client-by-client basis.
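
As an illustration of option 1, here is a minimal pandas sketch of what the augment-phase claim id filter might look like. The table and column names (note, claim, claim_id) are assumptions for illustration; the real keys vary by client.

```python
import pandas as pd

def filter_notes_to_valid_claims(
    note: pd.DataFrame,
    claim: pd.DataFrame,
    derived_note_tables: dict[str, pd.DataFrame],
) -> tuple[pd.DataFrame, dict[str, pd.DataFrame]]:
    """Drop note rows (and derived note rows) whose claim_id has no matching claim.

    Column names here are illustrative; clients may key on different fields.
    """
    valid_claim_ids = set(claim["claim_id"].dropna())
    filtered_note = note[note["claim_id"].isin(valid_claim_ids)]
    filtered_derived = {
        name: table[table["claim_id"].isin(valid_claim_ids)]
        for name, table in derived_note_tables.items()
    }
    return filtered_note, filtered_derived
```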
grush-eiq commented 5 months ago

Interesting process. It's a little more difficult to draft than a Word doc, but it does feel familiar since it's a similar process to writing PRs. I like that we'd be able to keep track of proposals, but I wonder if we should compare this to Confluence or other tools.

necaris commented 5 months ago

Another alternative -- using Confluence -- https://evolutioniq.atlassian.net/wiki/spaces/~71202006b3331a934e41ae848e3472fb2b93d4/pages/191201485/Lightweight+Proposals+Design+Documents.

IMHO GitHub is worth exploring because most engineers are using it ~all the time; Confluence is worth exploring because it's designed more for this sort of knowledge base; but Google Docs wouldn't be a significant enough departure from what we're doing now to be worth trying.