Fix: circular dependencies

ayyubibrahimi commented 1 year ago

Hi @pckhoi, as we discussed, I created a new stage to fix the circular dependency issue. I've called this stage consolidate and it currently only contains this one script. Do you see any issues? I consider this a temporary solution.

pckhoi commented 1 year ago

Why are we employing a temporary fix? Anything temporary by definition must contain a technical debt.

I will take a look over the weekend if you don't mind.

On Wed, Mar 22, 2023, 4:09 AM Ayyub Ibrahim @.***> wrote:

Hi @pckhoi https://github.com/pckhoi, as we discussed, I created a new stage to fix the circular dependency issue. I've called this stage consolidate and it currently only contains this https://github.com/ipno-llead/processing/blob/main/consolidate/consolidate.py one script. Do you see any issues? I consider this a temporary solution.

— Reply to this email directly, view it on GitHub https://github.com/ipno-llead/processing/issues/491, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAC2RBE4TJZR25CJWGPCYUTW5IKIXANCNFSM6AAAAAAWC7T23U . You are receiving this because you were mentioned.Message ID: @.***>

ayyubibrahimi commented 1 year ago

Temporary might be a poor word choice. It's temporary in that it's incomplete.

For now, I've only added the master tables (I.e., personnel.csv) to the consolidate stage because that is what the dev team needs for their import process. I don't believe that anything here should change.

To complete the consolidate stage, the agency specific tables (I.e., per_new_orleans_pd.csv) need to be consolidated as well because these are used as the embeds on LLEAD.

pckhoi commented 1 year ago

Understood. This consolidate stage seems to share the same purpose as the fuse stage. At least that's what it sounds like from the name. Maybe we need to introduce a new stage between match and fuse instead.

And jumping into an implementation right away might not produce the best result. Perhaps it is worth documenting where and how there are conflicts before even attempting a fix. Do you mind documenting those cases in an issue first?

On Wed, Mar 22, 2023, 7:19 AM Ayyub Ibrahim @.***> wrote:

Temporary might be a poor word choice. It's temporary in that it's incomplete.

For now, I've only added the master tables (I.e., personnel.csv) to the consolidate stage because that is what the dev team needs for their import process. I don't believe that anything here should change.

To complete the consolidate stage, the agency specific tables (I.e., per_new_orleans_pd.csv) need to be consolidated as well because these are used as the embeds on LLEAD.

— Reply to this email directly, view it on GitHub https://github.com/ipno-llead/processing/issues/491#issuecomment-1478755383, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAC2RBD45IQ3PG6WE6EPOYLW5JASPANCNFSM6AAAAAAWC7T23U . You are receiving this because you were mentioned.Message ID: @.***>

ayyubibrahimi commented 1 year ago

Sounds good.

ayyubibrahimi commented 1 year ago

I will document those cases for you to review this weekend @pckhoi . I just did a review based on your feedback. In my opinion, the consolidate stage addresses the issue appropriately.

In general, I do believe that we need to introduce stages beyond the fuse stage. For example, datamatch for clean/brady_iberia_da.csv would work better if it was matched with the output of fuse/personnel.csv

Because the dev team is soon wrapping up work for at least a few months, I would like to give them an update on where to find the final tables (consolidate directory vs fuse directory) ASAP.

If you agree, I will go ahead and tell them to continue working with the data in the consolidate directory

pckhoi commented 1 year ago

Sure Ayyub.

On Wed, Mar 22, 2023, 9:28 AM Ayyub Ibrahim @.***> wrote:

I will document those cases for you to review this weekend @pckhoi https://github.com/pckhoi . I just did a review based on your feedback. In my opinion, the consolidate stage addresses the issue appropriately, despite it teetering on redundant for the moment. That said, I don't believe that it is necessarily redundant because are instances in the pipeline where it would be better to use datamatch on output from the fuse (and consolidate) stage. With this in mind, I do believe that we are in need of introducing stages beyond the fuse stage.

Because the dev team is soon wrapping up work for at least a few months, I would like to give them an update on where to find the final tables ( consolidate directory vs fuse directory) ASAP.

If you agree, I will go ahead and tell them to continue working with the data in the consolidate directory

— Reply to this email directly, view it on GitHub https://github.com/ipno-llead/processing/issues/491#issuecomment-1478836223, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAC2RBCLCGSONX3PYDXXN33W5JPUXANCNFSM6AAAAAAWC7T23U . You are receiving this because you were mentioned.Message ID: @.***>

ayyubibrahimi commented 1 year ago

Hey Khoi,

Here is an outline of what the issue is:

This issue is all related to duplicate officer profiles on LLEAD.

Currently, at the fuse stage, in the fuse/post_officer_history.py script we deduplicate the personnel table. This causes the other tables generated in the fuse stage (fuse/allegation.csv, fuse/use_of_force.csv, stop_and_search.csv, etc) to become inconsistent with the corresponding fuse/personnel.csv table. To fix this issue, I introduced the consolidate stage where we are dropping rows where the uid is not present in the personnel table (maybe it would be a better idea to use datamatch first). The circular dependencies were the result of me attempting to consolidate these tables in the fuse stage.

As I currently choose thresholds at the match stage, it is common for a true positive to be interpreted as a false negative. This issue has been addressed by deduplicating the personnel table, where the chosen threshold seems to produce less false negatives.

The underlying problem, I believe, is that we need to introduce additional parameters at the match stage and/or re-think the match stage. For example, it may be more appropriate to fuse the tables first, deduplicate, and then match, i.e., match per_new_orleans_pd.csv with cprr_new_orleans_pd.csv

That said, while this may produce more true positives, we are likely to remain in a situation where we have false negatives, which is why I believe that it is necessary to consolidate the output of the fuse stage. Rather than dropping the uid not present in the personnel table, maybe it makes more sense to add the uid to the personnel table. My only concern is that this would re-introduce false negatives/create duplicate officer profiles on LLEAD. It is also possible that my concern regarding duplicate officer profiles is overblown.

pckhoi commented 1 year ago

Yes, one look at the fuse/post_officer_history.py file and I understood where the problem is. It is reading input from fuse stage and write output also to fuse stage. If we were to follow these simple rules then we'll never have problem with circular dependency:

A script must belong to a stage
A script must output to the stage that it belongs to
A script must read input from another stage
Stages have execution order. And this order must be upheld at all times. For example, match stage can never read from fuse stage.

Just from looking at fuse/post_officer_history.py alone, I think we need to introduce more than 1 new stages:

stateDiagram-v2
    [*] --> clean
    clean --> match
    match --> fuse_agency
    fuse_agency --> match_history
    match_history --> fuse
    fuse --> [*]

I don't want the fuse stage to lose its position as the final stage because you would be discarding most of its history. Also there will be many tables that you must migrate over to the new stage without there being any change. If you and the dev team have already agreed upon using the consolidate stage then I'll think of other names. But I think you understand the gist of it.

ayyubibrahimi commented 1 year ago

Thank you. I will propose the changes to the dev team.

pckhoi commented 1 year ago

@ayyubibrahimi feel free to ask me to do more, I have 5 more hours this month.

ayyubibrahimi commented 1 year ago

Awesome. I'll be sure to let you know. Thanks!

ipno-llead / processing

Fix: circular dependencies #491