Closed ayyubibrahimi closed 1 year ago
Why are we employing a temporary fix? Anything temporary by definition must contain a technical debt.
I will take a look over the weekend if you don't mind.
On Wed, Mar 22, 2023, 4:09 AM Ayyub Ibrahim @.***> wrote:
Hi @pckhoi https://github.com/pckhoi, as we discussed, I created a new stage to fix the circular dependency issue. I've called this stage consolidate and it currently only contains this https://github.com/ipno-llead/processing/blob/main/consolidate/consolidate.py one script. Do you see any issues? I consider this a temporary solution.
— Reply to this email directly, view it on GitHub https://github.com/ipno-llead/processing/issues/491, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAC2RBE4TJZR25CJWGPCYUTW5IKIXANCNFSM6AAAAAAWC7T23U . You are receiving this because you were mentioned.Message ID: @.***>
Temporary might be a poor word choice. It's temporary in that it's incomplete.
For now, I've only added the master tables (I.e., personnel.csv
) to the consolidate
stage because that is what the dev team needs for their import process. I don't believe that anything here should change.
To complete the consolidate
stage, the agency specific tables (I.e., per_new_orleans_pd.csv
) need to be consolidated as well because these are used as the embeds on LLEAD.
Understood. This consolidate stage seems to share the same purpose as the fuse stage. At least that's what it sounds like from the name. Maybe we need to introduce a new stage between match and fuse instead.
And jumping into an implementation right away might not produce the best result. Perhaps it is worth documenting where and how there are conflicts before even attempting a fix. Do you mind documenting those cases in an issue first?
On Wed, Mar 22, 2023, 7:19 AM Ayyub Ibrahim @.***> wrote:
Temporary might be a poor word choice. It's temporary in that it's incomplete.
For now, I've only added the master tables (I.e., personnel.csv) to the consolidate stage because that is what the dev team needs for their import process. I don't believe that anything here should change.
To complete the consolidate stage, the agency specific tables (I.e., per_new_orleans_pd.csv) need to be consolidated as well because these are used as the embeds on LLEAD.
— Reply to this email directly, view it on GitHub https://github.com/ipno-llead/processing/issues/491#issuecomment-1478755383, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAC2RBD45IQ3PG6WE6EPOYLW5JASPANCNFSM6AAAAAAWC7T23U . You are receiving this because you were mentioned.Message ID: @.***>
Sounds good.
I will document those cases for you to review this weekend @pckhoi . I just did a review based on your feedback. In my opinion, the consolidate
stage addresses the issue appropriately.
In general, I do believe that we need to introduce stages beyond the fuse
stage. For example, datamatch for clean/brady_iberia_da.csv
would work better if it was matched with the output of fuse/personnel.csv
Because the dev team is soon wrapping up work for at least a few months, I would like to give them an update on where to find the final tables (consolidate
directory vs fuse
directory) ASAP.
If you agree, I will go ahead and tell them to continue working with the data in the consolidate
directory
Sure Ayyub.
On Wed, Mar 22, 2023, 9:28 AM Ayyub Ibrahim @.***> wrote:
I will document those cases for you to review this weekend @pckhoi https://github.com/pckhoi . I just did a review based on your feedback. In my opinion, the consolidate stage addresses the issue appropriately, despite it teetering on redundant for the moment. That said, I don't believe that it is necessarily redundant because are instances in the pipeline where it would be better to use datamatch on output from the fuse (and consolidate) stage. With this in mind, I do believe that we are in need of introducing stages beyond the fuse stage.
Because the dev team is soon wrapping up work for at least a few months, I would like to give them an update on where to find the final tables ( consolidate directory vs fuse directory) ASAP.
If you agree, I will go ahead and tell them to continue working with the data in the consolidate directory
— Reply to this email directly, view it on GitHub https://github.com/ipno-llead/processing/issues/491#issuecomment-1478836223, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAC2RBCLCGSONX3PYDXXN33W5JPUXANCNFSM6AAAAAAWC7T23U . You are receiving this because you were mentioned.Message ID: @.***>
Hey Khoi,
Here is an outline of what the issue is:
This issue is all related to duplicate officer profiles on LLEAD.
Currently, at the fuse stage, in the fuse/post_officer_history.py
script we deduplicate the personnel
table. This causes the other tables generated in the fuse stage (fuse/allegation.csv
, fuse/use_of_force.csv
, stop_and_search.csv
, etc) to become inconsistent with the corresponding fuse/personnel.csv
table. To fix this issue, I introduced the consolidate stage where we are dropping rows where the uid
is not present in the personnel
table (maybe it would be a better idea to use datamatch
first). The circular dependencies were the result of me attempting to consolidate these tables in the fuse stage.
As I currently choose thresholds at the match
stage, it is common for a true positive to be interpreted as a false negative. This issue has been addressed by deduplicating the personnel
table, where the chosen threshold seems to produce less false negatives.
The underlying problem, I believe, is that we need to introduce additional parameters at the match
stage and/or re-think the match
stage. For example, it may be more appropriate to fuse the tables first, deduplicate, and then match, i.e., match per_new_orleans_pd.csv
with cprr_new_orleans_pd.csv
That said, while this may produce more true positives, we are likely to remain in a situation where we have false negatives, which is why I believe that it is necessary to consolidate the output of the fuse stage. Rather than dropping the uid
not present in the personnel table, maybe it makes more sense to add the uid
to the personnel table. My only concern is that this would re-introduce false negatives/create duplicate officer profiles on LLEAD. It is also possible that my concern regarding duplicate officer profiles is overblown.
Yes, one look at the fuse/post_officer_history.py
file and I understood where the problem is. It is reading input from fuse
stage and write output also to fuse
stage. If we were to follow these simple rules then we'll never have problem with circular dependency:
match
stage can never read from fuse
stage.Just from looking at fuse/post_officer_history.py
alone, I think we need to introduce more than 1 new stages:
stateDiagram-v2
[*] --> clean
clean --> match
match --> fuse_agency
fuse_agency --> match_history
match_history --> fuse
fuse --> [*]
I don't want the fuse stage to lose its position as the final stage because you would be discarding most of its history. Also there will be many tables that you must migrate over to the new stage without there being any change. If you and the dev team have already agreed upon using the consolidate stage then I'll think of other names. But I think you understand the gist of it.
Thank you. I will propose the changes to the dev team.
@ayyubibrahimi feel free to ask me to do more, I have 5 more hours this month.
Awesome. I'll be sure to let you know. Thanks!
Hi @pckhoi, as we discussed, I created a new stage to fix the circular dependency issue. I've called this stage
consolidate
and it currently only contains this one script. Do you see any issues? I consider this a temporary solution.