ipno-llead / processing

Processing repo for the Innocence Project New Orleans' Louisiana Law Enforcement Accountability Database

Enhancement/data validator update #513

Closed hungEA closed 1 year ago

hungEA commented 1 year ago

The following changes to data-validator are applied:

hungEA commented 1 year ago

@ayyubibrahimi Could you please check the process that generates the event data? There seems to be a violation of the unique constraint on event_uid.

ayyubibrahimi commented 1 year ago

@baoea there is a unique constraint built into the Builder class for the events module (lib/events.py, line 419), but I don't believe that you are accessing this module.

This process-data error is strange. In this processing repo, an error is raised when there are duplicate event_uid values at the fuse_agency stage, which prevents the processing repo from completing successfully. Additionally, I dropped any duplicate event_uid values (just in case) before the final event table is output, but this process-data failure persists. Can you say more about which iteration of the event table is being fetched prior to the error?

hungEA commented 1 year ago

@ayyubibrahimi We're certainly not using your lib/events.py. The error was raised in our validator, https://github.com/ipno-llead/processing/blob/enhancement/data-validator-update/data-validator/event_importer.py. The issue is that our validator did not add any extra data to your processed event.csv, but when we imported the event.csv from the fuse folder into a temporary database, it threw a unique constraint violation error.

  File "/runner/_work/processing/processing/data-validator/data_validator.py", line 86, in run_validator
    module.run(conn, df, be_cols)
  File "/runner/_work/processing/processing/data-validator/event_importer.py", line 167, in run
    cursor.copy_expert(
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "officers_event_event_uid_acf5b8ca_uniq"
DETAIL:  Key (event_uid)=(76a723fd1140db6abd7f0db0f53d43f2) already exists.
CONTEXT:  COPY officers_event, line 4617

So please double-check the data in fuse/event.csv for event_uid=76a723fd1140db6abd7f0db0f53d43f2, as we have no way to check it before it's pushed to WRGL.
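
A check along these lines could be run against the CSV before it is imported. This is only a minimal sketch: it assumes the file has a header row containing an event_uid column, and find_duplicate_keys is a hypothetical helper, not something that exists in this repo or in the validator. The sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def find_duplicate_keys(csv_file, key="event_uid"):
    """Return {key value: [line numbers]} for every value seen more than once.

    Hypothetical helper for illustration; not part of the processing repo.
    """
    seen = defaultdict(list)
    reader = csv.DictReader(csv_file)
    for row in reader:
        # reader.line_num counts the lines read so far, header included
        seen[row[key]].append(reader.line_num)
    return {uid: lines for uid, lines in seen.items() if len(lines) > 1}


# Tiny in-memory stand-in for fuse/event.csv (made-up rows and columns)
sample = io.StringIO(
    "event_uid,kind\n"
    "aaa,officer_hire\n"
    "bbb,officer_left\n"
    "aaa,officer_hire\n"
)
duplicates = find_duplicate_keys(sample)
print(duplicates)
```

For a real file, the same function could be called with `open("fuse/event.csv", newline="")`. An empty result would suggest the duplicate is introduced somewhere after the CSV is written, rather than in the file itself.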

ayyubibrahimi commented 1 year ago

Hi @hungEA, both on my local copy and on wrgl there is only one entry for event_uid=76a723fd1140db6abd7f0db0f53d43f2. I updated the fuse stage here so that duplicate event_uid values are dropped before the fuse/event.csv table is generated, but as we can see, the error is still thrown. That is why I asked where this table is being fetched from: the current code does not allow for duplicates.

hungEA commented 1 year ago

@ayyubibrahimi We have fixed our brady_uid constraint in the BE schema. Please have a look at the duplicated brady_uid error in the brady table.

github-actions[bot] commented 1 year ago

Review data changes at tx/9e0cf141-49db-4e68-a382-0c034eadb76d

When this PR is merged, this transaction will be applied.

github-actions[bot] commented 1 year ago

Transaction tx/9e0cf141-49db-4e68-a382-0c034eadb76d applied.