Notebooks and Python scripts to combine & integrate LLEAD data. All generated data are kept in Wrgl.
To contribute, you must use the following tools:

```bash
# install all related packages
pip install -r requirements.txt
# pull raw data input with dvc
dvc checkout
# initialize the wrgl repo
wrgl init
# process everything
make
# check whether the output matches the schema from data/datavalid.yml
python -m datavalid --dir data
```
Every data integration workflow, including this one, has these steps:

- Standardization & cleaning: source values are standardized and cleaned up. Some columns (e.g. `full_name`, `full_address`) hold multiple pieces of information and don't have a clearly delineated structure; for each value, a set of possible tags should be applied to the characters, which can then become the basis to segment this column into multiple output columns (a toy sketch of this idea follows the list). Other columns can be derived directly from existing columns (e.g. `middle_initial` from `middle_name`).
- Data matching: records that refer to the same person are linked across datasets using the datamatch library (see the match scripts below).
- Data fusion: the cleaned and matched datasets are combined and rearranged into the final per-agency and combined output files (see the fuse scripts below).
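The tagging idea can be illustrated with a toy example. This is not the repository's actual implementation; the tags, tokenization, and column names below are assumptions made purely for illustration:

```python
import pandas as pd

SUFFIXES = {"jr", "sr", "ii", "iii", "iv", "v"}

def split_full_name(value: str) -> pd.Series:
    """Tag each token of a 'Last, First Middle Suffix' value, then use the
    tags to segment it into separate output columns."""
    last, _, rest = value.partition(",")
    out = {"last_name": last.strip(), "first_name": "", "middle_name": "", "suffix": ""}
    for i, token in enumerate(rest.split()):
        if token.lower().strip(".") in SUFFIXES:
            tag = "suffix"
        elif i == 0:
            tag = "first_name"
        else:
            tag = "middle_name"
        out[tag] = (out[tag] + " " + token).strip()
    return pd.Series(out)

df = pd.DataFrame({"full_name": ["Doe, John Allen Jr."]})
df[["last_name", "first_name", "middle_name", "suffix"]] = df["full_name"].apply(split_full_name)
```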
This integration pipeline produces the following kinds of data for each police agency:
- Personnel data: `data/fuse/per_{agency}.csv`.
- Allegation (complaint) data: `data/fuse/com_{agency}.csv`.
- Event data: `data/fuse/event_{agency}.csv`.
- Use-of-force data: `data/fuse/uof_{agency}.csv`.
- Stop-and-search data: `data/fuse/sas_{agency}.csv`.

The last step of the pipeline combines the data files from all agencies into one file for each type (a minimal illustration follows this list):
- `data/fuse/personnel.csv`
- `data/fuse/allegation.csv`
- `data/fuse/event.csv`
- `data/fuse/use_of_force.csv`
- `data/fuse/stop_and_search.csv`
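Conceptually, this combining step is a concatenation of the per-agency files of each type. The snippet below is only a minimal illustration of that idea with pandas, not the pipeline's actual code (the real work is driven by the fuse scripts and Make):

```python
import glob
import pandas as pd

# concatenate every per-agency personnel file into the combined personnel.csv
frames = [pd.read_csv(path) for path in sorted(glob.glob("data/fuse/per_*.csv"))]
pd.concat(frames, ignore_index=True).to_csv("data/fuse/personnel.csv", index=False)
```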
See data/datavalid.yml for more details regarding the schema.
Keep all raw data in the `data/raw` folder. Run `scripts/dvc_add.sh` to keep track of them in DVC.

Save exploration notebooks to the `notebooks` folder with a distinct name that should at least include the name of the dataset that you were exploring.

Write cleaning scripts in the `clean` folder, which do what is outlined in the "Standardization & cleaning" step in the data integration principles section. There are some rules for writing clean scripts (a sketch follows the list):
- Read input files with `deba.data`.
- Save output files to the `data/clean` folder using `deba.data`.
- `pandas.DataFrame.pipe` is the preferred way to join the steps together.
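A clean script following these rules might look roughly like the sketch below. Treat it as a hedged example: the dataset names are made up, and the exact `deba.data` conventions should be checked against the existing scripts in `clean/`:

```python
import deba
import pandas as pd

def standardize_names(df: pd.DataFrame) -> pd.DataFrame:
    # strip and lower-case name columns
    for col in ["first_name", "last_name", "middle_name"]:
        df[col] = df[col].str.strip().str.lower()
    return df

def extract_middle_initial(df: pd.DataFrame) -> pd.DataFrame:
    # derive middle_initial from middle_name
    df["middle_initial"] = df["middle_name"].fillna("").str[:1]
    return df

if __name__ == "__main__":
    df = (
        pd.read_csv(deba.data("raw/example_agency/pprr_example_agency_2020.csv"))
        .pipe(standardize_names)
        .pipe(extract_middle_initial)
    )
    df.to_csv(deba.data("clean/pprr_example_agency_2020.csv"), index=False)
```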
Write match scripts in the `match` folder, which do the "Data matching" step in the data integration principles section. We use the datamatch library, which not only facilitates record linkage but also data deduplication. Datamatch does not use machine learning but relies on a simple threshold-based algorithm. Still, it is very flexible in what it can do and has the added benefits of being easy to understand and running very fast. Match scripts should follow most of the rules for clean scripts with a few additional rules (a sketch follows the list):
- Save an Excel file showing the matched records to the `data/match` folder, with the name in this format: `{agency}_{source_a}_v_{source_b}.xlsx`. For example, `new_orleans_harbor_pd_cprr_2020_v_pprr_2020.xlsx` shows matched records between the New Orleans Harbor PD CPRR 2020 and New Orleans Harbor PD PPRR 2020 datasets. See existing match scripts for examples.
- Save output files to the `data/match` folder using `deba.data`.
- The Excel files in the `data/match` folder show matched records in an easy-to-review format. Each has 3 sheets, including a decision sheet that records:
  - `match_threshold`: the cut-off similarity score that the matcher has decided on. Everything below this score is considered a non-match.
  - `number_of_matched_pairs`: the number of matched pairs using this threshold.
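A match script might look roughly like the following. Treat this as a sketch: the file names and threshold are made up, and the `ThresholdMatcher`/`ColumnsIndex`/`JaroWinklerSimilarity` usage should be checked against the datamatch documentation and the existing scripts in `match/`:

```python
import deba
import pandas as pd
from datamatch import ThresholdMatcher, ColumnsIndex, JaroWinklerSimilarity

if __name__ == "__main__":
    cprr = pd.read_csv(deba.data("clean/cprr_example_agency_2020.csv"))
    pprr = pd.read_csv(deba.data("clean/pprr_example_agency_2020.csv"))

    matcher = ThresholdMatcher(
        ColumnsIndex("last_name"),  # block on last_name to limit pairwise comparisons
        {
            "first_name": JaroWinklerSimilarity(),
            "middle_name": JaroWinklerSimilarity(),
        },
        cprr,
        pprr,
    )
    decision = 0.95  # cut-off score, picked after reviewing the Excel file below
    matcher.save_pairs_to_excel(
        deba.data("match/example_agency_cprr_2020_v_pprr_2020.xlsx"), decision
    )
    # index pairs at or above the threshold, e.g. to carry uid from pprr to cprr
    pairs = matcher.get_index_pairs_within_thresholds(lower_bound=decision)
```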
Write fuse scripts in the `fuse` folder, which do the "Data fusion" step in the data integration principles section. They follow most of the rules for clean scripts plus a few more rules (a sketch follows the list):

- Use the `lib.columns` package to validate and rearrange columns for each file type according to the schema in `data/datavalid.yml`.
- Save output files to the `data/fuse` folder using `deba.data`.
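A fuse script might look roughly like this sketch. The helper name `rearrange_personnel_columns` is an assumption, so check `lib/columns.py` for the actual functions; the file names here are made up:

```python
import deba
import pandas as pd
from lib.columns import rearrange_personnel_columns  # assumed helper name

if __name__ == "__main__":
    pprr = pd.read_csv(deba.data("match/pprr_example_agency_2020.csv"))
    # validate and order columns according to data/datavalid.yml
    per = rearrange_personnel_columns(pprr)
    per.to_csv(deba.data("fuse/per_example_agency.csv"), index=False)
```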
When your scripts are ready:

- Run `make`. If there's no problem then you will see new data files being generated.
- Run `python -m datavalid --dir data`, which will check and print out any errors found in the newly generated data.
- Each output file corresponds to a Wrgl branch; e.g. `event_baton_rouge_pd.csv` corresponds to branch `event-baton-rouge-pd`.
- Run `wrgl pull --all`. This pulls all the latest data changes for all branches.
- Run `wrgl diff --all` to see all the changes you made compared with all existing branches. Run `wrgl diff {branch}` to review detailed changes for a single branch.
- Commit with `wrgl commit --all "{commit message}"`. The commit message should be short and describe the changes that you made. Something similar to a Git commit message is good.
- Push with `wrgl push --all`.
This repository proposes a workflow and some utility scripts to help extract tables from PDFs with Azure Form Recognizer. There are 2 workflows:
You can simply use one of Form Recognizer's prebuilt models, specifically the Layout model. Just use the web-based Form Recognizer Studio to upload documents and extract data. If there are too many pages to extract manually, we can add the ability to automate extraction from prebuilt models to scripts/extract_tables_from_doc.py.
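If that automation is added, it might look roughly like the sketch below, assuming the azure-ai-formrecognizer Python SDK and the Form Recognizer endpoint/key variables described in the custom-model workflow; the input file name is made up:

```python
import os
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient
from dotenv import load_dotenv

load_dotenv()
client = DocumentAnalysisClient(
    os.environ["FORM_RECOGNIZER_ENDPOINT"],
    AzureKeyCredential(os.environ["FORM_RECOGNIZER_KEY"]),
)

# analyze a local PDF with the prebuilt Layout model and print each table as rows
with open("document.pdf", "rb") as f:  # hypothetical input file
    poller = client.begin_analyze_document("prebuilt-layout", document=f)
result = poller.result()

for table in result.tables:
    rows = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        rows[cell.row_index][cell.column_index] = cell.content
    print(rows)
```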
You need to train a custom model and use that model to extract data. Follow these steps:
Create a `.env` file (see python-dotenv to learn the syntax) at the root directory of this repository with the following environment variables:

- `FORM_RECOGNIZER_ENDPOINT`: follow these instructions to get the endpoint and key for a Form Recognizer resource.
- `FORM_RECOGNIZER_KEY`: see above.
- `BLOB_STORAGE_CONNECTION_STRING`: create an Azure storage account to store training data and follow this guide to get the connection string.
- `FORM_RECOGNIZER_CONTAINER`: create a container in the same storage account and put the name here. It will contain all training data.

Split the source PDF into individual pages with `scripts/split_pdf.py`. Upload those pages to a folder (preferably with the same name as the original PDF file) in the training container; a rough upload sketch follows these steps. Learn more here.
Use `scripts/edit_fr_table.py` to remove and insert rows, e.g.:

```bash
scripts/edit_fr_table.py st-tammany-booking-log-2020/0009.pdf charges insertRow 1 2
```

Once the custom model is trained, extract tables with `scripts/extract_tables_from_doc.py`, e.g.:

```bash
scripts/extract_tables_from_doc.py https://www.dropbox.com/s/9zmpmhrhtashq2o/st_tammany_booking_log_2020.pdf\?dl\=1 tables/st_tammany_booking_log_2020 --end-page 839 --model-id labeled_11 --batch-size 1
```
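The upload step mentioned above could be scripted along these lines, assuming the azure-storage-blob SDK and the variables from the `.env` file; the local folder and blob folder names are made up:

```python
import os
from pathlib import Path
from azure.storage.blob import BlobServiceClient
from dotenv import load_dotenv

load_dotenv()
service = BlobServiceClient.from_connection_string(os.environ["BLOB_STORAGE_CONNECTION_STRING"])
container = service.get_container_client(os.environ["FORM_RECOGNIZER_CONTAINER"])

# upload each split page under a folder named after the original PDF
folder = "st_tammany_booking_log_2020"
for page in sorted(Path("pages").glob("*.pdf")):  # hypothetical output dir of split_pdf.py
    with page.open("rb") as f:
        container.upload_blob(name=f"{folder}/{page.name}", data=f, overwrite=True)
```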
Common Wrgl commands:

```bash
# pull all branches
wrgl pull --all
# show changes for all
wrgl diff --all
# show in-depth changes for a single branch
wrgl diff event
# commit a single branch
wrgl commit event "my new commit"
# commit all branches
wrgl commit --all "my new commit"
# log-in with your wrglhub credentials
wrgl credentials authenticate https://hub.wrgl.co/api
# push all changes
wrgl push --all
# push a single branch
wrgl push event
```
Common DVC commands:

```bash
# pull all dvc-tracked files
dvc checkout
# authenticate DVC so that you can push new files
gcloud auth login
dvc remote modify --local gcs credentialpath ~/.config/gcloud/legacy_credentials/<your email>/adc.json
# update dvc after making changes
scripts/dvc_add.sh
# push file changes to google cloud storage
dvc push
```
As you might notice, we never have to declare script dependencies anywhere because Make can figure out the dependencies automatically. We do have to write the scripts in a particular way, but the benefits are well worth it. We also use md5 checksums of the scripts as recipe dependencies instead of the scripts themselves, which makes the processing resistant to superfluous file changes caused by Git. See Makefile and scripts/write_deps.py to learn more.