openoakland / OakCrime-Decommissioned

Code supporting citizen analysis of crime in Oakland, CA

Capture OPD UCR statistics reporting from PDFs #40

Closed donnell794 closed 3 years ago

rbelew commented 5 years ago

There are large directories of summary statistics posted by OPD here. As OPD describes, these are "Uniform Crime Report" stats headed to the FBI. Reconciling these external-to-Oakland summary stats with the internal-to-Oakland daily reporting is a major goal for OakCrime.org.

Happily, these PDFs are also made available using Box, so the same API access should work.

Sadly, they're YAPF (Yet Another PDF Format) and so some serious parsing energy will be required.

shawnvarghese commented 5 years ago

I tried using multiple parsers to parse the PDFs, unfortunately without much success. PyPDF2 didn't seem to do much. PDFMiner was slightly better, but all the text came out in one block, which would make it difficult to extract the data. PDFQuery is built on top of PDFMiner, as I understand it. There seems to be a way to get an XML layout from the PDF and extract the text based on keywords, but that seemed very unreliable. I've had more success converting the PDF to a .tiff image and then using tesseract to convert the .tiff to text. ImageMagick can be used to convert PDFs to images, but it was crashing on my PC for some reason, so for now I used an online converter to generate the .tiff image. The output text was far more manageable to parse. I'll continue to explore other options to see if it's possible to use a one-step process and parse the PDF directly.
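For reference, the two-step route described above (PDF to .tiff, then tesseract) could be scripted roughly as follows. This is only a sketch under assumptions, not the parser actually used in this issue: it swaps in poppler's `pdftoppm` for the image conversion instead of ImageMagick or an online converter, it assumes both `pdftoppm` and the `tesseract` CLI are installed and on the PATH, and the function names are made up for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def pdf_to_tiff_cmd(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm command that renders each PDF page to a TIFF file."""
    # Assumption: poppler's pdftoppm stands in for ImageMagick here.
    return ["pdftoppm", "-tiff", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path):
    """Build the tesseract command that prints the recognized text to stdout."""
    # "--psm 6" asks tesseract to treat the page as one uniform block of text,
    # which may or may not suit the OPD report layout.
    return ["tesseract", str(image_path), "stdout", "--psm", "6"]


def ocr_pdf(pdf_path):
    """Return a list with the OCR text of each page of the PDF."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(pdf_to_tiff_cmd(pdf_path, prefix), check=True)
        pages = []
        # pdftoppm writes one numbered .tif per page next to the prefix.
        for tiff in sorted(Path(tmp).glob("page*.tif")):
            result = subprocess.run(
                ocr_cmd(tiff), check=True, capture_output=True, text=True
            )
            pages.append(result.stdout)
        return pages
```

Calling `ocr_pdf("report.pdf")` (a placeholder file name) would return one text string per page, which still needs to be parsed into rows of statistics.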

shawnvarghese commented 5 years ago

@rbelew The parser generates CSV and JSON files now. Should we change it to write to a database table instead, so that we can query it when we need to compare statistics?

rbelew commented 5 years ago

great idea! as we try to merge/compare with the incident data, having it available as database queries will be easier. and i think this corpus deserves its own interface! it's a pretty straightforward time series, without any mapping requirements, and of interest to people all by itself.
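To make the database idea concrete, here is a minimal sketch of loading the parser's CSV output into a table and reading it back as a time series. Everything specific in it is an assumption for illustration: the table name `ucr_stats`, the columns `period`, `category`, and `n_incidents`, and the use of SQLite are not taken from the OakCrime code or from the actual CSV layout.

```python
import csv
import io
import sqlite3


def load_ucr_csv(conn, csv_text):
    """Insert parsed UCR rows into a (hypothetical) ucr_stats table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ucr_stats (
               period      TEXT    NOT NULL,  -- e.g. '2019-01'
               category    TEXT    NOT NULL,  -- crime category label
               n_incidents INTEGER NOT NULL,
               PRIMARY KEY (period, category)
           )"""
    )
    rows = [
        (r["period"], r["category"], int(r["n_incidents"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    # Re-running the parser on the same report overwrites earlier rows.
    conn.executemany("INSERT OR REPLACE INTO ucr_stats VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def series(conn, category):
    """Return (period, n_incidents) pairs for one category, oldest first."""
    cur = conn.execute(
        "SELECT period, n_incidents FROM ucr_stats "
        "WHERE category = ? ORDER BY period",
        (category,),
    )
    return cur.fetchall()
```

With a connection from `sqlite3.connect(...)`, `series(conn, "Burglary")` would return one category's counts ordered by period. A similar query against the incident table is where the merge/compare step could start.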