openjusticeok / ojoregex

A separate package for maintaining the regex patterns that we use in our data normalization pipeline.
https://openjusticeok.github.io/ojoregex/
GNU General Public License v3.0

Add more precleaning steps #11

Open andrewjbe opened 5 months ago

andrewjbe commented 5 months ago

Gonna put all the ones I encounter in here so I don't forget to do this later.

andrewjbe commented 5 months ago

Also might want to add a new var to note if the charge was amended?
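
Rough sketch of what that flag could look like, in Python just to illustrate the idea. It assumes amended charges actually carry the word "AMENDED" somewhere in the description, which hasn't been checked against the data, and `is_amended` is a made-up name, not something in ojoregex.

```python
import re

# Assumption: amended charges contain the literal word "AMENDED".
AMENDED = re.compile(r"\bAMENDED\b", re.IGNORECASE)


def is_amended(charge: str) -> bool:
    """Return True if the charge description mentions an amendment."""
    return bool(AMENDED.search(charge))
```

This only sets the flag; whether the word itself should also be stripped from the description is a separate call.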

andrewjbe commented 5 months ago

"(MUNICIPAL ARREST)" seems to be a common thing that we could chop off
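
Something like this could do the chopping. It's a Python sketch of the idea rather than the package's actual code, and `strip_municipal_arrest` is a hypothetical name.

```python
import re

# Matches "(MUNICIPAL ARREST)" with flexible spacing, in any letter case.
MUNICIPAL_ARREST = re.compile(r"\(\s*MUNICIPAL\s+ARREST\s*\)", re.IGNORECASE)


def strip_municipal_arrest(charge: str) -> str:
    """Remove the "(MUNICIPAL ARREST)" tag and tidy leftover whitespace."""
    cleaned = MUNICIPAL_ARREST.sub(" ", charge)
    # Collapse any runs of whitespace left behind by the removal.
    return re.sub(r"\s+", " ", cleaned).strip()
```

So `strip_municipal_arrest("PUBLIC INTOXICATION (MUNICIPAL ARREST)")` would give back `"PUBLIC INTOXICATION"`.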

andrewjbe commented 5 months ago

There are also some where they list multiple counts in the same row, e.g. "CT1-UNLAWFUL POSS MARIHUANA CT.2-OBSTRUCT AN OFFICE". Those rows should probably just be axed entirely IMO
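
One possible way to spot those rows, sketched in Python. It assumes count markers look like "CT1", "CT.2", or "COUNT 3" (only the first two forms appear in the example above; the "COUNT" form is a guess), and `has_multiple_counts` is a hypothetical name.

```python
import re

# A count marker: "CT" or "COUNT", an optional period, optional spaces, then digits.
# The leading \b keeps words that merely end in "CT" (e.g. "OBSTRUCT") from matching.
COUNT_MARKER = re.compile(r"\b(?:CT|COUNT)\.?\s*\d+", re.IGNORECASE)


def has_multiple_counts(charge: str) -> bool:
    """Return True if the description lists two or more count markers."""
    return len(COUNT_MARKER.findall(charge)) >= 2
```

Rows where this returns True could then be dropped before the regex flags run; rows with a single "CT1-" prefix are left alone.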

andrewjbe commented 5 months ago

The ones with "in violation of __" with all the numbers are tripping up some of the numeric-based flags like "Speeding", which looks for things like "X in Y"

EDIT: I fixed this by editing the speeding flag

andrewjbe commented 5 months ago

Could also just get rid of all the stupid special characters like ÿ
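
A blunt version of that, sketched in Python: replace anything outside printable ASCII with a space, then collapse the whitespace. Note the assumption that no non-ASCII character in the charge text is worth keeping; `strip_special_chars` is a hypothetical name.

```python
import re

# Anything outside the printable ASCII range (space through tilde).
NON_PRINTABLE_ASCII = re.compile(r"[^\x20-\x7E]")


def strip_special_chars(charge: str) -> str:
    """Replace non-ASCII and control characters with spaces, then tidy whitespace."""
    cleaned = NON_PRINTABLE_ASCII.sub(" ", charge)
    return re.sub(r"\s+", " ", cleaned).strip()
```

Swapping in a space instead of deleting outright keeps words from getting glued together when the stray character was standing in for a separator.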