michplunkett / ucpd-incident-scraper

This code scrapes the UCPD Daily Incident page at a predetermined frequency and stores the incidents in a generic JSON datastore.
MIT License

Fix ambiguous location issue #49

Closed michplunkett closed 8 months ago

michplunkett commented 8 months ago

Describe your changes

Addresses invalid geocoding results for incident locations containing 'between', 'to', multiple 'and's, etc. This change should be the final one needed to return valid addresses for 99.4% of incidents.
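To illustrate the kind of rewriting described above, here is a minimal sketch of how an ambiguous location string might be split into simpler strings before geocoding. It is not taken from this repository: the function name `candidate_queries`, the regular expressions, and the choice to turn "X between A and B" into two intersections are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; the project's real parsing rules may differ.
_BETWEEN = re.compile(
    r"^(?P<street>.+?)\s+between\s+(?P<a>.+?)\s+and\s+(?P<b>.+)$",
    re.IGNORECASE,
)
_RANGE = re.compile(r"^(?P<start>.+?)\s+to\s+(?P<end>.+)$", re.IGNORECASE)


def candidate_queries(location: str) -> list[str]:
    """Return simpler strings to try, in order, against a geocoder."""
    text = " ".join(location.split())  # collapse repeated whitespace

    match = _BETWEEN.match(text)
    if match:
        street, a, b = match.group("street", "a", "b")
        # "X between A and B" -> the two bounding intersections.
        return [f"{street} and {a}", f"{street} and {b}"]

    match = _RANGE.match(text)
    if match:
        # "A to B" -> each endpoint on its own.
        return [match.group("start"), match.group("end")]

    parts = re.split(r"\s+and\s+", text, flags=re.IGNORECASE)
    if len(parts) > 2:
        # More than one "and": keep only the first intersection.
        return [f"{parts[0]} and {parts[1]}"]

    # Already a plain address or a single intersection.
    return [text]
```

A caller could try each returned string in turn and keep the first one the geocoder accepts; a plain intersection such as "E. 55th St. and S. Lake Park Ave." passes through unchanged.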

Checklist before requesting a review

(ucpd-incident-scraper-py3.11) michaelp@MacBook-Air-18 ucpd-incident-scraper % make correct_location
python -m incident_scraper correct-location
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/michaelp/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
API queries_quota: 60
Unable to process and geocode address: 5100 S. Drexel Ave. to 5700 S. Hyde Park Blvd.
Unable to geocode address: 5100 S. Drexel Ave. to 5700 S. Hyde Park Blvd.
S. Dorchester Ave. between E. 51st St. and E. 52nd St. was geocoded to: 5120 South Dorchester Avenue, Chicago, IL 60615
S. Blackstone Ave. between E. 55th St. and E. 56th St. was geocoded to: 5520 South Blackstone Avenue, Chicago, IL 60637
...
E. 54th St. between S. Maryland Ave. and S. Drexel Ave. was geocoded to: 5401 S Maryland Ave, Chicago, IL 60615
S. Kimbark Ave. between E. 50th St. and E. 51st St. was geocoded to: 5020 South Kimbark Avenue, Chicago, IL 60615-2922
E. 55th St. and S. Lake Park Ave. was geocoded to: 5500 South Lake Park Avenue, Chicago, IL 60637-1917
2458 of 16883 had their address updated.
2458 addresses were updated.
Waiting up to 5 seconds.
Sent all pending logs.
(ucpd-incident-scraper-py3.11) michaelp@MacBook-Air-18 ucpd-incident-scraper %
michplunkett commented 8 months ago

Updated XGBoost model as well:

python -m incident_scraper build-model
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/michaelp/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Accuracy Score: 0.707519351271655
Precision Score: 0.8793890449438202
Recall Score: 0.7726361252506556
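
For readers unfamiliar with the three scores in the log above, the following is a rough sketch of how accuracy, precision, and recall can be computed for a multi-class classifier. The log does not say which averaging the project uses, so the macro averaging here is an assumption, and `macro_scores` and the toy labels are hypothetical rather than taken from the repository.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label

    def ratio(num, den):
        # Treat an undefined ratio (no samples) as 0.0.
        return num / den if den else 0.0

    accuracy = ratio(sum(tp.values()), len(y_true))
    precision = sum(ratio(tp[l], tp[l] + fp[l]) for l in labels) / len(labels)
    recall = sum(ratio(tp[l], tp[l] + fn[l]) for l in labels) / len(labels)
    return accuracy, precision, recall


# Toy data for illustration only; not drawn from the incident dataset.
truth = ["Theft", "Theft", "Battery", "Information"]
guess = ["Theft", "Battery", "Battery", "Information"]
accuracy, precision, recall = macro_scores(truth, guess)
```

With macro averaging, each category counts equally regardless of how many incidents it contains, which is one reason precision and recall can differ noticeably from accuracy.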
michplunkett commented 8 months ago

Extra test:

(ucpd-incident-scraper-py3.11) michaelp@MacBook-Air-18 ucpd-incident-scraper % make three_days
python -m incident_scraper days-back 3
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/michaelp/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Beginning the UCPD Incident scraping process.
Finished with the UCPD Incident scraping process.
11 total incidents were scraped from the UCPD Incidents' site.
API queries_quota: 60
This incident has an insufficient number of keys: {}
1 of 11 contained malformed or voided information.
0 of 11 could not be processed by the GoogleMaps' Geocoder.
10 of 11 incidents were successfully processed.
Adding 10 of 11 incidents to the GCP Datastore.
Completed adding 10 of 11 incidents to the GCP Datastore.
0 of 1 'Information' incidents predicted into other categories.
1 of 11 incidents could NOT be added to the GCP Datastore.
Program shutting down, attempting to send 2 queued log entries to Cloud Logging...
Waiting up to 5 seconds.
Sent all pending logs.
(ucpd-incident-scraper-py3.11) michaelp@MacBook-Air-18 ucpd-incident-scraper %