insight-lane / crash-model

Build a crash prediction modeling application that leverages multiple data sources to generate a set of dynamic predictions we can use to identify potential trouble spots and direct timely safety interventions.
https://insightlane.org
MIT License

Onboarding new city: Baltimore #121

Open terryf82 opened 6 years ago

terryf82 commented 6 years ago

@alicefeng I can't remember the specific issues you encountered when trying to run Baltimore through the pipeline. Is it working now?

It doesn't look as though OSM has a polygon for the city so it'll probably need to be handled through a separate approach like Brisbane.

alicefeng commented 6 years ago

@terryf82 Oh, this was the issue where the individual crash ids were alphanumeric rather than strictly numeric, which clashed with our data standards (at the time; not sure if we've modified the standards since then).

terryf82 commented 6 years ago

@alicefeng I've updated the crashes & concerns standards in the data_standards branch to allow for both string and numeric ids. Give it a run on Baltimore when you get a chance and let me know how it goes!

terryf82 commented 6 years ago

Hey @alicefeng the latest commits to the data_standards branch should allow you to get past the graph_from_place() problem that was preventing us from onboarding Baltimore.

Basically there's a new function there that queries the OSM geocoding API (Nominatim) for a polygon. If it finds one, it returns the polygon's position in the results, which is fed into graph_from_place() as which_result=x (sometimes the polygon isn't the first result). If there's no polygon for a city, we use graph_from_point() against the city lat+lng instead.
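For readers following along, the polygon check described above can be sketched roughly as follows. This is not the code from the data_standards branch. The function names `find_polygon_position` and `choose_graph_strategy` are hypothetical, the results are assumed to be Nominatim-style search results requested with polygon geometry (each carrying a `geojson` entry), and the position is assumed to be 1-based to match `which_result`.

```python
def find_polygon_position(results):
    """Return the 1-based position of the first result with polygon geometry.

    `results` is assumed to be a list of Nominatim-style dicts, each with an
    optional "geojson" entry holding a GeoJSON geometry. Returns None when no
    result is a Polygon or MultiPolygon.
    """
    for position, result in enumerate(results, start=1):
        geometry_type = result.get("geojson", {}).get("type")
        if geometry_type in ("Polygon", "MultiPolygon"):
            return position
    return None


def choose_graph_strategy(results, city_latlng):
    """Decide which graph-building call to use (hypothetical helper).

    Returns the name of the function to call plus the keyword arguments the
    thread describes: which_result for graph_from_place(), or the city
    lat/lng for graph_from_point().
    """
    position = find_polygon_position(results)
    if position is not None:
        return "graph_from_place", {"which_result": position}
    return "graph_from_point", {"center_point": city_latlng}


# Made-up example data: the first hit is a point, the second is the boundary.
sample_results = [
    {"display_name": "Example City (node)", "geojson": {"type": "Point"}},
    {"display_name": "Example City (boundary)", "geojson": {"type": "Polygon"}},
]
print(choose_graph_strategy(sample_results, (39.29, -76.61)))
print(choose_graph_strategy([], (39.29, -76.61)))
```

Keeping the decision separate from the network request makes the fallback easy to test without calling the geocoding API.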

I've been testing the Baltimore pipeline using crash data from https://data.maryland.gov/Public-Safety/MDTA-Accidents/rqid-652u (not sure if this is the same source you were using?) and even though the map is now built properly, it still breaks in train_model. I tried a few different config file setups (start_year, end_year etc.) but no luck. Mostly I hit this error (@bpben any thoughts?):

Training model...
Outputting to: /app/data/baltimore/processed/
Segment features included: ['width', 'lanes', 'hwy_type', 'osm_speed', 'oneway']
Traceback (most recent call last):
  File "/opt/conda/envs/boston-crash-model/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/boston-crash-model/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/app/src/models/train_model.py", line 206, in <module>
    crash_lags = format_crash_data(data_nonzero, 'crash', week, year)
  File "/app/src/models/model_utils.py", line 14, in format_crash_data
    target_idx = all_dates[(all_dates.year==target_year)&(all_dates.week==target_week)].index.values[0]
IndexError: index 0 is out of bounds for axis 0 with size 0
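The IndexError in the last frame means the year/week filter matched no rows, so taking element 0 of the empty result fails. One possible cause (an assumption, not confirmed in this thread) is that the target week and year fall outside the dates covered by the crash data. A minimal sketch of a guard that reports this more clearly, using plain (year, week) tuples instead of the project's DataFrame and a hypothetical `find_target_index` helper:

```python
def find_target_index(all_weeks, target_year, target_week):
    """Return the index of the first (year, week) entry matching the target.

    `all_weeks` is assumed to be a list of (year, week) tuples in date order.
    Raises ValueError with the available range instead of an opaque
    IndexError when the target week is not present.
    """
    matches = [
        i for i, (year, week) in enumerate(all_weeks)
        if year == target_year and week == target_week
    ]
    if not matches:
        if all_weeks:
            span = "{} to {}".format(all_weeks[0], all_weeks[-1])
        else:
            span = "no weeks at all"
        raise ValueError(
            "No data for year={}, week={}; data covers {}".format(
                target_year, target_week, span
            )
        )
    return matches[0]


# Made-up example: three consecutive weeks spanning a year boundary.
weeks = [(2017, 51), (2017, 52), (2018, 1)]
print(find_target_index(weeks, 2017, 52))
```

With a check like this, a mismatch between the config and the data would surface as a message naming both the requested week and the covered range.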

alicefeng commented 6 years ago

@terryf82 Awesome about the function for checking if there's a polygon. And yes, that looks to be the dataset I was using.

alicefeng commented 6 years ago

I just tried running the updated data_standards branch on Baltimore and it failed with the same error @terryf82 pasted above.

@bpben

alicefeng commented 6 years ago

I fixed an error on my end and tried rerunning the pipeline for Baltimore. It's still failing at the model training script, though this time I got a different error from before:

  File "/opt/conda/envs/boston-crash-model/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 268, in _binary_roc_auc_score
    raise ValueError("Only one class present in y_true. ROC AUC score "
ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

bpben commented 6 years ago

Either there are no crashes or very few crashes. I just took a look at the canonical dataset you sent me and there are zero crashes in it. Maybe send me the original crash dataset or just the full Baltimore folder?
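As a rough illustration of why too few crashes triggers the ROC AUC error: the score needs both crash and non-crash labels in the evaluation set. The sketch below uses hypothetical helpers (`label_counts`, `can_score_roc_auc`) on a plain list of 0/1 labels; it is not part of the project's train_model script.

```python
from collections import Counter


def label_counts(y_true):
    """Count how many examples fall into each class label."""
    return dict(Counter(y_true))


def can_score_roc_auc(y_true):
    """Return True only when exactly two distinct classes are present.

    Binary ROC AUC is undefined when y_true contains a single class,
    which is the condition reported in the traceback above.
    """
    return len(set(y_true)) == 2


# Made-up labels: one split with no crashes at all, one with a few.
no_crashes = [0, 0, 0, 0, 0]
few_crashes = [0, 0, 1, 0, 1]

for name, labels in [("no_crashes", no_crashes), ("few_crashes", few_crashes)]:
    print(name, label_counts(labels), can_score_roc_auc(labels))
```

Running a check like this on each train/test split before scoring would show whether a split ended up with crashes missing entirely.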

alicefeng commented 6 years ago

Yeah, that was due to an error on my part. I fixed it, reran the pipeline and now have a canonical dataset that has non-zero crashes in it. Using that dataset led me to the second error posted here. I'll send you that file.

But you said even having all zeroes shouldn't lead to the first error, right? (@terryf82's dataset had non-zero weeks and he also got the first error.)