Closed: bpben closed this 1 year ago.
This is going to be a vital feature of the 2.0 release and our project more broadly. Before we dig too much into the implementation I think it would help if we aligned on terminology since this has already been referenced under several different names:
or something else? It will have to appear in the interface as a heading, in links, etc., and be easily understood by our users, so what does everyone prefer?
@j-t-t @bpben @alicefeng @shreyapandit
Vote for Profiles.
I think we need to take the riskiest predicted segments and extract some patterns. Maybe some kind of PCA approach might yield something here. But we'd need a way to extract recognizable patterns on a scalable basis.
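To make the PCA idea concrete, here is a minimal sketch, assuming a feature matrix with one row per segment and a vector of predicted risk scores. The function name `top_risk_pca` and the `top_frac` parameter are hypothetical, and PCA is computed with a plain NumPy SVD rather than any particular library.

```python
import numpy as np

def top_risk_pca(features, risk, top_frac=0.1, n_components=2):
    """Run PCA (via SVD) on the riskiest fraction of segments.

    features: (n_segments, n_features) array.
    risk:     (n_segments,) predicted risk scores.
    Returns (components, explained_variance_ratio, top_idx).
    """
    features = np.asarray(features, dtype=float)
    risk = np.asarray(risk, dtype=float)

    # Indices of the riskiest segments, highest risk first.
    n_top = max(2, int(round(top_frac * len(risk))))
    top_idx = np.argsort(risk)[::-1][:n_top]
    X = features[top_idx]

    # Standardize so no feature dominates purely because of its units.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    Z = (X - X.mean(axis=0)) / std

    # Rows of Vt are the principal directions (feature loadings).
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    ratio = s ** 2 / np.sum(s ** 2)
    return Vt[:n_components], ratio[:n_components], top_idx
```

Large absolute loadings in the first rows of `components` would point at the features that vary together among the riskiest segments, which is one possible starting point for a recognizable pattern.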
Profiles sounds good! We could do an EDA with our existing data: A) statistical analysis of a few things with the data as is, B) clustering the data, and C) applying an interpretability layer over the results, which lets us see what makes those clusters distinct. I will start working on some of these points.
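One simple form the interpretability layer in step C could take is sketched below: for each cluster, measure how far its feature means sit from the overall means, in units of the overall standard deviation. The function name `cluster_profile` and the example column names are hypothetical, and the labels are assumed to come from whatever clustering step B ends up using.

```python
import numpy as np
import pandas as pd

def cluster_profile(df, labels):
    """Standardized gap between each cluster's feature means and the overall means.

    df:     DataFrame of numeric segment features, one row per segment.
    labels: cluster label per row (any array-like of the same length).
    """
    labels = pd.Series(np.asarray(labels), index=df.index, name="cluster")
    overall_mean = df.mean()
    # Population std; constant columns get 1.0 to avoid division by zero.
    overall_std = df.std(ddof=0).replace(0, 1.0)
    cluster_means = df.groupby(labels).mean()
    return (cluster_means - overall_mean) / overall_std

# Hypothetical features and labels, for illustration only.
example = pd.DataFrame({
    "speed_limit": [25, 25, 30, 45, 50, 45],
    "lanes": [1, 1, 1, 2, 2, 2],
})
profile = cluster_profile(example, [0, 0, 0, 1, 1, 1])
```

Features with large positive or negative values in a cluster's row are the ones that make that cluster distinct.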
Work in progress on the branch profile_analysis. I create a pickle file during the model run and use that for my analysis.
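For anyone who wants to reproduce the pickle hand-off, a minimal sketch follows. The helper names, the file name, and the contents of the payload are all assumptions for illustration; the actual objects saved on the branch may differ.

```python
import pickle
import tempfile
from pathlib import Path

def save_run_artifacts(path, artifacts):
    """Write the objects produced by a model run to a pickle file."""
    with Path(path).open("wb") as f:
        pickle.dump(artifacts, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_run_artifacts(path):
    """Read the artifacts back for offline analysis (only unpickle files you trust)."""
    with Path(path).open("rb") as f:
        return pickle.load(f)

# Hypothetical payload: segment ids, feature names, and predicted risk scores.
artifacts = {
    "segment_ids": ["s1", "s2", "s3"],
    "feature_names": ["speed_limit", "lanes"],
    "predictions": [0.12, 0.87, 0.40],
}
with tempfile.TemporaryDirectory() as tmp:
    pickle_path = Path(tmp) / "run_artifacts.pkl"
    save_run_artifacts(pickle_path, artifacts)
    restored = load_run_artifacts(pickle_path)
```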
Correlation matrix Cambridge:
Correlation matrix Boston:
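A correlation matrix like the ones above can be computed per city with pandas. The sketch below is an assumption about how that might look, not the code on the branch; `strongest_correlations` is a hypothetical helper that also lists the most strongly correlated feature pairs.

```python
import numpy as np
import pandas as pd

def strongest_correlations(df, top_n=5):
    """Return the Pearson correlation matrix and its strongest feature pairs."""
    corr = df.corr()  # Pearson correlation by default
    cols = corr.columns

    # Upper triangle only: skips self-correlations and mirrored duplicates.
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.Series(
        corr.to_numpy()[i, j],
        index=pd.MultiIndex.from_arrays([cols[i], cols[j]]),
    )

    # Sort by absolute strength, strongest first (NaN pairs end up last).
    order = np.argsort(-np.abs(pairs.to_numpy()))
    return corr, pairs.iloc[order].head(top_n)
```

Running this once on the Boston features and once on the Cambridge features would give two matrices that can be compared side by side.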
I ran it on the Cambridge data as well. The high-risk cluster is much smaller than in Boston.
High-risk vs. low-risk segments, Boston:
We can pick out distinct clusters of red points (high-risk segments) and grey points (low-risk segments).
High-risk vs. low-risk segments, Cambridge:
We can see that t-SNE places the red high-risk segments and the grey low-risk segments closer together, hence the more homogeneous intermingling of grey and red.
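The difference between the two cities could also be put into a number rather than read off the plots. The sketch below assumes scikit-learn's `TSNE` for the embedding and adds a hypothetical `neighbor_purity` helper: the average share of each high-risk point's nearest neighbours in the embedding that are also high risk. A value near 1 would correspond to distinct red clusters, a value near the overall high-risk rate to intermingling.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

def embed_segments(features, perplexity=30.0, random_state=0):
    """Project standardized segment features to 2-D with t-SNE."""
    X = StandardScaler().fit_transform(np.asarray(features, dtype=float))
    # t-SNE requires perplexity to be smaller than the number of samples.
    perplexity = min(perplexity, max(1.0, (len(X) - 1) / 3))
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=random_state)
    return tsne.fit_transform(X)

def neighbor_purity(embedding, is_high_risk, k=5):
    """Mean share of each high-risk point's k nearest neighbours that are high risk."""
    emb = np.asarray(embedding, dtype=float)
    flags = np.asarray(is_high_risk, dtype=bool)
    # Pairwise distances; O(n^2) memory, so intended for modest sample sizes.
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    return flags[nearest][flags].mean()
```

Since t-SNE distorts global distances, this is best treated as a rough descriptive statistic, not a test.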
For Boston, KMeans gives intuitive clusters:
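For reference, a minimal KMeans sketch using scikit-learn is shown below. The number of clusters, the scaling step, and the function name `cluster_segments` are assumptions; the settings used for the Boston plots are not stated in this thread.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_segments(features, n_clusters=3, random_state=0):
    """Cluster standardized segment features with KMeans.

    Returns (labels, centers), with centers mapped back to original units
    so they can be read as "typical" segments.
    """
    scaler = StandardScaler()
    X = scaler.fit_transform(np.asarray(features, dtype=float))
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    centers = scaler.inverse_transform(km.cluster_centers_)
    return km.labels_, centers
```

Reading the centers in original units (for example, speed limit or lane count) is what makes it possible to judge whether the clusters are intuitive.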
Currently, for v2.0, we're going to put together a POC version of these profiles for city stakeholder review. If we get a sense that this is useful functionality, we'll develop it. Otherwise, we may go back to something like #103.
Okay, update: experimenting with personas got us to 8292c73d61788d3b5008b093d29186632f4a4b5a. Again, this is on hold because we want to figure out the best way to show it in the visual so that stakeholders can make use of it. But we've got something here, at least.
Closing this as stale, will revive as it becomes relevant.
As a city stakeholder, I'm interested in "why" a given segment is high risk. This is what we're getting at, somewhat, with #103 and #104. But maybe we can digest #103 a bit more: look at the most risky segments and see if there's something common about them. Are they all high speed? Do they all intersect with an offramp from a highway? These are what I mean by "personas". Can I, as a stakeholder, beyond just receiving a ranking, be given some idea of the common "profile" of a high-risk segment?
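As a rough illustration of the question being asked here, the sketch below compares how common each yes/no attribute is among the riskiest segments versus all segments. Everything in it is hypothetical: the function name `high_risk_profile`, the `risk` column, and attribute columns such as `high_speed` and `near_offramp` are stand-ins for whatever the real segment data contains.

```python
import pandas as pd

def high_risk_profile(segments, risk_col="risk", top_n=10):
    """Share of each binary attribute among the top-N riskiest segments vs. overall.

    segments: DataFrame with a risk score column and 0/1 attribute columns,
              indexed by a unique segment id.
    """
    flags = segments.drop(columns=[risk_col]).astype(float)
    top_ids = segments.nlargest(top_n, risk_col).index
    profile = pd.DataFrame({
        "top_share": flags.loc[top_ids].mean(),
        "overall_share": flags.mean(),
    })
    # Lift > 1 means the attribute is over-represented among risky segments.
    profile["lift"] = profile["top_share"] / profile["overall_share"]
    return profile.sort_values("lift", ascending=False)

# Hypothetical data, for illustration only.
segments = pd.DataFrame(
    {
        "risk": [0.9, 0.8, 0.1, 0.2, 0.3, 0.05],
        "high_speed": [1, 1, 0, 0, 1, 0],
        "near_offramp": [0, 1, 0, 1, 0, 0],
    },
    index=["s1", "s2", "s3", "s4", "s5", "s6"],
)
profile = high_risk_profile(segments, top_n=2)
```

Attributes with the highest lift would be candidates for the plain-language description of a high-risk "profile".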