Some features, such as POI and Geo, do not improve prediction for several individual crime categories, yet they do help when predicting total crime.
How can we design an experiment to explain this phenomenon?
Notations
Without loss of generality, we use the POI features and the Narcotics crime category for discussion. The model without POI features is called model 1; the model with POI features is called model 2.
Simply computing various correlations between the POI features and the crime count does not really explain why the prediction gets worse. Saying that model 2 has a higher error than model 1 amounts to saying that the POI features do not correlate with crime, which does not take us anywhere.
Solution
Compare the results of model 2 and model 1 region by region. There must be some regions where the prediction gets better, denoted as P, and some regions where it gets worse, denoted as N.
Intuitively, some property Q of the regions causes this difference. If so, the Q values within P, or within N, should be similar to each other, while the Q values across P and N should be less similar.
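The split into P and N described above can be sketched as follows. This is only a minimal illustration: the function name `split_regions`, the use of per-region absolute error, and the choice to drop ties are all assumptions, and the numbers are made up.

```python
import numpy as np

def split_regions(y_true, pred_model1, pred_model2):
    """Return (P, N): indices of regions where model 2 does better / worse than model 1.

    Assumption: "better" means a smaller absolute error in that region.
    Regions with equal error are left out of both sets.
    """
    y_true = np.asarray(y_true, dtype=float)
    err1 = np.abs(np.asarray(pred_model1, dtype=float) - y_true)
    err2 = np.abs(np.asarray(pred_model2, dtype=float) - y_true)
    P = np.flatnonzero(err2 < err1)  # model 2 (with POI) improves the prediction
    N = np.flatnonzero(err2 > err1)  # model 2 (with POI) hurts the prediction
    return P, N

# Made-up Narcotics counts for five regions, purely for illustration.
y_true = [10, 4, 7, 0, 3]
pred1 = [8, 4, 9, 2, 3]   # model 1: without POI features
pred2 = [9, 6, 7, 5, 3]   # model 2: with POI features
P, N = split_regions(y_true, pred1, pred2)
print("P =", P, "N =", N)
```

With a different error measure (for example relative error), the membership of P and N may change, so the choice should be fixed before comparing properties.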
The problem is then to search for this Q with the data we have. Possible candidates for Q are:
POI distribution
crime count
crime time series #18
We fill in the following table:

| Q | dist(PxP) | dist(NxN) | dist(PxN) |
|---|---|---|---|
| POI distribution | | | |
| Crime count | | | |
| Crime time series | | | |
where dist(PxP) is the average distance between the Q properties of regions p and q, taken over all pairs (p, q) from PxP; dist(NxN) and dist(PxN) are defined in the same way.
If we can find a Q such that dist(PxN) > dist(PxP) and dist(PxN) > dist(NxN), then Q is the reason. If no such Q exists, data availability is to blame.
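One row of the table and the test on it could be computed roughly as below. This is a sketch under stated assumptions: Euclidean distance is used as the distance between Q values, self-pairs (p, p) are excluded from dist(PxP) and dist(NxN), and the names `mean_pairwise_dist` and `explains_split` are hypothetical. The right distance for each Q (for example a distribution distance for POI, or a time-series distance for crime series) is left open by the text.

```python
import numpy as np

def mean_pairwise_dist(A, B, same_set=False):
    """Average Euclidean distance between the rows of A and the rows of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # D[i, j] = distance between row i of A and row j of B
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    if same_set:
        # Assumption: skip self-pairs and count each unordered pair once.
        # Needs at least two rows, otherwise the mean is undefined (nan).
        iu = np.triu_indices(len(A), k=1)
        return D[iu].mean()
    return D.mean()

def explains_split(Q, P, N):
    """Return (dist_PP, dist_NN, dist_PN, ok) for one candidate property Q."""
    Q = np.asarray(Q, dtype=float)
    if Q.ndim == 1:          # scalar property per region, e.g. crime count
        Q = Q[:, None]
    d_pp = mean_pairwise_dist(Q[P], Q[P], same_set=True)
    d_nn = mean_pairwise_dist(Q[N], Q[N], same_set=True)
    d_pn = mean_pairwise_dist(Q[P], Q[N])
    ok = bool(d_pn > d_pp and d_pn > d_nn)
    return d_pp, d_nn, d_pn, ok

# Made-up 2-D property vectors for five regions, purely for illustration.
Q = np.array([[0.0, 0.0], [10.0, 10.0], [1.0, 0.0], [10.0, 11.0], [5.0, 5.0]])
P = np.array([0, 2])   # regions improved by model 2
N = np.array([1, 3])   # regions made worse by model 2
d_pp, d_nn, d_pn, ok = explains_split(Q, P, N)
print(d_pp, d_nn, d_pn, ok)
```

Excluding self-pairs matters because their zero distances would pull dist(PxP) and dist(NxN) down and make the condition easier to satisfy, especially when P or N is small.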
Counter-intuitive Observations