Taxi flow feature as an independent view to predict crime

thekingofkings / chicago-crime

Crime correlation analysis

MIT License

11 stars, 3 forks

Closed by thekingofkings 7 years ago

thekingofkings commented 7 years ago

We use graph embedding to compute a vector representation of each CA, then use these CA vectors as features for crime prediction.

The graph embedding is generated with the method and code from the LINE paper.

thekingofkings commented 7 years ago

Setting:

Predict the crime count. (This differs from the rate prediction in KDD.)
Use LINE to generate a graph vector of size 20 for each CA as feature X.
Fit a GLM with Python statsmodels.
Evaluate with leave-one-out.

With the GLM.NegativeBinomial model, fitting fails with the following error:

SVD did not converge.

With the GLM.Poisson model, the error is huge:

MAE	MRE
26421	5.519
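
The thread does not define MAE and MRE. A minimal sketch, assuming MAE is the mean absolute error and MRE is MAE divided by the mean true count: the testing MAE/MRE pairs reported in this thread are consistent with that reading (for example, 26421 / 5.519 ≈ 4787), but the thread does not confirm it. The function names and toy numbers below are illustrative only.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mre(y_true, y_pred):
    """Mean relative error, assumed here to be MAE divided by the mean true count."""
    return mae(y_true, y_pred) / (sum(y_true) / len(y_true))


# Toy counts, made up for illustration.
y_true = [100, 200, 300]
y_pred = [110, 180, 330]
print(mae(y_true, y_pred))  # absolute errors 10, 20, 30 -> 20.0
print(mre(y_true, y_pred))  # 20 / 200 -> 0.1
```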

thekingofkings commented 7 years ago

Updated the setting by adding a constant feature term (a column of 1s).

Method	training MAE	training MRE	testing MAE	testing MRE
Negative Binomial	3684	1.5815	3726.11	0.7784
Poisson	3997	1.5811	3995.54	0.8346
Gaussian	3764	1.6313	3520.31	0.7354

GLM.Gaussian gives the lowest testing error, but not the lowest training error. How come?

thekingofkings commented 7 years ago

Method	training MAE	training MRE	testing MAE	testing MRE
Negative Binomial	741.47	0.7186	578.88	0.4320
Poisson	761.99	0.7369	574.36	0.4286
Gaussian	768.25	0.7514	628.31	0.4689

GLM.Poisson gives the lowest testing error, but not the lowest training error. Why?

Also, note that NB is close to the best model on every metric, whereas GLM.Gaussian trails on both training and testing.

thekingofkings commented 7 years ago

Method	testing MAE (40)	testing MRE (40)	testing MAE (20)	testing MRE (20)
Negative Binomial	1010.40	0.7540	578.88	0.4320
Poisson	998.98	0.7455	574.36	0.4286
Gaussian	869.53	0.6489	628.31	0.4689

Conclusion: for our problem, we should prefer a shorter graph embedding vector. Evidence: size 40 gives a higher testing error than size 20 for all three models.
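
As a quick check of that comparison, the relative increase in testing MAE from size 20 to size 40 can be computed directly from the table. The snippet below only re-uses the reported numbers; the variable names are illustrative. The increases come out at roughly 74%, 74%, and 38% for Negative Binomial, Poisson, and Gaussian respectively.

```python
# Testing MAE copied from the table in this comment: (size 40, size 20).
testing_mae = {
    "Negative Binomial": (1010.40, 578.88),
    "Poisson": (998.98, 574.36),
    "Gaussian": (869.53, 628.31),
}

# Relative increase in testing MAE when moving from size 20 to size 40.
increase = {m: mae40 / mae20 - 1.0 for m, (mae40, mae20) in testing_mae.items()}
for model, rel in increase.items():
    print(f"{model}: testing MAE is {rel:.1%} higher at size 40")
```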

thekingofkings commented 7 years ago

Best setting to learn graph embedding: