thekingofkings / chicago-crime

Crime correlation analysis
MIT License

Taxi flow feature as an independent view to predict crime #27

Closed thekingofkings closed 7 years ago

thekingofkings commented 7 years ago

We use graph embedding to compute a vector representation of each community area (CA), then use these CA vectors as features for crime prediction.

The graph embedding is generated using the method and code from the LINE paper.

thekingofkings commented 7 years ago

Setting:

  1. Predict crime count. (This is different from the rate prediction in KDD).
  2. Use LINE to generate a graph vector of size 20 for each CA as the feature X.
  3. Use the Python statsmodels GLM model.
  4. Leave-one-out evaluation.

With the GLM.NegativeBinomial model, fitting fails with the following error:

SVD did not converge.

With the GLM.Poisson model, the error is huge:

| MAE | MRE |
| --- | --- |
| 26421 | 5.519 |
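The thread does not define the two metrics. A minimal sketch, assuming MAE is the mean absolute error and MRE is the total absolute error divided by the total true value (equivalently, MAE divided by the mean of the true values). The pair above is at least consistent with that reading: 26421 / 5.519 ≈ 4787, which would then be the mean crime count. The function names and sample numbers are made up for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_relative_error(y_true, y_pred):
    """Assumed definition: sum of absolute errors divided by sum of true values."""
    total_error = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return total_error / sum(y_true)


# Made-up example values, not data from this issue.
y_true = [100, 200, 300]
y_pred = [110, 180, 330]

mae = mean_absolute_error(y_true, y_pred)
mre = mean_relative_error(y_true, y_pred)
print(mae, mre)
```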
thekingofkings commented 7 years ago

Issue found: the constant (intercept) term was missing from the features.

Updated the setting by adding a constant feature term equal to 1.
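A small sketch of the fix, assuming the embeddings are held in a NumPy array. statsmodels' GLM does not add an intercept on its own, so the column of ones has to be part of X. statsmodels also ships `sm.add_constant`; the plain-NumPy helper below (the name `with_intercept` is hypothetical) is used so that a single held-out row is handled the same way as the full matrix.

```python
import numpy as np


def with_intercept(features):
    """Prepend a column of ones so the model can learn a constant term."""
    features = np.asarray(features, dtype=float)
    if features.ndim == 1:
        features = features.reshape(1, -1)  # treat a 1-D input as one row
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])


# Placeholder embeddings: 77 rows of size-20 vectors (values are irrelevant here).
embeddings = np.zeros((77, 20))

X = with_intercept(embeddings)               # full design matrix
x_single = with_intercept(embeddings[0])     # one held-out row
print(X.shape, x_single.shape)
```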

| Method | training MAE | training MRE | testing MAE | testing MRE |
| --- | --- | --- | --- | --- |
| Negative Binomial | 3684 | 1.5815 | 3726.11 | 0.7784 |
| Poisson | 3997 | 1.5811 | 3995.54 | 0.8346 |
| Gaussian | 3764 | 1.6313 | 3520.31 | 0.7354 |

GLM.Gaussian gives the lowest testing error, but not the lowest training error. How come?

thekingofkings commented 7 years ago

Update: use rate prediction instead of count prediction

| Method | training MAE | training MRE | testing MAE | testing MRE |
| --- | --- | --- | --- | --- |
| Negative Binomial | 741.47 | 0.7186 | 578.88 | 0.4320 |
| Poisson | 761.99 | 0.7369 | 574.36 | 0.4286 |
| Gaussian | 768.25 | 0.7514 | 628.31 | 0.4689 |

GLM.Poisson gives the lowest testing error, but not the lowest training error. Why?

Also note that NB is close to the best model overall: it has the lowest training error, and its testing error is within 1% of Poisson's. GLM.Gaussian trails on both training and testing.

thekingofkings commented 7 years ago

Compare embedding vectors of size 40 vs. size 20

| Method | testing MAE (40) | testing MRE (40) | testing MAE (20) | testing MRE (20) |
| --- | --- | --- | --- | --- |
| Negative Binomial | 1010.40 | 0.7540 | 578.88 | 0.4320 |
| Poisson | 998.98 | 0.7455 | 574.36 | 0.4286 |
| Gaussian | 869.53 | 0.6489 | 628.31 | 0.4689 |

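Conclusion: for our problem, a shorter graph embedding vector is preferable. Size 40 gives higher testing error than size 20 for all three models: testing MAE rises by about 74% for Negative Binomial and Poisson, and by about 38% for Gaussian.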

thekingofkings commented 7 years ago

Best setting to learn graph embedding:

  1. Consider 3 neighbors
  2. Vector size is 6
  3. Train on 20 million graph samples
  4. Number of negative samples is 5