szcf-weiya / ESL-CN

Chinese translation, code implementations, and exercise solutions for The Elements of Statistical Learning (ESL).
https://esl.hohoweiya.xyz
GNU General Public License v3.0

Leukemia Data #207

Open szcf-weiya opened 4 years ago

szcf-weiya commented 4 years ago

Paper: Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., … Lander, E. S. (1999). Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286(5439), 531–537. https://doi.org/10.1126/science.286.5439.531

Data: http://portals.broadinstitute.org/cgi-bin/cancer/publications/view/43

Applications in ESL: Section 18.4

szcf-weiya commented 4 years ago

Reproduce Fig. 18.5

[Figures: raw_data, scaled_data]

szcf-weiya commented 4 years ago

R version

> lasso.path

Call:  glmnet(x = t(train_X), y = train_y, family = "binomial", lambda = grid) 

       Df    %Dev    Lambda
  [1,]  1 0.02309 0.3679000
  [2,]  1 0.03381 0.3642000
  [3,]  1 0.04433 0.3605000
  [4,]  2 0.05532 0.3567000
  [5,]  2 0.06634 0.3530000
...
 [96,] 16 0.96520 0.0151900
 [97,] 17 0.97370 0.0114700
 [98,] 17 0.98230 0.0077610
 [99,] 18 0.99070 0.0040480
[100,] 23 0.99920 0.0003355
> elnet.path

Call:  glmnet(x = t(train_X), y = train_y, family = "binomial", alpha = 0.8,      lambda = grid) 

       Df   %Dev    Lambda
  [1,]  4 0.2187 0.3679000
  [2,]  4 0.2269 0.3642000
  [3,]  4 0.2350 0.3605000
  [4,]  4 0.2432 0.3567000
  [5,]  4 0.2512 0.3530000
...
 [96,] 29 0.9700 0.0151900
 [97,] 30 0.9773 0.0114700
 [98,] 32 0.9846 0.0077610
 [99,] 37 0.9919 0.0040480
[100,] 45 0.9993 0.0003355

Julia version

julia> lasso_path

Logistic GLMNet Solution Path (100 solutions for 7129 predictors in 4994 passes):
─────────────────────────────────
       df    pct_dev            λ
─────────────────────────────────
  [1]  21  0.999225   0.000335463
  [2]  18  0.990733   0.00404803 
  [3]  17  0.982281   0.00776059 
  [4]  17  0.973758   0.0114732  
  [5]  16  0.9652     0.0151857  
...
 [96]   2  0.0662522  0.353029   
 [97]   2  0.0552272  0.356742   
 [98]   1  0.0443347  0.360454   
 [99]   1  0.0338104  0.364167   
[100]   1  0.0230916  0.367879   
─────────────────────────────────
julia> elnet_path

Logistic GLMNet Solution Path (100 solutions for 7129 predictors in 4683 passes):
────────────────────────────────
       df   pct_dev            λ
────────────────────────────────
  [1]  46  0.99932   0.000335463
  [2]  37  0.991922  0.00404803 
  [3]  32  0.984638  0.00776059 
  [4]  30  0.97734   0.0114732  
  [5]  29  0.970059  0.0151857  
...
 [96]   4  0.251197  0.353029   
 [97]   4  0.243146  0.356742   
 [98]   4  0.235044  0.360454   
 [99]   4  0.226891  0.364167   
[100]   4  0.218684  0.367879   
────────────────────────────────
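
The λ column in both printouts looks like 100 equally spaced values between exp(-8) and exp(-1). The sketch below checks that reading against the values printed by Julia. The grid construction is an assumption inferred from the output, not taken from the script that produced it, and the names `linear_grid` and `PRINTED` are hypothetical.

```python
import math

# Assumption: `grid` is 100 equally spaced values on [exp(-8), exp(-1)].
# This is inferred from the printed Lambda column, not from the original script.
N_LAMBDA = 100
LAMBDA_MIN = math.exp(-8)
LAMBDA_MAX = math.exp(-1)


def linear_grid(lo, hi, n):
    """Return n equally spaced values from lo up to hi (Julia's printed order)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


grid = linear_grid(LAMBDA_MIN, LAMBDA_MAX, N_LAMBDA)

# (1-based row in the Julia output, printed λ), copied from the tables above.
PRINTED = [
    (1, 0.000335463),
    (2, 0.00404803),
    (3, 0.00776059),
    (4, 0.0114732),
    (5, 0.0151857),
    (96, 0.353029),
    (100, 0.367879),
]

# Largest gap between the reconstructed grid and the printed values.
max_err = max(abs(grid[row - 1] - lam) for row, lam in PRINTED)
print(f"max |grid - printed| = {max_err:.1e}")
```

If the gap stays at the level of the printout's rounding, the linear-grid reading is consistent with both tables. The R output lists the same values in reverse order.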

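Not much difference: on the rows shown, %Dev agrees to within about 1e-4, and Df differs only at the smallest λ (23 vs. 21 for the lasso, 45 vs. 46 for the elastic net). Note that R prints the path in decreasing λ, while Julia prints it in increasing λ. The agreement is expected, since the Julia version is just a wrapper of the Fortran code, and the R version can also be regarded as a wrapper of the same Fortran code.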
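
The comparison can be checked directly on the rows printed above. This is a minimal sketch that pairs R row i with Julia row 101 - i, since the two printouts list λ in opposite order. It uses only the displayed rows, so it says nothing about the elided middle of the path. The names `LASSO`, `ELNET`, and `compare` are hypothetical.

```python
# Rows copied from the printed output:
# (R row, R Df, R %Dev, Julia df, Julia pct_dev), with Julia row = 101 - R row.
LASSO = [
    (1, 1, 0.02309, 1, 0.0230916),
    (2, 1, 0.03381, 1, 0.0338104),
    (3, 1, 0.04433, 1, 0.0443347),
    (4, 2, 0.05532, 2, 0.0552272),
    (5, 2, 0.06634, 2, 0.0662522),
    (96, 16, 0.96520, 16, 0.9652),
    (97, 17, 0.97370, 17, 0.973758),
    (98, 17, 0.98230, 17, 0.982281),
    (99, 18, 0.99070, 18, 0.990733),
    (100, 23, 0.99920, 21, 0.999225),
]
ELNET = [
    (1, 4, 0.2187, 4, 0.218684),
    (2, 4, 0.2269, 4, 0.226891),
    (3, 4, 0.2350, 4, 0.235044),
    (4, 4, 0.2432, 4, 0.243146),
    (5, 4, 0.2512, 4, 0.251197),
    (96, 29, 0.9700, 29, 0.970059),
    (97, 30, 0.9773, 30, 0.97734),
    (98, 32, 0.9846, 32, 0.984638),
    (99, 37, 0.9919, 37, 0.991922),
    (100, 45, 0.9993, 46, 0.99932),
]


def compare(rows):
    """Return (largest |%Dev gap|, list of R rows where Df differs)."""
    max_gap = max(abs(r_dev - j_dev) for _, _, r_dev, _, j_dev in rows)
    df_mismatch = [row for row, r_df, _, j_df, _ in rows if r_df != j_df]
    return max_gap, df_mismatch


for name, rows in (("lasso", LASSO), ("elastic net", ELNET)):
    max_gap, df_mismatch = compare(rows)
    print(f"{name}: max |%Dev gap| = {max_gap:.1e}, Df differs at R rows {df_mismatch}")
```

Part of the %Dev gap is just rounding, because R prints fewer significant digits than Julia.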

szcf-weiya commented 4 years ago

Reproduce Fig. 18.6

[Figure: err_and_dev_vs_log_lambda]