moriaki3193 / spike

Rust Machine Learning Library
MIT License

SVM-light formatted data converter #9

Open moriaki3193 opened 6 years ago

moriaki3193 commented 6 years ago

Reference

Implementation

moriaki3193 commented 6 years ago

Example input CSV (header and records)

n_presi,n_avgsi4,n_disavgsi,n_goavgsi,n_sdwtdsi,w2c,eps,jnowin,jwinper,jst1miss,newdis,rid,post,draw,hname,jname,odds,fp,fp_std,dtime,distance,going
-0.46646920527455926,-0.7185993734587252,-1.0943026026511933,-1.9662897853661765,-0.6100055122825876,-0.8819171036881969,1.3565686339704375,0.010470901223703775,-0.11366412645408085,0,0,1001010109,1.0,1.0,インプレスウィナー,丸田 恭介,11.0,9.0,0.5333333333333333,2010-08-14 15:25,1200,良
0.15435766566966708,-0.3819592747601661,-0.8139674915670043,-0.6989631947952272,-1.2858243573606245,-0.8819171036881969,-0.8614705923753873,0.8481429991200058,0.29340345876039214,0,0,1001010109,2.0,1.0,ジョイントスターズ,中舘 英二,96.5,11.0,0.6666666666666666,2010-08-14 15:25,1200,良
-0.43114863931274267,0.38018319397651235,0.5555164721504604,0.14264269875793223,0.7670860731515375,-1.889822365046136,-0.8614705923753873,-0.8272011966725982,-0.8976215745771917,0,0,1001010109,3.0,2.0,モトヒメ,大野 拓弥,12.8,8.0,0.4666666666666667,2010-08-14 15:25,1200,良
0.34862077845965345,0.3215434120374956,1.3254320403912299,1.1588287258741705,-0.6680498305377315,0.1259881576697424,-0.8614705923753873,-0.15706351835555662,-0.5316965156615288,0,0,1001010109,4.0,2.0,オリオンザドンペリ,勝浦 正樹,39.8,7.0,0.4,2010-08-14 15:25,1200,良
0.5782044572114563,0.23506865912842137,1.5521144986363344,-0.13588079217626198,0.6839130667654052,0.1259881576697424,-0.14852941247851506,0.513074159961485,0.18721353542485208,0,0,1001010109,5.0,3.0,ローズカットダイヤ,秋山 真一郎,4.7,2.0,0.06666666666666667,2010-08-14 15:25,1200,良
-0.581940286303572,0.649798080150932,0.6655044104705695,0.660124144594732,0.041570399489356716,0.1259881576697424,1.5150000072808534,-0.324597937934817,-0.10968238766041334,0,0,1001010109,6.0,3.0,トウカイミステリー,武 幸四郎,10.1,1.0,0.0,2010-08-14 15:25,1200,良
-2.1292527720923524,-0.6364920323127365,-1.7545655616520976,-1.5736521354256425,-0.5054021453588856,0.1259881576697424,-0.8614705923753873,-0.8272011966725982,-0.6118569071602659,0,0,1001010109,7.0,4.0,エネルマオー,小林 徹弥,123.4,13.0,0.8,2010-08-14 15:25,1200,良
1.0332963647963929,0.8306089271426325,1.9061251364532717e-15,1.0071608377531043,0.0788788019691247,1.1338934190276817,0.5644117674183572,2.0208839361748283,1.1997022400948312,0,0,1001010109,8.0,4.0,デリキットピース,藤田 伸二,6.1,3.0,0.13333333333333336,2010-08-14 15:25,1200,良
0.9640137161789841,0.541602732740116,0.506781979154963,0.4524609639543896,0.779768547889198,-0.8819171036881969,0.5644117674183572,0.5689189664879051,0.539770844588229,0,0,1001010109,9.0,5.0,ルシュクル,四位 洋文,6.5,6.0,0.33333333333333337,2010-08-14 15:25,1200,良
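
A minimal sketch of pulling the relevant columns out of one record, in case it helps when wiring up the reader. The function names `header_index` and `parse_record` are hypothetical, not part of spike. It assumes that no field contains a quoted comma (true for the sample above) and that every column to the left of `rid` is a numeric feature; both are assumptions, not something confirmed in this thread.

```rust
use std::collections::HashMap;

/// Map each header name to its column index (hypothetical helper).
fn header_index(header: &str) -> HashMap<&str, usize> {
    header
        .split(',')
        .enumerate()
        .map(|(i, name)| (name, i))
        .collect()
}

/// Extract (rid, hname, context) from one record line.
/// Assumptions: fields contain no quoted commas, and all columns
/// before `rid` are numeric features (the context vector).
/// Returns None if a column is missing or a feature fails to parse.
fn parse_record<'a>(
    cols: &HashMap<&str, usize>,
    line: &'a str,
) -> Option<(&'a str, &'a str, Vec<f64>)> {
    let fields: Vec<&str> = line.split(',').collect();
    let rid_idx = *cols.get("rid")?;
    let rid = *fields.get(rid_idx)?;
    let hname = *fields.get(*cols.get("hname")?)?;
    // Parse every column left of `rid` as f64; bail out on the first failure.
    let context = fields[..rid_idx]
        .iter()
        .map(|f| f.parse::<f64>().ok())
        .collect::<Option<Vec<f64>>>()?;
    Some((rid, hname, context))
}
```

Looking columns up by header name rather than by fixed position keeps the sketch working if columns are reordered. A real reader would probably want a proper CSV parser instead of `split(',')`.
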
moriaki3193 commented 6 years ago

one-hot vectorizer

use std::collections::BTreeMap;

fn main() {
    // Maps each horse name to a 1-based entity ID, assigned in first-seen order.
    let mut entity_id_table: BTreeMap<&str, usize> = BTreeMap::new();

    let horse_names = vec![
        "キタサンブラック",
        "ディープインパクト",
        "アーモンドアイ",
        "ジェンティルドンナ",
        "ディープインパクト",
        "サトノダイアモンド",
        "アーモンドアイ",
    ];

    let mut next_id: usize = 1;

    for &horse_name in &horse_names {
        // A repeated name keeps the ID from its first occurrence;
        // the closure (and the counter bump) only runs for new names.
        entity_id_table.entry(horse_name).or_insert_with(|| {
            let this_entity_id = next_id;
            next_id += 1;
            this_entity_id
        });
    }

    println!("{:?}", entity_id_table);
    /*
      Contents (BTreeMap iterates in key order, not insertion order):
      {
        "アーモンドアイ": 3,
        "キタサンブラック": 1,
        "サトノダイアモンド": 5,
        "ジェンティルドンナ": 4,
        "ディープインパクト": 2
      }
    */
}
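
The snippet above only builds the name-to-ID table. Turning an ID into an actual one-hot vector could look roughly like the sketch below. `build_id_table` and `one_hot` are hypothetical names, and the dense `Vec<f64>` output is an assumption made for readability; a sparse index would likely suit SVM-light output better.

```rust
use std::collections::BTreeMap;

/// Assign 1-based IDs to names in first-seen order
/// (same idea as the entity ID table above).
fn build_id_table<'a>(names: &[&'a str]) -> BTreeMap<&'a str, usize> {
    let mut table = BTreeMap::new();
    for &name in names {
        // IDs are dense, so the next free ID is always len + 1.
        let next_id = table.len() + 1;
        table.entry(name).or_insert(next_id);
    }
    table
}

/// Dense one-hot vector of length `table.len()`.
/// ID k sets position k - 1; unknown names yield None.
fn one_hot(table: &BTreeMap<&str, usize>, name: &str) -> Option<Vec<f64>> {
    let id = *table.get(name)?;
    let mut v = vec![0.0; table.len()];
    v[id - 1] = 1.0;
    Some(v)
}
```

Using `table.len() + 1` as the next ID removes the separate counter variable, at the cost of relying on IDs never being removed from the table.
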
moriaki3193 commented 6 years ago

Algorithm

  1. Read all records in a given CSV file.
  2. One-hot-vectorize the horse names in those records and build a mapping table.
  3. For each race in races:
     3-1. One-hot-vectorize the horses in the race and build a (temporary) combination index vector.
     3-2. For each horse in the race:
          3-2-1. One-hot-vectorize the horse (a.k.a. entity index vector) using the mapping table.
          3-2-2. Extract the horse's basic features (a.k.a. context vector).
          3-2-3. One-hot-vectorize the horse's opponents (a.k.a. combination index vector) using the temporary combination index vector.
          3-2-4. Concatenate the 3 vectors (combination index vector ++ entity index vector ++ context vector).
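
Step 3 could be sketched as below, emitting one SVM-light line (`<target> <index>:<value> ...`, with 1-based, increasing feature indices) per horse. Everything named here (`Entry`, `race_to_svmlight`, the `label` field) is hypothetical. The index layout and the choice of label are assumptions for illustration; this thread does not say which column should be the target.

```rust
use std::collections::BTreeMap;

/// One horse's entry in a race (hypothetical struct for this sketch).
struct Entry<'a> {
    name: &'a str,
    label: f64,        // assumed target value; the source column is not specified
    context: Vec<f64>, // step 3-2-2: the horse's basic features
}

/// Build one SVM-light line per horse in a race.
/// Assumed feature layout, with n = number of known horses and d = context size:
///   1 ..= n          combination index vector (opponents in this race)
///   n+1 ..= 2n       entity index vector (the horse itself)
///   2n+1 ..= 2n+d    context vector
/// Returns None if a horse is missing from the mapping table.
fn race_to_svmlight(ids: &BTreeMap<&str, usize>, race: &[Entry<'_>]) -> Option<Vec<String>> {
    let n = ids.len();

    // Step 3-1: IDs of every horse in the race (temporary combination index vector).
    let mut race_ids = Vec::with_capacity(race.len());
    for entry in race {
        race_ids.push(*ids.get(entry.name)?);
    }
    race_ids.sort_unstable(); // SVM-light expects increasing feature indices

    let mut lines = Vec::with_capacity(race.len());
    for entry in race {
        // Step 3-2-1: the horse's own ID from the mapping table.
        let own = *ids.get(entry.name)?;
        let mut feats: Vec<(usize, f64)> = Vec::new();

        // Step 3-2-3: opponents are the race's horses minus the horse itself.
        for &id in &race_ids {
            if id != own {
                feats.push((id, 1.0));
            }
        }
        // Step 3-2-4: concatenate by offsetting indices; entity block comes second.
        feats.push((n + own, 1.0));
        // Context block comes last; zeros are omitted because the format is sparse.
        for (j, &x) in entry.context.iter().enumerate() {
            if x != 0.0 {
                feats.push((2 * n + j + 1, x));
            }
        }

        let body: Vec<String> = feats.iter().map(|(i, x)| format!("{}:{}", i, x)).collect();
        lines.push(format!("{} {}", entry.label, body.join(" ")));
    }
    Some(lines)
}
```

Working with sparse `(index, value)` pairs avoids materializing the two one-hot blocks, which would otherwise be as wide as the number of distinct horses. Concatenation (step 3-2-4) then reduces to adding a fixed offset per block.
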