kakao / buffalo

TOROS Buffalo: A fast and scalable production-ready open source project for recommender systems
Apache License 2.0
575 stars 106 forks source link

Training fails #8

Closed ita9naiwa closed 5 years ago

ita9naiwa commented 5 years ago

I downloaded KakaoBrunch12M dataset and ran by copying example_als.py.

but it seems that training fails (Nan Error)

Code

def example1():
    log.set_log_level(log.DEBUG)
    als_option = ALSOption().get_default_option()
    als_option.validation = aux.Option({'topk': 10})
    data_option = MatrixMarketOptions().get_default_option()
    data_option.input.main = '../tests/data/ext/main'
    #data_option.input.iid = '../tests/data/iid'

    als = ALS(als_option, data_opt=data_option)
    als.initialize()
    als.train()
    print('metrics for validations\n%s' % json.dumps(als.get_validation_results(), indent=2))

    print('Run hyper parameter optimization for val_ndcg...')
    als.opt.num_workers = 4
    als.opt.evaluation_period = 10
    als.opt.optimize = aux.Option({
        'loss': 'val_ndcg',
        'max_trials': 100,
        'deployment': True,
        'start_with_default_parameters': True,
        'space': {
            'd': ['randint', ['d', 10, 128]],
            'reg_u': ['uniform', ['reg_u', 0.1, 1.0]],
            'reg_i': ['uniform', ['reg_i', 0.1, 1.0]],
            'alpha': ['randint', ['alpha', 1, 10]],
        }
    })
    log.set_log_level(log.INFO)

    als.opt.model_path = './example1.ml100k.als.optimize.bin'
    print(json.dumps({'alpha': als.opt.alpha, 'd': als.opt.d,
                      'reg_u': als.opt.reg_u, 'reg_i': als.opt.reg_i}, indent=2))
    als.optimize()
    als.load('./example1.ml100k.als.optimize.bin')

Logs

[ERROR   ] 2019-08-31 07:54:58 [als.py:22] ImportError CuALS, no cuda library exists. error message: No module named 'buffalo.algo.cuda'
[INFO    ] 2019-08-31 07:54:58 [mm.py:193] Create the database from matrix market file.
[DEBUG   ] 2019-08-31 07:54:58 [mm.py:198] Building meta part...
[DEBUG   ] 2019-08-31 07:54:58 progress:   0%| 00:00<?
[INFO    ] 2019-08-31 07:54:58 [mm.py:206] Creating working data...
[INFO    ] 2019-08-31 07:54:58 progress:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:55:08 [mm.py:210] Working data is created on /tmp/tmpi_15tjok
[INFO    ] 2019-08-31 07:55:08 [mm.py:211] Building data part...
[INFO    ] 2019-08-31 07:55:08 [base.py:346] Building compressed triplets for rowwise...
[INFO    ] 2019-08-31 07:55:08 [base.py:347] Preprocessing...
[INFO    ] 2019-08-31 07:55:08 [base.py:350] In-memory Compressing ...
[INFO    ] 2019-08-31 07:55:11 [base.py:249] Load triplet bin: b'/tmp//chunk.bin'
[INFO    ] 2019-08-31 07:55:12 [base.py:380] Finished
[INFO    ] 2019-08-31 07:55:12 [base.py:346] Building compressed triplets for colwise...
[INFO    ] 2019-08-31 07:55:12 [base.py:347] Preprocessing...
[INFO    ] 2019-08-31 07:55:12 [base.py:350] In-memory Compressing ...
[INFO    ] 2019-08-31 07:55:15 [base.py:249] Load triplet bin: b'/tmp//chunk.bin'
[INFO    ] 2019-08-31 07:55:15 [base.py:380] Finished
[INFO    ] 2019-08-31 07:55:16 [mm.py:225] DB built on ./mm.h5py
[INFO    ] 2019-08-31 07:55:16 [als.py:56] ALS({
  "evaluation_on_learning": true,
  "compute_loss_on_training": true,
  "early_stopping_rounds": 0,
  "save_best": false,
  "evaluation_period": 1,
  "save_period": 10,
  "random_seed": 0,
  "validation": {
    "topk": 10
  },
  "adaptive_reg": false,
  "save_factors": false,
  "accelerator": false,
  "d": 20,
  "num_iters": 10,
  "num_workers": 1,
  "hyper_threads": 256,
  "num_cg_max_iters": 3,
  "reg_u": 0.1,
  "reg_i": 0.1,
  "alpha": 8,
  "optimizer": "manual_cg",
  "cg_tolerance": 1e-10,
  "eps": 1e-10,
  "model_path": "",
  "data_opt": {}
})
[INFO    ] 2019-08-31 07:55:16 [als.py:58] MatrixMarket Header(306291, 505926, 12600038) Validation(500 samples)
[debug   ] 2019-08-31 07:55:16 [als.cc:72] P(306291 x 20) Q(505926 x 20) setted
[INFO    ] 2019-08-31 07:55:16 [buffered_data.py:71] Set data buffer size as 67108864(minimum required batch size is 35447).
[DEBUG   ] 2019-08-31 07:55:17 [base.py:342] Cannot find tensorboard configuration.
[DEBUG   ] 2019-08-31 07:55:17 rowwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:55:18 [als.py:134] rowwise updated: processed(12600038) elapsed(data feed: 0.065s update: 1.25s)
[DEBUG   ] 2019-08-31 07:55:18 colwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:55:20 [als.py:134] colwise updated: processed(12600038) elapsed(data feed: 0.063s update: 1.76s)
[INFO    ] 2019-08-31 07:55:23 [als.py:173] Validation: ndcg:0.00678 map:0.00393 accuracy:0.01616 rmse:2.49048 error:1.70700 Elapsed 3.212 secs
[INFO    ] 2019-08-31 07:55:23 [als.py:176] Iteration 1: RMSE 0.034 Elapsed 3.134 secs
[DEBUG   ] 2019-08-31 07:55:23 rowwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:55:24 [als.py:134] rowwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.28s)
[DEBUG   ] 2019-08-31 07:55:24 colwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:55:26 [als.py:134] colwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.84s)
[INFO    ] 2019-08-31 07:55:29 [als.py:173] Validation: ndcg:0.01974 map:0.01420 accuracy:0.03737 rmse:2.40212 error:1.58591 Elapsed 3.219 secs
[INFO    ] 2019-08-31 07:55:29 [als.py:176] Iteration 2: RMSE 0.030 Elapsed 3.122 secs
[DEBUG   ] 2019-08-31 07:55:29 rowwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:55:31 [als.py:134] rowwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.28s)
[DEBUG   ] 2019-08-31 07:55:31 colwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:55:32 [als.py:134] colwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.79s)
[INFO    ] 2019-08-31 07:55:36 [als.py:173] Validation: ndcg:0.01994 map:0.01383 accuracy:0.03939 rmse:2.39055 error:1.56148 Elapsed 3.256 secs
[INFO    ] 2019-08-31 07:55:36 [als.py:176] Iteration 3: RMSE 0.029 Elapsed 3.069 secs
[DEBUG   ] 2019-08-31 07:55:36 rowwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:55:37 [als.py:134] rowwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.28s)
[DEBUG   ] 2019-08-31 07:55:37 colwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:55:39 [als.py:134] colwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.79s)
[INFO    ] 2019-08-31 07:55:42 [als.py:173] Validation: ndcg:0.02089 map:0.01373 accuracy:0.04343 rmse:2.38696 error:1.55429 Elapsed 3.247 secs
[INFO    ] 2019-08-31 07:55:42 [als.py:176] Iteration 4: RMSE 0.029 Elapsed 3.080 secs
[DEBUG   ] 2019-08-31 07:55:42 rowwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:55:43 [als.py:134] rowwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.28s)
[DEBUG   ] 2019-08-31 07:55:43 colwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:55:45 [als.py:134] colwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.79s)
[INFO    ] 2019-08-31 07:55:48 [als.py:173] Validation: ndcg:0.02544 map:0.01794 accuracy:0.04949 rmse:2.38435 error:1.55086 Elapsed 3.222 secs
[INFO    ] 2019-08-31 07:55:48 [als.py:176] Iteration 5: RMSE 0.029 Elapsed 3.066 secs
[DEBUG   ] 2019-08-31 07:55:48 rowwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:55:50 [als.py:134] rowwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.27s)
[DEBUG   ] 2019-08-31 07:55:50 colwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:55:51 [als.py:134] colwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.84s)
[INFO    ] 2019-08-31 07:55:55 [als.py:173] Validation: ndcg:0.02673 map:0.01849 accuracy:0.05354 rmse:2.38284 error:1.54936 Elapsed 3.280 secs
[INFO    ] 2019-08-31 07:55:55 [als.py:176] Iteration 6: RMSE 0.029 Elapsed 3.113 secs
[DEBUG   ] 2019-08-31 07:55:55 rowwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:55:56 [als.py:134] rowwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.28s)
[DEBUG   ] 2019-08-31 07:55:56 colwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:55:58 [als.py:134] colwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.81s)
[INFO    ] 2019-08-31 07:56:00 [als.py:173] Validation: ndcg:0.00000 map:0.00000 accuracy:0.00000 rmse:nan error:nan Elapsed 1.877 secs
[INFO    ] 2019-08-31 07:56:00 [als.py:176] Iteration 7: RMSE nan Elapsed 3.095 secs
[DEBUG   ] 2019-08-31 07:56:00 rowwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:56:01 [als.py:134] rowwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.26s)
[DEBUG   ] 2019-08-31 07:56:01 colwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:56:03 [als.py:134] colwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.8s)
[INFO    ] 2019-08-31 07:56:05 [als.py:173] Validation: ndcg:0.00000 map:0.00000 accuracy:0.00000 rmse:nan error:nan Elapsed 1.882 secs
[INFO    ] 2019-08-31 07:56:05 [als.py:176] Iteration 8: RMSE nan Elapsed 3.060 secs
[DEBUG   ] 2019-08-31 07:56:05 rowwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:56:06 [als.py:134] rowwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.3s)
[DEBUG   ] 2019-08-31 07:56:06 colwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:56:08 [als.py:134] colwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.77s)
[INFO    ] 2019-08-31 07:56:10 [als.py:173] Validation: ndcg:0.00000 map:0.00000 accuracy:0.00000 rmse:nan error:nan Elapsed 1.863 secs
[INFO    ] 2019-08-31 07:56:10 [als.py:176] Iteration 9: RMSE nan Elapsed 3.076 secs
[DEBUG   ] 2019-08-31 07:56:10 rowwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:56:11 [als.py:134] rowwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.26s)
[DEBUG   ] 2019-08-31 07:56:11 colwise:   0%| 00:00<?
[DEBUG   ] 2019-08-31 07:56:13 [als.py:134] colwise updated: processed(12600038) elapsed(data feed: 0.000s update: 1.78s)
[INFO    ] 2019-08-31 07:56:14 [als.py:173] Validation: ndcg:0.00000 map:0.00000 accuracy:0.00000 rmse:nan error:nan Elapsed 1.867 secs
[INFO    ] 2019-08-31 07:56:14 [als.py:176] Iteration 10: RMSE nan Elapsed 3.050 secs
[INFO    ] 2019-08-31 07:56:14 [als.py:182] elapsed for full epochs: 57.79 sec
metrics for validations
{
  "ndcg": 0.0,
  "map": 0.0,
  "accuracy": 0.0,
  "rmse": NaN,
  "error": NaN
}
Run hyper parameter optimization for val_ndcg...
{
  "alpha": 8,
  "d": 20,
  "reg_u": 0.1,
  "reg_i": 0.1
}
[INFO    ] 2019-08-31 07:56:16 optimizing... :   0%| 00:00<?
[INFO    ] 2019-08-31 07:56:29 [optimize.py:44] Starting with default parameter result: {'train_loss': nan, 'val_ndcg': 0.0, 'val_map': 0.0, 'val_accuracy': 0.0, 'val_rmse': nan, 'val_error': nan, 'eval_time': 1567205789.465719, 'loss': -0.0, 'status': 'ok'}
세그멘테이션 오류 (core dumped)

How can I solve this problem?

I'm using version 1.0.4 of Buffalo.

ummae commented 5 years ago

@ita9naiwa Would you try again with buffalo==1.0.5?

ita9naiwa commented 5 years ago

@ummae it works fine with buffalo==1.0.5 and movielens-100k dataset.

thanks!