kanyun-inc / ytk-learn

Ytk-learn is a distributed machine learning library that implements most popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).
MIT License
347 stars 76 forks

[ERROR] waiting for heartbeat. master will be shutdowned! #5

Closed marshalWS closed 6 years ago

marshalWS commented 6 years ago

I have about 1.9 million (190W) lines of training data and 400,000 (40W) lines of test data. What does this error mean, and how can it be resolved?

2017.12.27 13:54:10 com.fenbi.mp4j.comm.CommMaster - slave num:1, port:65534
2017.12.27 13:54:10 org.apache.hadoop.ipc.CallQueueManager - Using callQueue class java.util.concurrent.LinkedBlockingQueue
2017.12.27 13:54:10 org.apache.hadoop.ipc.Server - Starting Socket Reader #1 for port 65534
2017.12.27 13:54:11 org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017.12.27 13:54:11 com.fenbi.mp4j.comm.CommMaster - rpc server started!, rpcport=65534
2017.12.27 13:54:11 com.fenbi.ytklearn.worker.TrainWorker - configFile:config/model/flt_gbdt.conf
2017.12.27 13:54:11 com.fenbi.ytklearn.worker.TrainWorker - configPath:config/model/flt_gbdt.conf
2017.12.27 13:54:11 com.fenbi.ytklearn.worker.TrainWorker - pyTransformScript:
2017.12.27 13:54:11 com.fenbi.ytklearn.worker.TrainWorker - loginName:user
2017.12.27 13:54:11 com.fenbi.ytklearn.worker.TrainWorker - hostName:BOAXGLNJW0FEFII, hostPort:65534
2017.12.27 13:54:11 com.fenbi.ytklearn.worker.TrainWorker - threadNum:6
2017.12.27 13:54:11 com.fenbi.ytklearn.worker.TrainWorker - modelName:gbdt
2017.12.27 13:54:11 org.apache.hadoop.ipc.Server - IPC Server listener on 65534: starting
2017.12.27 13:54:11 org.apache.hadoop.ipc.Server - IPC Server Responder: starting
2017.12.27 13:54:11 com.fenbi.mp4j.comm.ProcessCommSlave - master host:BOAXGLNJW0FEFII, master port:65534
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - connecting:BOAXGLNJW0FEFII###62585, connected count:1
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - host names before sort:[BOAXGLNJW0FEFII###62585]
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - host names after sort:[BOAXGLNJW0FEFII###62585]
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - current slave's rank:0, address:BOAXGLNJW0FEFII###62585
2017.12.27 13:54:11 com.fenbi.mp4j.comm.ProcessCommSlave - this slave recv data port:62585
2017.12.27 13:54:11 com.fenbi.mp4j.comm.ProcessCommSlave - slave num:1
2017.12.27 13:54:11 com.fenbi.mp4j.comm.ProcessCommSlave - slave rank:0
2017.12.27 13:54:11 com.fenbi.mp4j.comm.ProcessCommSlave - Pid is:7748
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - Pid is:7748
2017.12.27 13:54:11 com.fenbi.mp4j.comm.ProcessCommSlave - slaves addresses:
2017.12.27 13:54:11 com.fenbi.mp4j.comm.ProcessCommSlave - BOAXGLNJW0FEFII:62585
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - this slave init finished!
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - ################ parameters ################
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - data.delim.feature_name_val_delim=ConfigString(":")
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - model.dict_path=ConfigString("config/model/feat_dict")
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.min_split_loss=ConfigInt(0)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.max_leaf_cnt=ConfigInt(16)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - model.need_dict=ConfigBoolean(false)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - model.continue_train=ConfigBoolean(false)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - data.max_feature_dim=ConfigInt(40)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - data.delim.x_delim=ConfigString("###")
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - model.feature_importance_path=ConfigString("config/model/feature_importance")
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - data.test.max_error_tol=ConfigInt(0)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.min_split_samples=ConfigInt(-1)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.watch_test=ConfigBoolean(false)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - data.delim.y_delim=ConfigString(",")
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.min_child_hessian_sum=ConfigInt(1)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.watch_train=ConfigBoolean(false)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - data.delim.features_delim=ConfigString(" ")
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - data.y_sampling=SimpleConfigList([])
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.regularization.l1=ConfigInt(0)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.regularization.l2=ConfigInt(1)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.feature_sample_rate=ConfigDouble(0.8)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.max_depth=ConfigInt(7)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.tree_grow_policy=ConfigString("loss")
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.sample_dependent_base_prediction=ConfigBoolean(false)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.regularization.learning_rate=ConfigDouble(0.1)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.silent=ConfigInt(1)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - model.dump_freq=ConfigInt(-1)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - fs_scheme=ConfigString("local")
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.tree_maker=ConfigString("data")
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - feature.approximate=SimpleConfigList([{"cols":"default","type":"no_sample"}])
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - feature.missing_value=ConfigString("value")
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.histogram_pool_capacity=ConfigInt(-1)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - data.train.data_path=ConfigString("data/flt/train.ytklearn")
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - data.train.max_error_tol=ConfigInt(0)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - model.data_path=ConfigString("config/model/gbdt.model")
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.loss_function=ConfigString("sigmoid")
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.instance_sample_rate=ConfigDouble(0.8)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - feature.filter_threshold=ConfigInt(0)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - data.unassigned_mode=ConfigString("lines_avg")
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.uniform_base_prediction=ConfigDouble(0.5)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - data.assigned=ConfigBoolean(false)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.just_evaluate=ConfigBoolean(false)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - data.test.data_path=ConfigString("data/flt/test.ytklearn")
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - feature.split_type=ConfigString("mean")
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.eval_metric=SimpleConfigList(["confusion_matrix","auc"])
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.round_num=ConfigInt(300)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - verbose=ConfigBoolean(false)
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - optimization.max_abs_leaf_val=ConfigInt(-1)
2017.12.27 13:54:11 com.fenbi.ytklearn.worker.TrainWorker - file system uri:local, URI:local, URI tostring:local
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - commonParams:GBDTCommonParams(verbose=false, dataParams=DataParams(train=DataParams.Train(data_path=data/flt/train.ytklearn, max_error_tol=0), test=DataParams.Test(data_path=data/flt/test.ytklearn, max_error_tol=0), delim=DataParams.Delim(x_delim=###, y_delim=,, features_delim= , feature_name_val_delim=:), y_sampling=[], assigned=false, unassigned_mode=lines_avg), max_feature_dim=40, modelParams=GBDTModelParams(data_path=config/model/gbdt.model, need_dict=false, dict_path=config/model/feat_dict, dump_freq=2147483647, continue_train=false, feature_importance_path=config/model/feature_importance), featureParams=GBDTFeatureParams(split_Type=MEAN, enable_missing_value=true, featureMissingParams=value, needFeaAppro=true, feaApproConfList=[Config(SimpleConfigObject({"cols":"default","type":"no_sample"}))], featureApproximateParamList=null, verbose=false, filter_threshold=0), optimizationParams=GBDTOptimizationParams(learn_type=gradient_boosting, tree_maker_type=DATA_PARALLEL, round_num=300, max_depth=7, min_child_hessian_sum=1.0, max_leaf_cnt=16, min_split_loss=0.0, min_split_samples=-1, objective=sigmoid, sigmoid_zmax=0.0, max_abs_leaf_val=-1.0, lad_refine_appr=false, tree_grow_policy=LOSSCHG_WISE, histogram_pool_capacity=-1.0, regularization=GBDTOptimizationParams.Regularization(l1=0.0, l2=1.0, learningRate=0.1), uniform_base_prediction=0.5, sample_dependent_base_prediction=false, subsample=0.8, feature_sample_rate=0.8, class_num=1, just_evaluate=false, eval_metrics=[confusion_matrix, auc], watch_train=false, watch_test=false, verbose=false))
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - have no dict, we will collect feature dict...
2017.12.27 13:54:11 com.fenbi.mp4j.rpc.Server - #########read train data############
2017.12.27 13:55:05 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=1] has readed lines:10000
2017.12.27 13:55:05 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=2] has readed lines:10000
2017.12.27 13:55:05 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=3] has readed lines:10000
2017.12.27 13:55:05 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=5] has readed lines:10000
2017.12.27 13:55:05 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=0] has readed lines:10000
2017.12.27 13:55:05 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=4] has readed lines:10000
2017.12.27 13:55:47 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=4] has readed lines:20000
2017.12.27 13:55:47 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=1] has readed lines:20000
2017.12.27 13:55:47 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=0] has readed lines:20000
2017.12.27 13:55:47 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=2] has readed lines:20000
2017.12.27 13:55:47 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=3] has readed lines:20000
2017.12.27 13:55:47 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=5] has readed lines:20000
2017.12.27 13:56:25 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=1] has readed lines:30000
2017.12.27 13:56:25 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=4] has readed lines:30000
2017.12.27 13:56:25 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=3] has readed lines:30000
2017.12.27 13:56:25 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=0] has readed lines:30000
2017.12.27 13:56:30 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=2] has readed lines:30000
2017.12.27 13:56:30 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=5] has readed lines:30000
2017.12.27 13:57:07 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=4] has readed lines:40000
2017.12.27 13:57:07 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=3] has readed lines:40000
2017.12.27 13:57:10 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=0] has readed lines:40000
2017.12.27 13:57:10 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=2] has readed lines:40000
2017.12.27 13:57:10 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=1] has readed lines:40000
2017.12.27 13:57:10 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=5] has readed lines:40000
2017.12.27 13:57:58 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=4] has readed lines:50000
2017.12.27 13:57:58 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=2] has readed lines:50000
2017.12.27 13:57:58 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=3] has readed lines:50000
2017.12.27 13:58:01 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=1] has readed lines:50000
2017.12.27 13:58:01 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=5] has readed lines:50000
2017.12.27 13:58:01 com.fenbi.mp4j.rpc.Server - [rank=0] [threadId=0] has readed lines:50000
2017.12.27 14:19:28 com.fenbi.mp4j.rpc.Server - [ERROR] waiting for heartbeat timeout > 600000, master will be shutdowned!
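One thing that can be read off the log above is how long the process was silent before the error. The Python sketch below only does that arithmetic on two entries copied from the log: the last read-progress line and the error line. It assumes the `600000` in the error message is in milliseconds, which the log does not state, and the helper names (`parse_timestamp`, `silent_gap_seconds`) are made up for illustration; they are not part of ytk-learn or mp4j.

```python
from datetime import datetime

# Timestamp layout used by every entry in the log above.
TIMESTAMP_FORMAT = "%Y.%m.%d %H:%M:%S"
TIMESTAMP_LENGTH = len("2017.12.27 13:58:01")

# Two entries copied verbatim from the log: the last progress line and the error.
LAST_PROGRESS = (
    "2017.12.27 13:58:01 com.fenbi.mp4j.rpc.Server - "
    "[rank=0] [threadId=0] has readed lines:50000"
)
TIMEOUT_ERROR = (
    "2017.12.27 14:19:28 com.fenbi.mp4j.rpc.Server - "
    "[ERROR] waiting for heartbeat timeout > 600000, master will be shutdowned!"
)

# Assumption: the 600000 in the error message is a duration in milliseconds.
ASSUMED_TIMEOUT_SECONDS = 600000 / 1000


def parse_timestamp(line: str) -> datetime:
    """Parse the leading 'YYYY.MM.DD HH:MM:SS' prefix of a log line."""
    return datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)


def silent_gap_seconds(before: str, after: str) -> float:
    """Seconds elapsed between two log lines, based on their timestamps."""
    return (parse_timestamp(after) - parse_timestamp(before)).total_seconds()


if __name__ == "__main__":
    gap = silent_gap_seconds(LAST_PROGRESS, TIMEOUT_ERROR)
    print(f"silent gap: {gap:.0f} s, assumed timeout: {ASSUMED_TIMEOUT_SECONDS:.0f} s")
```

For these two entries the gap is 1287 seconds (21 min 27 s), which is longer than the assumed 600-second timeout. This measures the silence only; it does not show what the process was doing during that time or why the heartbeat stopped.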

marshalWS commented 6 years ago

Closed!