microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

LightGBMClassifier fit on yarn #936

Open rusonding opened 4 years ago

rusonding commented 4 years ago

Describe the bug

  1. The pipeline_model.fit() call makes no progress; the Spark stage stays at 0 completed tasks (see the repartitioning sketch after this list)
  2. CSV data: 200 columns, 800,000 rows
  3. Training the same data on a single CentOS machine takes about 1 minute
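
One thing worth checking, as I understand the distributed mode: mmlspark's LightGBM starts one network worker per task and waits for every worker to connect before training begins, so fit() can sit at 0 completed tasks when the number of partitions does not line up with the tasks that can actually run concurrently. A minimal, hypothetical diagnostic sketch (assuming the df_train variable and the --num-executors setting shown later in this issue):

# Hypothetical sketch, not part of the original script: align the partition
# count with the executor count so every LightGBM network worker can start
# (and connect) at the same time before fit() is called.
num_executors = 20                        # matches --num-executors in the spark2-submit command
print(df_train.rdd.getNumPartitions())    # how many LightGBM workers fit() will try to launch
df_train = df_train.repartition(num_executors)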

To Reproduce:

spark2-submit --master yarn --jars file:///root/.ivy2/jars/com.microsoft.ml.spark_mmlspark_2.11-1.0.0-rc1.jar,file:///root/.ivy2/jars/com.microsoft.ml.lightgbm_lightgbmlib-2.3.100.jar --conf spark.pyspark.python=/usr/lib/anaconda2/envs/mmlspark/bin/python --num-executors 20 --executor-memory 10G test_mmlspark2.py

Expected behavior: the fit() call should make progress and finish training, as it does when the same data is trained on a single machine.

Info (please complete the following information):

Stacktrace

client:
[Stage 6:>                                                        (0 + 20) / 20] 

web:
![image](https://user-images.githubusercontent.com/21210508/96409909-a31c7b80-1218-11eb-9384-2bf6b3ebee6a.png)

yarn log:

[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 12456...
[LightGBM] [Info] Binding port 12456 succeeded
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Set TCP_NODELAY failed
... (the same TCP_NODELAY warning repeats another 35 times) ...
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Info] Connected to rank 2
[LightGBM] [Info] Connected to rank 3
[LightGBM] [Info] Connected to rank 4
[LightGBM] [Info] Connected to rank 5
[LightGBM] [Info] Connected to rank 6
[LightGBM] [Info] Connected to rank 7
[LightGBM] [Info] Connected to rank 8
[LightGBM] [Info] Connected to rank 9
[LightGBM] [Info] Connected to rank 10
[LightGBM] [Info] Connected to rank 11
[LightGBM] [Info] Connected to rank 12
[LightGBM] [Info] Connected to rank 13
[LightGBM] [Info] Connected to rank 14
[LightGBM] [Info] Connected to rank 15
[LightGBM] [Info] Connected to rank 16
[LightGBM] [Info] Connected to rank 17
[LightGBM] [Info] Connected to rank 19
[LightGBM] [Info] Local rank: 18, total number of machines: 20
[LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric=
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Number of positive: 173288, number of negative: 660791
[LightGBM] [Info] Total Bins 9954
[LightGBM] [Info] Number of data: 41899, number of used features: 196
[LightGBM] [Debug] Use subset for bagging
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.204635 -> initscore=-1.357573
[LightGBM] [Info] Start training from score -1.340715
[LightGBM] [Debug] Re-bagging, using 29329 data to train
[LightGBM] [Debug] Trained a tree with leaves = 8 and max_depth = 3
... (the Re-bagging / Trained a tree pair repeats for each subsequent iteration; the captured log ends after another Re-bagging line) ...

**Python code**

# coding=UTF-8

import numpy as np
import pyspark

spark = pyspark.sql.SparkSession.builder.appName("spark lightgbm") \
    .config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:0.18.1") \
    .getOrCreate()

# Note: this requests mmlspark 0.18.1, while the spark2-submit command above ships the 1.0.0-rc1 jars.
# The resource settings below only take effect if supplied before the session is created (via the
# builder or spark2-submit); calling spark.conf.set() on a running session does not change executor
# memory, cores, or parallelism.
spark.conf.set("spark.executor.memory", '18g')
spark.conf.set("spark.executor.cores", '20')
spark.conf.set("spark.default.parallelism", '300')
spark.conf.set("spark.cores.max", '30')
spark.conf.set("spark.driver.memory", '18g')
spark.conf.set("spark.yarn.executor.memoryOverhead", '10g')

import mmlspark
from mmlspark.lightgbm import LightGBMClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline

df_train = spark.read.format("csv") \
  .option("inferSchema", "true") \
  .option("header", "true") \
  .option("sep", ",") \
  .load("/model_data.csv")

feature_cols = list(df_train.columns)
feature_cols.remove("label")  
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features") 

for colName in df_train.columns:
  print(colName)
  df_train = df_train.withColumn(colName, df_train[colName].cast('float')) 

df_train = df_train.na.fill(0)

lgb = LightGBMClassifier(
    objective="binary",
    boostingType='gbdt',
    isUnbalance=True,
    featuresCol='features',
    labelCol='label',
    maxBin=60,
    baggingFreq=1,
    baggingSeed=696,
    earlyStoppingRound=30,
    learningRate=0.1,
    lambdaL1=1.0,
    lambdaL2=45.0,
    maxDepth=3,
    numLeaves=128,  # effectively capped at 8 here, since maxDepth=3 (the yarn log shows leaves = 8, max_depth = 3)
    baggingFraction=0.7,
    featureFraction=0.7,
    # minSumHessianInLeaf=1,
    numIterations=800,
    verbosity=30
)

stages = [assembler, lgb]

pipeline_model = Pipeline(stages=stages)

print("**********fit***************")
model = pipeline_model.fit(df_train)

print("**********transform***************")
train_preds = model.transform(df_train)
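
BinaryClassificationEvaluator is imported above but never used; once transform() returns, a small follow-up sketch (assuming the classifier's default rawPrediction output column) could score the training predictions:

# Hypothetical follow-up, assuming the default "rawPrediction" and "label" columns.
evaluator = BinaryClassificationEvaluator(
    rawPredictionCol="rawPrediction",
    labelCol="label",
    metricName="areaUnderROC")
print("train AUC:", evaluator.evaluate(train_preds))
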
kiminh commented 3 years ago

Encountering the same issue.