microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

LightGBMClassifier high runtime on test dataset #1127

Open ankeshjha999 opened 3 years ago

ankeshjha999 commented 3 years ago

Hi,

I was trying out mmlspark on some test workloads on a Spark cluster. I started with the "Bankruptcy Prediction with LightGBM Classifier" example from the LightGBM - Overview notebook.

This example took more than 30 minutes to run on the cluster, even though the dataset has only 6,819 rows. For comparison, a single-machine LightGBM classifier with default parameters trains on the same dataset in under a second.
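For reference, the single-machine baseline looked roughly like this (a sketch, assuming a local copy of the CSV and the lightgbm and pandas packages):

import time
import pandas as pd
import lightgbm as lgb

# Assumes the dataset was downloaded locally from the blob URL used below.
df = pd.read_csv("company_bankruptcy_prediction_data.csv")
X, y = df.drop(columns=["Bankrupt?"]), df["Bankrupt?"]

start = time.time()
clf = lgb.LGBMClassifier()  # default parameters, single machine
clf.fit(X, y)
print("trained in %.2fs" % (time.time() - start))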

The code that I'm running is below (same as the example in the notebook) -

df = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load("wasbs://publicwasb@mmlspark.blob.core.windows.net/company_bankruptcy_prediction_data.csv")
)

print("records read: " + str(df.count()))

print("Schema: ")
df.printSchema()

train, test = df.randomSplit([0.85, 0.15], seed=1)

from pyspark.ml.feature import VectorAssembler

# Assemble every column except the label into a single feature vector.
feature_cols = df.columns[1:]
featurizer = VectorAssembler(
    inputCols=feature_cols,
    outputCol='features'
)
train_data = featurizer.transform(train).select('Bankrupt?', 'features')
test_data = featurizer.transform(test).select('Bankrupt?', 'features')

from mmlspark.lightgbm import LightGBMClassifier

# isUnbalance compensates for the skewed label distribution in this dataset.
model = LightGBMClassifier(objective="binary", featuresCol="features", labelCol="Bankrupt?", isUnbalance=True)

model = model.fit(train_data)
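Note: one thing that may be worth checking here is how the training data is partitioned, since distributed LightGBM coordinates a network worker per data partition, and many tiny partitions can make coordination dominate the runtime on a dataset this small. A sketch:

# Hypothetical diagnostic: count the partitions the 6,819 rows are spread over.
print("train partitions:", train_data.rdd.getNumPartitions())

# Coalescing down (e.g. to the total executor core count, 4 in the setup below)
# may reduce network-setup overhead before retrying the fit.
train_data = train_data.coalesce(4)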

I used 2 executors (1 GB memory + 2 vcores each) to run the example on Spark 2.4, with mmlspark_2.11:1.0.0-rc3.

I'd appreciate your help in understanding whether this is expected behaviour or whether I'm doing something wrong.

AB#1984519

welcome[bot] commented 3 years ago

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

TomFinley commented 3 years ago

@ankeshjha999 thank you for writing. To answer your question: that is absolutely not typical. Like most samples, it is meant to run in a matter of seconds, not minutes, much less 30 minutes. When I re-run it on even the weakest, most modest machine configurations I can provision through Azure, it finishes in about 8 to 10 seconds.

So something is wrong, but at least to me it is unclear what. Nothing you have said raises any alarm bells right off the bat as obviously wrong. Is there more information you can provide about your environment? Anything out of the ordinary about it?

TomFinley commented 3 years ago

Hello @ankeshjha999 it's been almost two weeks, any follow-up? It would be nice to resolve the issue, if there is one.

ankeshjha999 commented 3 years ago

Hi @TomFinley, apologies for the delayed response. Some more details:

  1. I ran the example notebook on the Cloudera Hadoop stack; the Spark version, more specifically, is Spark 2.4.0-cdh6.3.1 (Cloudera Spark).
  2. The step that takes all the time is reduce at LightGBMBase.scala:228. Sometimes the stage finishes faster, but it is always on the scale of minutes (2.5 minutes in the trial below; see the attached Spark UI screenshot from 2021-08-02).

Please let me know if there's any specific information you need to explore this further.

imatiach-msft commented 3 years ago

@ankeshjha999 Training happens at that step, so it is expected for reduce to take the longest, but 2.5 minutes does seem long for such a small dataset, and 30 minutes is definitely excessive. Do you have the output for the run from the cluster? It should print output after each training iteration completes, and that might help us understand which phase we are spending the most time in (e.g. initialization, training, etc.).
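If it helps, a sketch for timing the fit and surfacing that per-iteration output on the driver; the verbosity parameter here is an assumption based on the LightGBM parameters exposed in this release:

import time
from mmlspark.lightgbm import LightGBMClassifier

classifier = LightGBMClassifier(
    objective="binary",
    featuresCol="features",
    labelCol="Bankrupt?",
    isUnbalance=True,
    verbosity=1,  # assumption: positive values emit per-iteration training logs
)

start = time.time()
model = classifier.fit(train_data)
print("fit took %.1fs" % (time.time() - start))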