Open ankeshjha999 opened 3 years ago
👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.
@ankeshjha999 thank you for writing. To answer your question: that is absolutely not typical. Like most samples, it is designed to run in a matter of seconds, not minutes, much less 30 minutes. Even when I re-run it on the weakest, most modest machine configuration I can provision through Azure, it finishes in about 8 to 10 seconds.
So something is wrong, but it is unclear to me what. Nothing you have said raises any obvious alarm bells right off the bat. Is there more information you can provide about your environment? Anything out of the ordinary about it?
Hello @ankeshjha999 it's been almost two weeks, any follow-up? It would be nice to resolve the issue, if there is one.
Hi @TomFinley, apologies for the delayed response. Some more details: the stage that takes the longest in the Spark UI is `reduce at LightGBMBase.scala:228`. Sometimes the stage finishes faster, but it is always on the scale of minutes (2.5 minutes in the trial below).
Please let me know if there is any specific information you need to explore this further.
@ankeshjha999 The training is done at that step, so `reduce` is expected to take the longest, but 2.5 minutes does seem long for such a small dataset, and 30 minutes is definitely excessive. Do you have the output from the run on the cluster? LightGBM prints a line after each training iteration completes, which might help us understand which phase is taking the most time (e.g., initialization vs. training).
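If the per-iteration lines aren't visible at the default log level, something along these lines should surface them. This is just a sketch: I'm assuming the `verbosity` parameter exposed on the estimator (it maps to LightGBM's native verbosity) and that the output lands in the executor stderr logs.

```python
from mmlspark.lightgbm import LightGBMClassifier

# Sketch: raise LightGBM's native verbosity so initialization and each boosting
# iteration are logged; on a cluster the lines show up in the executor stderr logs.
classifier = LightGBMClassifier(
    objective="binary",
    labelCol="Bankrupt?",    # label column of the sample dataset (assumed)
    featuresCol="features",
    verbosity=2,             # 1 = info, > 1 = debug-level output
)
```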
Hi,
I was trying out mmlspark to run some test workloads on a spark cluster. I started out with the "Bankruptcy Prediction with LightGBM Classifier" example given in the LightGBM - Overview notebook.
The runtime of this example was > 30 minutes on the cluster, even though the dataset used in the example has just 6819 rows. I also checked that a single-machine LightGBM classifier with default parameters trains on the same dataset in < 1 second.
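For reference, the single-machine check was essentially the following (a sketch: the https URL is my guess at the public copy of the blob the notebook reads via `wasbs://`, and `Bankrupt?` is the label column of that dataset):

```python
import time

import lightgbm as lgb
import pandas as pd

# Assumed public https mirror of the sample data the notebook loads from
# wasbs://publicwasb@mmlspark.blob.core.windows.net/
df = pd.read_csv(
    "https://mmlspark.blob.core.windows.net/publicwasb/company_bankruptcy_prediction_data.csv"
)
X = df.drop(columns=["Bankrupt?"]).to_numpy()
y = df["Bankrupt?"].to_numpy()

start = time.time()
lgb.LGBMClassifier().fit(X, y)  # default parameters
print(f"single-machine fit: {time.time() - start:.2f}s")  # well under a second locally
```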
The code that I'm running is below (same as the example in the notebook) -
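Roughly the following (a sketch of the notebook example as I remember it; the dataset path, the `Bankrupt?` label column, and the `VectorAssembler` featurization are reconstructed from the published sample, so treat the details as approximate):

```python
from pyspark.ml.feature import VectorAssembler
from mmlspark.lightgbm import LightGBMClassifier

# `spark` is the active SparkSession of the notebook/cluster.
df = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load("wasbs://publicwasb@mmlspark.blob.core.windows.net/company_bankruptcy_prediction_data.csv")
)
train, test = df.randomSplit([0.85, 0.15], seed=1)

# Assemble every column except the label into a single feature vector.
feature_cols = [c for c in df.columns if c != "Bankrupt?"]
featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_data = featurizer.transform(train).select("Bankrupt?", "features")
test_data = featurizer.transform(test).select("Bankrupt?", "features")

model = LightGBMClassifier(
    objective="binary",
    featuresCol="features",
    labelCol="Bankrupt?",
    isUnbalance=True,
)
model = model.fit(train_data)
```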
I ran the example on Spark 2.4 with 2 executors (1 GB memory and 2 vcores each), and I'm using `mmlspark_2.11:1.0.0-rc3`.
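Concretely, the session is set up roughly like this (a sketch; the Maven coordinate is the one I'm using, while the resolver URL and app name are assumptions for illustration):

```python
from pyspark.sql import SparkSession

# Sketch of the session: Spark 2.4, 2 executors with 1 GB memory and 2 vcores each,
# pulling mmlspark_2.11:1.0.0-rc3 from the MMLSpark Maven resolver.
spark = (
    SparkSession.builder
    .appName("lightgbm-bankruptcy-sample")  # assumed name
    .config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc3")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")  # assumed resolver
    .config("spark.executor.instances", "2")
    .config("spark.executor.memory", "1g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)
```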
I'd appreciate your help in understanding whether this is the expected behaviour or whether I'm doing something wrong.
AB#1984519