Open jovis-gnn opened 1 year ago
Hey @jovis-gnn :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.
Thanks, jovis-gnn, for reporting this.
The NoSuchElementException appears to be a consequence of Spark failing to compute an RDD partition or read it from a checkpoint, because the connection to the Spark driver is failing with java.net.ConnectException. Could you please investigate along these lines on the EMR cluster? In the meantime, we will look into this further on our end, because I also see a possibility of addressing the second half of this scenario on our side.
@svotaw can you also please take a look? Looks like we need to handle NoSuchElementException in Data Aggregator.
@saileshbaidya Thanks for the reply. I found the cause of this exception. I had specified validationIndicatorCol, but the validation column in my sample data contained no "True" values, so the lightgbm module threw NoSuchElementException. After changing some of those values to "True", the error was resolved.
By the way, after solving that problem, the data collection stage stayed pending for a long time (it never finished). After repartitioning the training set to 1 partition, it worked. Could you help me understand why (for example, is there a minimum or maximum number of partitions for training)?
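For anyone hitting the same NoSuchElementException, a pre-fit check along the following lines might catch an all-False validation column early. This is only a sketch, not part of SynapseML: the helper names `summarize_indicator` and `check_validation_split` are hypothetical, and it assumes the indicator column is boolean and that `df` is a regular PySpark DataFrame.

```python
def summarize_indicator(rows):
    """Sum row counts per indicator value.

    `rows` is an iterable of (flag, count) pairs. Null flags are
    counted as False here, which is an assumption of this sketch.
    """
    counts = {True: 0, False: 0}
    for flag, n in rows:
        counts[bool(flag)] += n
    return counts


def check_validation_split(df, indicator_col):
    """Raise ValueError if the indicator column leaves either split empty.

    Uses only standard DataFrame calls: groupBy(...).count().collect().
    """
    grouped = df.groupBy(indicator_col).count().collect()
    counts = summarize_indicator(
        (row[indicator_col], row["count"]) for row in grouped
    )
    if counts[True] == 0:
        raise ValueError(
            f"Column '{indicator_col}' has no True rows: the validation set would be empty."
        )
    if counts[False] == 0:
        raise ValueError(
            f"Column '{indicator_col}' has no False rows: the training set would be empty."
        )
    return counts
```

Calling `check_validation_split(train_df, "is_validation")` before `fit` (both names are placeholders) would then fail fast with a readable message instead of an exception from inside the training job.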
DataAggregator is being deprecated, so we won't mess with that. The newer "streaming" mode is available in our latest releases. Please ask for a copy if you want to try that (no official version yet with latest fixes).
The LightGBM algorithm does not work with auto-scaled clusters, so please turn off any scaling. It also helps to set "spark.dynamicAllocation.enabled": "false".
You are likely hitting scaling problems that affect networking (and look like hangs). By repartitioning to 1, you take networking out of the picture (only 1 node is used). You can try smaller partition counts to reduce hangs with the version you have.
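As a rough illustration of the two suggestions above, the sketch below disables dynamic allocation when the session is built and lists progressively smaller partition counts to try. The names `LIGHTGBM_SPARK_CONF`, `build_session`, and `candidate_partition_counts` are hypothetical, and the halving strategy is just one possible way to "try smaller numbers", not a recommendation from the SynapseML team. Cluster-level autoscaling on EMR is configured outside Spark and is not covered here.

```python
# The only setting taken from the thread; everything else is illustrative.
LIGHTGBM_SPARK_CONF = {"spark.dynamicAllocation.enabled": "false"}


def build_session(app_name="lightgbm-ranker", extra_conf=None):
    """Create a SparkSession with dynamic allocation turned off.

    The setting must be applied before the session exists; if a session
    is already running, getOrCreate() returns it and the executor
    allocation mode is not changed.
    """
    from pyspark.sql import SparkSession  # imported lazily so the helpers below work without Spark

    conf = {**LIGHTGBM_SPARK_CONF, **(extra_conf or {})}
    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def candidate_partition_counts(current, floor=1):
    """Return partition counts to try, halving from `current` down to `floor`."""
    counts = []
    n = current
    while n > floor:
        n = max(floor, n // 2)
        counts.append(n)
    return counts
```

With a training set that currently has 16 partitions, `candidate_partition_counts(16)` gives the counts to pass to `train_df.repartition(n)` one at a time, stopping at the largest value for which the fit no longer hangs.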
We have released 0.11.2, which has the final streaming features.
SynapseML version
0.10.2
System information
Describe the problem
I'm testing LightGBM on an EMR cluster. I created a sample dataset and tried to fit a LightGBMRanker model to it. I got some errors, and there seems to be a problem collecting the dataset. Please give me some feedback if you have any ideas.
Thank you.
Code to reproduce issue
Other info / logs
What component(s) does this bug affect?
area/cognitive: Cognitive project
area/core: Core project
area/deep-learning: DeepLearning project
area/lightgbm: Lightgbm project
area/opencv: Opencv project
area/vw: VW project
area/website: Website
area/build: Project build system
area/notebooks: Samples under notebooks folder
area/docker: Docker usage
area/models: models related issue
What language(s) does this bug affect?
language/scala: Scala source code
language/python: Pyspark APIs
language/r: R APIs
language/csharp: .NET APIs
language/new: Proposals for new client languages
What integration(s) does this bug affect?
integrations/synapse: Azure Synapse integrations
integrations/azureml: Azure ML integrations
integrations/databricks: Databricks integrations