microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

LightGBMClassificationModel.fit raises py4j.protocol.Py4JNetworkError("Answer from Java side is empty") #1325

Open Nitinsiwach opened 2 years ago

Nitinsiwach commented 2 years ago

Describe the bug
LightGBMClassificationModel.fit cannot handle large datasets. It fails even though nothing is ever collected at the driver:

- fit on data of shape (10000, 241): executes perfectly
- fit on data of shape (100000, 241): executes perfectly
- fit on data of shape (1000000, 241): the error shows up

I never see the warning "WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.".

To Reproduce
I can upload a complete example if necessary, but since this error appears only when I read all the rows of my data (with every other configuration unchanged), I hope that will not be needed. Please let me know if it is.

Expected behavior
Execution may become slow if Spark switches to disk caching, but in my opinion it should not fail.

Info (please complete the following information): Python 3.9.7, pyspark 3.2.0 (pip install pyspark). Spark session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.4") \
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
    .config("spark.driver.memory", "8g") \
    .getOrCreate()

Stacktrace

----------------------------------------
Exception occurred during processing of request from ('127.0.0.1', 35340)
Traceback (most recent call last):
  File "/home/nitin/miniconda3/envs/pyspark/lib/python3.9/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/home/nitin/miniconda3/envs/pyspark/lib/python3.9/socketserver.py", line 347, in process_request
    self.finish_request(request, client_address)
  File "/home/nitin/miniconda3/envs/pyspark/lib/python3.9/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/home/nitin/miniconda3/envs/pyspark/lib/python3.9/socketserver.py", line 747, in __init__
    self.handle()
  File "/home/nitin/miniconda3/envs/pyspark/lib/python3.9/site-packages/pyspark/accumulators.py", line 262, in handle
    poll(accum_updates)
  File "/home/nitin/miniconda3/envs/pyspark/lib/python3.9/site-packages/pyspark/accumulators.py", line 235, in poll
    if func():
  File "/home/nitin/miniconda3/envs/pyspark/lib/python3.9/site-packages/pyspark/accumulators.py", line 239, in accum_updates
    num_updates = read_int(self.rfile)
  File "/home/nitin/miniconda3/envs/pyspark/lib/python3.9/site-packages/pyspark/serializers.py", line 564, in read_int
    raise EOFError
EOFError
----------------------------------------
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/home/nitin/miniconda3/envs/pyspark/lib/python3.9/site-packages/py4j/clientserver.py", line 480, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nitin/miniconda3/envs/pyspark/lib/python3.9/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/home/nitin/miniconda3/envs/pyspark/lib/python3.9/site-packages/py4j/clientserver.py", line 503, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
/tmp/ipykernel_1227645/3903106392.py in <module>
      1 # lp = LineProfiler()
----> 2 scores = est.train_evaluate_cv('./mar19/training_data.csv', 'hash_CR_ACCOUNT_NBR', \
      3                                'flag__6_months', None, save_results=True,\
      4                               evaluate = True)

~/pymonsoon/./ml_auto_spark/cv_estimator.py in train_evaluate_cv(self, data_path, index, label, nrows, save_results, evaluate)
    308         df_modeling = self.prepare_modeling_data(df, self.n_splits, train=True).persist()
    309         print("Training Models")
--> 310         self.train_cv(df_modeling)
    311         print("Computing CV save artefacts")
    312         df_post_prediction = self.predict_oof_cv(df_modeling, evaluate=evaluate)

~/pymonsoon/./ml_auto_spark/cv_estimator.py in train_cv(self, df)
    131             self.run_params.update(update_params)
    132             model = self.model.setParams(**self.run_params)
--> 133             model = model.fit(df)
    134             self.trained_models.append(model)
    135 

~/miniconda3/envs/pyspark/lib/python3.9/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
    159                 return self.copy(params)._fit(dataset)
    160             else:
--> 161                 return self._fit(dataset)
    162         else:
    163             raise TypeError("Params must be either a param map or a list/tuple of param maps, "

/tmp/spark-6102c3e3-6d12-4007-8b72-8a4d20f8e325/userFiles-a7fdb181-3cd5-4dd2-b745-f59b96544e25/com.microsoft.azure_synapseml-lightgbm_2.12-0.9.4.jar/synapse/ml/lightgbm/LightGBMClassifier.py in _fit(self, dataset)
   1445 
   1446     def _fit(self, dataset):
-> 1447         java_model = self._fit_java(dataset)
   1448         return self._create_model(java_model)
   1449 

~/miniconda3/envs/pyspark/lib/python3.9/site-packages/pyspark/ml/wrapper.py in _fit_java(self, dataset)
    330         """
    331         self._transfer_params_to_java()
--> 332         return self._java_obj.fit(dataset._jdf)
    333 
    334     def _fit(self, dataset):

~/miniconda3/envs/pyspark/lib/python3.9/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1307 
   1308         answer = self.gateway_client.send_command(command)
-> 1309         return_value = get_return_value(
   1310             answer, self.gateway_client, self.target_id, self.name)
   1311 

~/miniconda3/envs/pyspark/lib/python3.9/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

~/miniconda3/envs/pyspark/lib/python3.9/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    332                     format(target_id, ".", name, value))
    333         else:
--> 334             raise Py4JError(
    335                 "An error occurred while calling {0}{1}{2}".
    336                 format(target_id, ".", name))

Py4JError: An error occurred while calling o52.fit
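For context on the root EOFError in the accumulator-server trace above: pyspark's read_int reads four big-endian bytes from the socket and raises EOFError when the read comes back empty, which happens when the JVM side has died (for example from an OOM) and closed the connection. A minimal sketch of that failure mode using only the standard library (this read_int is a simplified stand-in, not pyspark's actual function):

```python
import io
import struct

def read_int(stream):
    # simplified stand-in for pyspark.serializers.read_int:
    # read 4 big-endian bytes; an empty read means the peer
    # closed the socket, which surfaces as EOFError
    data = stream.read(4)
    if not data:
        raise EOFError
    return struct.unpack("!i", data)[0]

# a live connection delivers an integer normally
assert read_int(io.BytesIO(struct.pack("!i", 42))) == 42

# an empty (closed) stream reproduces the EOFError from the trace
try:
    read_int(io.BytesIO(b""))
except EOFError:
    print("EOFError: Java side closed the socket")
```

So the Py4JNetworkError is a symptom of the JVM process disappearing, not the underlying cause.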

Additional context

AB#1984487

imatiach-msft commented 2 years ago

@Nitinsiwach I'm really sorry, but from the given information I don't know what the issue could be. The "root" error message seems to be "py4j.protocol.Py4JNetworkError: Answer from Java side is empty", and I'm not sure how it relates to the LightGBM classification model. I do see that you said it started failing on more data; perhaps it ran into an OOM, in which case only increasing the number of nodes or the machine RAM might help.
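As a concrete version of that suggestion: the session in the report caps the driver at 8g, so one hedged mitigation is to raise the driver (and, on a cluster, executor) memory when building the session. This is a configuration sketch with illustrative values, not recommendations; size them to the actual machine RAM:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("MyApp")
    # same SynapseML package and repository as in the report
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.4")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    # illustrative values, raised from the reported 8g driver cap
    .config("spark.driver.memory", "16g")
    .config("spark.executor.memory", "16g")
    .getOrCreate()
)
```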