Open Shafi2016 opened 2 years ago
Line 396 in data.py, X = X.copy(), can be removed. Would you like to create a PR to remove that line and test it?
Thank you! Would you please elaborate on how to proceed with data.py? Sorry, I'm not familiar with PRs. I need to convert the Spark data frame to pandas or numpy, and when I convert to pandas I get memory issues: the notebook restarts with the message [Errno 111] Connection refused. So changing data.py would be the most viable option.
PR = pull request. The first step is to remove line 396 in data.py and try again. There could be new issues after that. Please let me know if that's the case. If you don't know how to do this and would like help, please share the dataset and use case so that others can test.
Thank you, I'm not sure how to proceed further and need help. Here is a notebook that reproduces the error: FLAML.
Hi, I'm afraid you're trying something impossible. FLAML is built on the in-memory Python data stack and interacts beautifully with pandas, numpy, scikit-learn... (py)spark data frames, on the other hand, can't easily be used interchangeably with this stack, so training or predicting a FLAML (or scikit-learn, TensorFlow, PyTorch...) estimator with a pyspark data frame simply does not work.
That's why everything works when you use the random X, y generated by scikit-learn, and fails when you substitute your original pyspark data frame. Errors often surface late in your code, because pyspark evaluates lazily and usually won't complain until you actually demand to see or calculate a result; a toy example of this is sketched below.
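A minimal sketch of that lazy-evaluation point (made-up data and column names, assuming a running Spark session): the failing division is only executed, and the error only raised, once an action forces the plan to run.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (0,)], ["x"])

# A transformation only records a query plan; nothing is computed yet,
# even though this UDF will divide by zero on the second row.
inverse = udf(lambda x: 100 // x, IntegerType())
plan = df.withColumn("inv", inverse("x"))  # no complaint here

plan.show()  # only this action runs the plan and surfaces the error
```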
What are your options?
I got the following error when running this notebook:
AnalysisException                         Traceback (most recent call last)
/tmp/ipykernel_33/1041752136.py in <module>
     13     .option("header", first_row_is_header)
     14     .option("sep", delimiter)
---> 15     .load(file_location)
     16 df.limit(2).toPandas().head()
     17

/opt/conda/lib/python3.7/site-packages/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    175         self.options(**options)
    176         if isinstance(path, str):
--> 177             return self._df(self._jreader.load(path))
    178         elif path is not None:
    179             if type(path) != list:

/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1320         answer = self.gateway_client.send_command(command)
   1321         return_value = get_return_value(
-> 1322             answer, self.gateway_client, self.target_id, self.name)
   1323
   1324         for temp_arg in temp_args:

/opt/conda/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
    194                 # Hide where the exception came from that shows a non-Pythonic
    195                 # JVM exception message.
--> 196                 raise converted from None
    197             else:
    198                 raise

AnalysisException: Path does not exist: file:/kaggle/input/anomaly-sub/train_sub.csv
I agree with @MichaelMarien that converting to a pandas dataframe is the first recommended approach. If you have to work with a pyspark dataframe for training, consider the following option: as long as you can train one model for a fixed configuration, you can wrap the training in a user-defined function and leverage the tuning API to perform hyperparameter tuning. A sketch of this is below.
@markusweimer Please chime in if you have suggestions, or if you think this is a motivating example for a deeper integration of flaml and synapseml.
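To make that concrete, here is a minimal sketch of the user-defined-function approach, assuming a SynapseML LightGBMClassifier and hypothetical train_df / val_df Spark DataFrames that already have an assembled "features" vector column and a "label" column; the tuned parameters and search ranges are illustrative only.

```python
from flaml import tune
from synapse.ml.lightgbm import LightGBMClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

def evaluate_config(config):
    # Train one model on the Spark DataFrame for a fixed configuration...
    model = LightGBMClassifier(
        featuresCol="features",
        labelCol="label",
        numLeaves=config["num_leaves"],
        learningRate=config["learning_rate"],
    ).fit(train_df)
    # ...and score it on a held-out Spark DataFrame (area under ROC).
    auc = BinaryClassificationEvaluator(labelCol="label").evaluate(
        model.transform(val_df)
    )
    return {"auc": auc}

# flaml's tuner searches the space by calling the function repeatedly.
analysis = tune.run(
    evaluate_config,
    config={
        "num_leaves": tune.lograndint(4, 256),
        "learning_rate": tune.loguniform(1e-3, 1.0),
    },
    metric="auc",
    mode="max",
    num_samples=20,
)
print(analysis.best_config)
```

Because each trial trains on the cluster via synapseml, only the tuning loop itself runs in the driver's memory.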
Thank you, @MichaelMarien and @sonichi. I have already tested with pandas parquet and I get the memory issues. I think this may be a limitation of FLAML with big data. The integration of FLAML with synapseml might be a better idea: there is no memory issue when I use the same data with LightGBMClassifier in synapseml.
@Shafi2016 thanks for confirming that. Since LightGBMClassifier with synapseml works, you can modify this example to do hyperparameter tuning for synapseml.LightGBMClassifier before the integration is made. You might need to rename the keys in params based on the parameter name correspondence between lightgbm and synapseml.LightGBMClassifier; a hypothetical sketch of that mapping is below.
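As an illustration of that renaming, the mapping below is my reading of the two libraries' docs, not an official table; verify each key before relying on it.

```python
# Hypothetical mapping from FLAML/lightgbm's snake_case parameter names
# to synapseml.LightGBMClassifier's camelCase Spark params.
PARAM_NAME_MAP = {
    "n_estimators": "numIterations",
    "num_leaves": "numLeaves",
    "learning_rate": "learningRate",
    "min_child_samples": "minDataInLeaf",
    "reg_alpha": "lambdaL1",
    "reg_lambda": "lambdaL2",
    "colsample_bytree": "featureFraction",
    "subsample": "baggingFraction",
}

def to_synapseml_params(params):
    # Translate the keys FLAML tunes; drop anything we can't map.
    return {PARAM_NAME_MAP[k]: v for k, v in params.items() if k in PARAM_NAME_MAP}

# e.g. to_synapseml_params({"num_leaves": 31, "learning_rate": 0.1})
# -> {"numLeaves": 31, "learningRate": 0.1}
```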
I'm using AutoML (FLAML) with Spark on large data. The error image is given below.
Everything works fine up to the above point. Now when I predict on test data using as