microsoft / FLAML

AttributeError: 'DataFrame' object has no attribute 'copy' #625

Open Shafi2016 opened 2 years ago

Shafi2016 commented 2 years ago

I'm using AutoML (FLAML) with Spark on large data. The error screenshot is given below.

# `spark` is the active SparkSession (e.g., provided by the notebook)
from pyspark.ml.feature import VectorAssembler

train = spark.read.parquet("./train.parquet")
test = spark.read.parquet("./test.parquet")

input_cols = [c for c in train.columns if c != 'target']
vectorAssembler = VectorAssembler(inputCols=input_cols, outputCol='features')
vectorAssembler.setHandleInvalid("skip").transform(train).show()
train_sprk = vectorAssembler.transform(train)
test_sprk = vectorAssembler.transform(test)

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

y = train_sprk["target"]
X = train_sprk[input_cols]
# note: make_classification() overwrites X and y with a random in-memory
# dataset, so fit() below never actually sees the Spark data
X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y)

from flaml import AutoML

automl = AutoML()

from flaml import logger
import logging
logger.setLevel(logging.WARNING)

settings = {
    "time_budget": 200,  # total running time in seconds
    "metric": 'roc_auc',  # other options: 'r2', 'rmse', 'mae', 'mse', 'accuracy',
                          # 'roc_auc_ovr', 'roc_auc_ovo', 'log_loss', 'mape',
                          # 'f1', 'ap', 'ndcg', 'micro_f1', 'macro_f1'
    "estimator_list": ['lgbm', 'xgboost'],  # constrain the search to these learners
    "task": 'classification',  # task type
    "log_file_name": 'airlines_experiment.log',  # flaml log file
    "seed": 22,  # random seed
    "verbose": 0,
}
automl.fit(X_train=X_train, y_train=y_train, **settings)
# retrieve best config and best learner
print('Best ML learner:', automl.best_estimator)
print('Best hyperparameter config:', automl.best_config)
print('Best accuracy on validation data: {0:.4g}'.format(1 - automl.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))

Everything works fine up to this point. But when I predict on the test data with:

y_pred = automl.predict(test_sprk)
print('Predicted labels', y_pred)

[screenshot: AttributeError: 'DataFrame' object has no attribute 'copy']

sonichi commented 2 years ago

Line 396 in data.py, X = X.copy(), can be removed. Would you like to create a PR to remove that line and test it?
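
For context, a minimal sketch of why that line fails on Spark input, and a defensive alternative to deleting it outright; the helper name is hypothetical, not FLAML's actual code:

import pandas as pd

def _maybe_copy(X):
    # pandas DataFrames have .copy(); pyspark DataFrames do not, which is
    # what raises AttributeError: 'DataFrame' object has no attribute 'copy'.
    # Copy only when X really is a pandas DataFrame.
    return X.copy() if isinstance(X, pd.DataFrame) else X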

Shafi2016 commented 2 years ago

Thank you! Would you please elaborate more on how to proceed with data.py? Sorry, I'm not familiar with PRs. I need to convert the Spark DataFrame to pandas or NumPy, but when I convert to pandas I get memory issues and the notebook restarts with the message [Errno 111] Connection refused. So changing data.py would be the most viable option.

sonichi commented 2 years ago

> Thank you! Would you please elaborate more on how to proceed with data.py? Sorry, I'm not familiar with PRs. I need to convert the Spark DataFrame to pandas or NumPy, but when I convert to pandas I get memory issues and the notebook restarts with the message [Errno 111] Connection refused. So changing data.py would be the most viable option.

PR = pull request. The first step is to remove line 396 in data.py and try again. There could be new issues after that. Please let me know if that's the case. If you don't know how to do this and would like help, please share the dataset and use case so that others can test.

Shafi2016 commented 2 years ago

Thank you, I'm not sure how to proceed further. I need help. Here is the notebook that reproduces the error.

MichaelMarien commented 2 years ago

Hi, I'm afraid you're trying something impossible. FLAML is built on the in-memory Python data stack and interacts beautifully with pandas, numpy, scikit-learn... On the other hand, (py)spark data frames can't easily be used interchangeably with this stack. Thus, training or predicting with a FLAML (or scikit-learn, TensorFlow, PyTorch...) estimator on a pyspark data frame simply does not work.

That's why everything works when you use the random X, y generated by scikit-learn, and fails when you substitute your original pyspark data frame. Errors often surface late in your code, because pyspark evaluates lazily and usually won't complain until you actually demand to see or compute a result.

What are your options?

  1. Bring your pyspark data frames to pandas. The most stable route is saving to parquet and loading with pandas.read_parquet (install pyarrow), provided your data fits in memory (perhaps otherwise sample?). toPandas is an in-memory alternative, but won't work for larger data frames.
  2. For predicting, you can use pandas UDFs in pyspark (see the sketch after this list); see for instance https://medium.com/civis-analytics/prediction-at-scale-with-scikit-learn-and-pyspark-pandas-udfs-51d5ebfb2cd8
  3. For training, I'm afraid it won't be easy to make FLAML train directly on a pyspark data frame :(
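
A minimal sketch of option 2, assuming the fitted automl object and the input_cols / test_sprk variables from the code above, and that the predicted labels are numeric; the UDF wiring is illustrative, not tested on this dataset:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def predict_udf(*cols: pd.Series) -> pd.Series:
    # each batch arrives as pandas Series, which fits FLAML's in-memory stack;
    # reassemble the feature frame and predict batch by batch
    X = pd.concat(cols, axis=1)
    X.columns = input_cols
    return pd.Series(automl.predict(X)).astype(float)

test_with_pred = test_sprk.withColumn(
    "prediction", predict_udf(*[test_sprk[c] for c in input_cols])
)
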
sonichi commented 2 years ago

> Thank you, I'm not sure how to proceed further. I need help. Here is the notebook that reproduces the error.

I got the following error when running this notebook:

AnalysisException                         Traceback (most recent call last)
/tmp/ipykernel_33/1041752136.py in <module>
     13     .option("header", first_row_is_header) \
     14     .option("sep", delimiter) \
---> 15     .load(file_location)
     16 df.limit(2).toPandas().head()
     17

/opt/conda/lib/python3.7/site-packages/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    175         self.options(**options)
    176         if isinstance(path, str):
--> 177             return self._df(self._jreader.load(path))
    178         elif path is not None:
    179             if type(path) != list:

/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1320         answer = self.gateway_client.send_command(command)
   1321         return_value = get_return_value(
-> 1322             answer, self.gateway_client, self.target_id, self.name)
   1323
   1324         for temp_arg in temp_args:

/opt/conda/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
    194                 # Hide where the exception came from that shows a non-Pythonic
    195                 # JVM exception message.
--> 196                 raise converted from None
    197             else:
    198                 raise

AnalysisException: Path does not exist: file:/kaggle/input/anomaly-sub/train_sub.csv

I agree with @MichaelMarien that converting to a pandas DataFrame is the first recommended approach. If you have to work with a pyspark DataFrame for training, consider the following option: as long as you can make training a single model work for a fixed configuration, you can wrap that in a user-defined function and leverage the tuning API to perform hyperparameter tuning.

@markusweimer Please chime in if you have suggestions or if you think this is a motivating example of a deeper integration of flaml and synapseml.
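
A rough sketch of that suggestion, assuming SynapseML's LightGBMClassifier already trains on the assembled data (train_sprk plus a held-out val_sprk split, both assumed to exist); the search space and metric wiring are illustrative, not FLAML defaults:

from flaml import tune
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from synapse.ml.lightgbm import LightGBMClassifier

evaluator = BinaryClassificationEvaluator(labelCol="target", metricName="areaUnderROC")

def train_lgbm(config):
    # train one model for a fixed configuration...
    model = LightGBMClassifier(
        featuresCol="features",
        labelCol="target",
        numLeaves=config["num_leaves"],
        learningRate=config["learning_rate"],
    ).fit(train_sprk)
    # ...and report its validation metric back to the tuner
    return {"roc_auc": evaluator.evaluate(model.transform(val_sprk))}

analysis = tune.run(
    train_lgbm,
    config={
        "num_leaves": tune.lograndint(4, 512),
        "learning_rate": tune.loguniform(1e-3, 1.0),
    },
    metric="roc_auc",
    mode="max",
    time_budget_s=200,
)
print(analysis.best_config)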

Shafi2016 commented 2 years ago

Thank you, @MichaelMarien and @sonichi. I have already tested with pandas and parquet and ran into the memory issues. I think this may be a limitation of FLAML with big data, so the integration of FLAML with synapseml might be a better idea. There is no memory issue when I use the same data with LightGBMClassifier in synapseml.

sonichi commented 2 years ago

@Shafi2016 thanks for confirming that. Since LightGBMClassifier with synapseml works, you can modify this example to do hyperparameter tuning for synapseml.LightGBMClassifier until the integration is made. You might need to rename the keys in params according to the parameter-name correspondence between lightgbm and synapseml.LightGBMClassifier.
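
To illustrate the renaming, a hedged sketch: the mapping below covers a few common parameters and should be checked against the SynapseML docs rather than taken as complete.

# lightgbm (Python, snake_case) -> synapseml LightGBMClassifier (camelCase)
PARAM_MAP = {
    "n_estimators": "numIterations",
    "num_leaves": "numLeaves",
    "learning_rate": "learningRate",
    "min_child_samples": "minDataInLeaf",
    "colsample_bytree": "featureFraction",
    "subsample": "baggingFraction",
    "reg_alpha": "lambdaL1",
    "reg_lambda": "lambdaL2",
}

def to_synapseml_params(params):
    # rename known keys, pass unrecognized ones through unchanged
    return {PARAM_MAP.get(k, k): v for k, v in params.items()}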