snowflakedb / snowflake-ml-python

Apache License 2.0
XGBRegressor fit method numpy type error #96

Closed PrestonBlackburn closed 2 months ago

PrestonBlackburn commented 3 months ago

Hey, I'm having an issue where the XGBRegressor fit method throws a numpy type error. Snowpark ML installs numpy version 1.24, but when I run the .fit() method I get an error related to the numpy version (deprecation of the float type alias).

Original Code (simplified, but still errors)

from snowflake.ml.modeling.model_selection import GridSearchCV
from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.snowpark.types import DecimalType, IntegerType, DoubleType
from snowflake.snowpark.functions import cast

df_typed = df.select(cast(df["SEASON"], IntegerType()).as_("SEASON"),
                     cast(df["HOLIDAY"], IntegerType()).as_("HOLIDAY"),
                     cast(df["WORKINGDAY"], IntegerType()).as_("WORKINGDAY"),
                     cast(df["COUNT"], DoubleType()).as_("COUNT"),
                     )

param_grid = {
        "max_depth":[3, 4, 5, 6, 7, 8],
        "min_child_weight":[1, 2, 3, 4],
}

grid_search = GridSearchCV(
    estimator=XGBRegressor(),
    param_grid=param_grid,
    n_jobs = -1,
    scoring="neg_root_mean_squared_error",
    input_cols=["SEASON", "HOLIDAY", "WORKINGDAY"],
    label_cols=["COUNT"],
    output_cols=['PREDICTED_COUNT']
)

grid_search.fit(df_typed)

Error

AttributeError: module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

environment python version: 3.8.3

snowflake-ml-python==1.4.0
snowflake-snowpark-python==1.14.0
snowflake-connector-python==3.6.0
numpy==1.24.4
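
For context on the version numbers involved: the error message notes that the `np.float` alias was deprecated in NumPy 1.20, and it was removed outright in NumPy 1.24, so any installed package that still references it fails on 1.24.x. A minimal sketch of that timeline follows; `np_float_alias_status` is a hypothetical helper name used only for illustration, and it relies on nothing beyond the standard library.

```python
import re


def np_float_alias_status(version: str) -> str:
    """Classify how a NumPy release treats the `np.float` alias.

    Hypothetical helper for illustration; not part of NumPy or snowflake-ml.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    if major_minor >= (1, 24):
        return "removed"      # accessing np.float raises AttributeError
    if major_minor >= (1, 20):
        return "deprecated"   # still works, but emits a DeprecationWarning
    return "available"


print(np_float_alias_status("1.24.4"))
```

With the pinned `numpy==1.24.4` this reports that the alias is gone, which matches the AttributeError above.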

If I change my numpy version to <1.20 then GridSearchCV throws an error when trying to import:

cannot import name 'GridSearchCV' from 'snowflake.ml.modeling.model_selection'

sfc-gh-afero commented 3 months ago

Hi @PrestonBlackburn - I'd be happy to help you. In an attempt to reproduce, I made a script that loads sample sklearn data, does a similar snowflake casting, and runs the grid search fit.

I was unable to reproduce your error with the script I made below, so there may be an interaction with your particular data. Two questions:

  1. Do you know what line of code the error is being thrown from?
  2. Is it possible to share a fully reproducible example so that I can reproduce on my end?

In an attempt to find the deprecated attributes, I did a search for np.float in our code base but I don't see any references, so I'd be curious to see where the error is being raised from.
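
One way to answer the first question is to catch the exception and inspect the innermost traceback frame. The sketch below uses only the standard `traceback` module; `innermost_frame` and `failing_call` are hypothetical names, and `failing_call` is a stand-in for the real `grid_search.fit(df_typed)` call, which is assumed to raise the same exception type.

```python
import traceback


def innermost_frame(exc: BaseException) -> str:
    """Return 'file:line in function' for the frame where exc was raised."""
    frames = traceback.extract_tb(exc.__traceback__)
    last = frames[-1]  # the last entry is the frame closest to the raise
    return f"{last.filename}:{last.lineno} in {last.name}"


def failing_call():
    # Stand-in for grid_search.fit(df_typed); raises the same exception type.
    raise AttributeError("module 'numpy' has no attribute 'float'")


try:
    failing_call()
except AttributeError as exc:
    location = innermost_frame(exc)
    print(location)
```

Running the real script with the full traceback enabled gives the same information without any extra code; the helper is only useful when the traceback is being swallowed or truncated.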

My repro:

From pip freeze:

numpy==1.24.4
snowflake-connector-python==3.6.0
snowflake-ml-python==1.4.0
snowflake-snowpark-python==1.14.0

Script:

from snowflake.ml.modeling.model_selection import GridSearchCV
from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.snowpark.types import DecimalType, IntegerType, DoubleType
from snowflake.snowpark.functions import cast
from snowflake.snowpark import Session
from sklearn.datasets import fetch_california_housing

INPUT_COLS = ["MEDINC", "AVEROOMS", "LATITUDE", "LONGITUDE"]
LABEL_COLS = ["MEDHOUSEVAL"]

session = Session.builder.create()

def load_housing_data():
    input_df_pandas = fetch_california_housing(as_frame=True).frame
    input_df_pandas.columns = [c.upper() for c in input_df_pandas.columns]
    input_df = session.create_dataframe(input_df_pandas)

    return input_df

df = load_housing_data()

df_typed = df.select(cast(df["MEDINC"], DoubleType()).as_("MEDINC"),
                     cast(df["AVEROOMS"], IntegerType()).as_("AVEROOMS"),
                     cast(df["LATITUDE"], DoubleType()).as_("LATITUDE"),
                     cast(df["LONGITUDE"], DoubleType()).as_("LONGITUDE"),
                     cast(df["MEDHOUSEVAL"], DoubleType()).as_("MEDHOUSEVAL"),
                     )

param_grid = {
        "max_depth":[3, 4, 5, 6, 7, 8],
        "min_child_weight":[1, 2, 3, 4],
}

grid_search = GridSearchCV(
    estimator=XGBRegressor(),
    param_grid=param_grid,
    n_jobs = -1,
    scoring="neg_root_mean_squared_error",
    input_cols=INPUT_COLS,
    label_cols=LABEL_COLS,
    output_cols=['PREDICTED_COUNT']
)

grid_search.fit(df_typed)

PrestonBlackburn commented 2 months ago

Hey, thanks for following up. I tested the same code with the same Python 3.8.3 version and requirements in a separate conda env, but when I did that, it worked. I guess it must have been some sort of issue with that particular environment.

Here is the full error, but since I can't reproduce it in another environment, I think this can probably be closed.


Traceback (most recent call last):
  File "C:\Users\Preston\anaconda3\lib\site-packages\snowflake\ml\_internal\telemetry.py", line 367, in wrap
    res = func(*args, **kwargs)
  File "C:\Users\Preston\anaconda3\lib\site-packages\snowflake\ml\modeling\model_selection\grid_search_cv.py", line 331, in fit
    self._sklearn_object = model_trainer.train()
  File "C:\Users\Preston\anaconda3\lib\site-packages\snowflake\ml\modeling\_internal\snowpark_implementations\snowpark_trainer.py", line 433, in train
    fit_wrapper_sproc = self._get_fit_wrapper_sproc(statement_params=statement_params)
  File "C:\Users\Preston\anaconda3\lib\site-packages\snowflake\ml\modeling\_internal\snowpark_implementations\snowpark_trainer.py", line 253, in _get_fit_wrapper_sproc
    model_spec = ModelSpecificationsBuilder.build(model=self.estimator)
  File "C:\Users\Preston\anaconda3\lib\site-packages\snowflake\ml\modeling\_internal\model_specifications.py", line 132, in build
    return SklearnModelSelectionModelSpecifications()
  File "C:\Users\Preston\anaconda3\lib\site-packages\snowflake\ml\modeling\_internal\model_specifications.py", line 97, in __init__
    import lightgbm
  File "C:\Users\Preston\anaconda3\lib\site-packages\lightgbm\__init__.py", line 8, in <module> 
    from .basic import Booster, Dataset, register_logger
  File "C:\Users\Preston\anaconda3\lib\site-packages\lightgbm\basic.py", line 17, in <module>   
    from .compat import PANDAS_INSTALLED, concat, dt_DataTable, is_dtype_sparse, pd_DataFrame, pd_Series
  File "C:\Users\Preston\anaconda3\lib\site-packages\lightgbm\compat.py", line 114, in <module> 
    from dask.array import Array as dask_Array
  File "C:\Users\Preston\anaconda3\lib\site-packages\dask\array\__init__.py", line 3, in <module>
    from .core import (
  File "C:\Users\Preston\anaconda3\lib\site-packages\dask\array\core.py", line 22, in <module>  
    from . import chunk
  File "C:\Users\Preston\anaconda3\lib\site-packages\dask\array\chunk.py", line 7, in <module>  
    from . import numpy_compat as npcompat
  File "C:\Users\Preston\anaconda3\lib\site-packages\dask\array\numpy_compat.py", line 21, in <module>
    np.divide(0.4, 1, casting="unsafe", dtype=np.float),
  File "C:\Users\Preston\anaconda3\lib\site-packages\numpy\__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "issue_test.py", line 65, in <module>
    grid_search.fit(df_typed)
  File "C:\Users\Preston\anaconda3\lib\site-packages\snowflake\ml\_internal\telemetry.py", line 389, in wrap
    raise me.original_exception from e
AttributeError: (0000) module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

sfc-gh-afero commented 2 months ago

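@PrestonBlackburn Thanks for sharing the full error. This provides many more clues as to what happened. The error is actually being raised while executing import lightgbm: per the traceback, lightgbm's compat module imports dask.array, and dask's numpy_compat.py still references np.float. So the root of the issue is likely that the lightgbm version in that environment (together with the dask version it imports) is not compatible with numpy 1.24.4. On the environment that's failing, what version of lightgbm do you have installed?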
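
A quick way to collect that information is to query the installed distribution metadata, which works even when the package itself fails to import. This is a minimal sketch using `importlib.metadata` (standard library since Python 3.8); `installed_versions` is a hypothetical helper name, and dask is included only because the traceback shows the exception coming from `dask/array/numpy_compat.py`.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["numpy", "lightgbm", "dask"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Comparing this output between the failing environment and the working conda env should show which package differs.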

You may be wondering, why are we importing lightgbm at all since you're running an xgboost model?

When we execute the fit method from the snowflake-ml client, we often implement it as a stored procedure in Snowflake. In some cases, as an optimization, we re-use the same stored procedure for consecutive grid searches. Because grid search is a "composed" estimator, it can be run with many different types of models, including both xgboost and lightgbm. As such, when we create the stored procedure we include both xgboost and lightgbm as dependencies if lightgbm is available in the client's environment. I expect that your new conda env worked for one of two reasons: 1) you did not install lightgbm at all, or 2) you installed a different version that is compatible with numpy 1.24.
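
The "include it if it is available" pattern described above can be sketched as follows. This is not the snowflake-ml implementation: `available_packages` is a hypothetical helper, and it uses `importlib.util.find_spec`, which only locates a package, whereas the traceback shows the library running an actual `import lightgbm`.

```python
import importlib.util


def available_packages(required, optional):
    """Return required packages plus whichever optional ones are importable.

    Hypothetical helper; not the snowflake-ml implementation.
    """
    deps = list(required)
    for name in optional:
        # find_spec locates the top-level package without executing it, so a
        # broken install is detected as "present" but is not imported here.
        if importlib.util.find_spec(name) is not None:
            deps.append(name)
    return deps


deps = available_packages(["xgboost"], ["lightgbm", "no_such_package_xyz"])
print(deps)
```

The difference matters for this issue: a real import executes the package and everything it pulls in, so an incompatible transitive dependency surfaces as an error during fit even though the model being trained is xgboost.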

sfc-gh-afero commented 2 months ago

@PrestonBlackburn Marking as closed because this seems to be an incompatibility in your local Python environment. If you are able to reproduce it with a version of lightgbm we support, please do update us.