Snowpark XGBRegressor Ignores Sample Weights, Producing Identical Predictions for Different Models #111

Open robertlessmore opened 1 month ago

robertlessmore commented 1 month ago
  1. What version of Python are you using?

Python 3.11.8 | packaged by Anaconda, Inc. | (main, Feb 26 2024, 21:34:05) [MSC v.1916 64 bit (AMD64)]

What operating system and processor architecture are you using?

Windows-10-10.0.22631-SP0

  1. What are the component versions in the environment?

  1. What did you do?

from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.snowpark.functions import col, random, sin, when, lit
from utils import get_session

session = get_session.session()

N = 10**5
_ONE_MILLION = 10**6

df = session.range(1, N).to_df("ind").with_column( "x_0", ((random() % _ONE_MILLION)/_ONE_MILLION) )

df = df

df = df.with_columns(
    ["weights1", "weights2", "weights3"],
    [
        lit(1.0),
        when(col("ind") < lit(N / 10), 1.0).otherwise(0.0),
        when(col("ind") > lit(N / 10), 1.0).otherwise(0.0),
    ],
)

df = df.with_column(
    "target",
    when(col("ind") < lit(N / 10), 1.0).otherwise(0.0) * col("x_0")
    + when(col("ind") > lit(N / 10), 1.0).otherwise(0.0) * sin(10 * col("x_0")),
)

parameters = { "input_cols":["X_0"], "label_cols":["TARGET"], }

model1 = XGBRegressor(
    **parameters, sample_weight_col="weights1", output_cols=["PREDICTION1"],
)
model2 = XGBRegressor(
    **parameters, sample_weight_col="weights2", output_cols=["PREDICTION2"],
)
model3 = XGBRegressor(
    **parameters, sample_weight_col="weights3", output_cols=["PREDICTION3"],
)

models = [model1, model2, model3]
for m in models:
    m.fit(df)

test = session.range(-1, 1,0.01).to_df("X_0").with_column( "sinus", sin(10*col("X_0")) )

for m in models: test = m.predict(test)

test_snow = test.toPandas()
print(test_snow)

output:

       X_0      SINUS  PREDICTION1  PREDICTION2  PREDICTION3
0    -1.00   0.544021     0.515664     0.515664     0.515664
1    -0.99   0.457536     0.405519     0.405519     0.405519
2    -0.98   0.366479     0.183660     0.183660     0.183660
3    -0.97   0.271761     0.211220     0.211220     0.211220
4    -0.96   0.174327     0.039056     0.039056     0.039056
..     ...        ...          ...          ...          ...
195   0.95  -0.075151     0.047328     0.047328     0.047328
196   0.96  -0.174327    -0.060364    -0.060364    -0.060364
197   0.97  -0.271761     0.034832     0.034832     0.034832
198   0.98  -0.366479    -0.278535    -0.278535    -0.278535
199   0.99  -0.457536    -0.390598    -0.390598    -0.390598
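The three prediction columns in the output can also be compared programmatically instead of by eye. The following is a minimal sketch using pandas. The helper name `prediction_columns_identical` is hypothetical, and the small `sample` frame only copies the first five rows of the reported output for illustration. In the reproduction, the same helper could be applied to `test_snow`.

```python
import pandas as pd


def prediction_columns_identical(frame, columns):
    """Return True if every row holds the same value in all of `columns`.

    Hypothetical helper for this report, not part of any Snowflake API.
    """
    # nunique(axis=1) counts the distinct values per row across the chosen columns.
    return bool(frame[columns].nunique(axis=1).eq(1).all())


# First five rows of the output reported above, copied by hand.
reported = [0.515664, 0.405519, 0.183660, 0.211220, 0.039056]
sample = pd.DataFrame(
    {
        "X_0": [-1.00, -0.99, -0.98, -0.97, -0.96],
        "PREDICTION1": reported,
        "PREDICTION2": reported,
        "PREDICTION3": reported,
    }
)

pred_cols = ["PREDICTION1", "PREDICTION2", "PREDICTION3"]
identical = prediction_columns_identical(sample, pred_cols)
print("prediction columns identical:", identical)
```

A result of `True` on the full `test_snow` frame would confirm that the sample weight column made no difference to any of the three fitted models.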

  1. What did you expect to see?

I expected different models to produce different predictions due to the varying sample weights (weights1, weights2, weights3). Specifically:

However, the Snowflake Snowpark implementation of XGBRegressor seems to ignore the sample weights, resulting in identical predictions for all models. Running a similar experiment directly with the standard xgboost library outside of Snowflake results in distinct linear and sinusoidal predictions for model2 and model3, respectively.

sfc-gh-afero commented 1 month ago

Thank you for reporting this issue. I was able to reproduce it on my end using your example, and we will investigate it as a bug.