vertica / VerticaPy

VerticaPy is a Python library that exposes scikit-like functionality for conducting data science projects on data stored in Vertica, thus taking advantage of Vertica’s speed and built-in analytics and machine learning capabilities.
https://www.vertica.com/python/
Apache License 2.0
218 stars 44 forks

Verticapy 1.0.1 - Unit test failure for three regression metrics (Aic_score, Bic_score, R2_score) #1236

Open oualib opened 1 month ago

oualib commented 1 month ago

Discussed in https://github.com/vertica/VerticaPy/discussions/1207

Originally posted by **okankcb** April 22, 2024

Hello,

While running unit tests with the Titanic data on VerticaPy 1.0.1, I noticed a discrepancy between the AIC, BIC, and R2 scores calculated by VerticaPy and those calculated using scikit-learn.

**Here are the steps to reproduce the issue:**

**1. Use Titanic data from VerticaPy:**

```
from verticapy.datasets import load_titanic

# `schema` is assumed to be defined beforehand (name of a writable Vertica schema)
titanic_recette = load_titanic(schema=schema, name='titanic_recette')

# Build a sequential row id: constant column, cumulative sum, then drop the helper column
titanic_recette.eval("col_1", "1")
titanic_recette.cumsum(column="col_1", name="col_id")
titanic_recette.drop(columns=["col_1"])

# Local pandas copy used for the scikit-learn side of the comparison
df_titanic = titanic_recette.to_pandas()
```

**2. Train a linear regression model with VerticaPy and Python:**

```
from verticapy import drop
from verticapy.machine_learning.vertica import LinearRegression

# VerticaPy side: fit in-database and add the predictions as a new column
drop(schema + ".lr_titanic_recette22", method="model")
model = LinearRegression(name=schema + ".lr_titanic_recette22")
model.fit(titanic_recette, ["fare", "age"], "survived")
model.predict(titanic_recette, ["fare", "age"], "survived_pred")

import sklearn.linear_model as lr

# scikit-learn side: keep only rows where both predictors are present
df_titanic_copy = df_titanic.copy()
df_titanic_copy = df_titanic_copy.loc[df_titanic_copy["age"].notna()]
df_titanic_copy = df_titanic_copy.loc[df_titanic_copy["fare"].notna()]
reg = lr.LinearRegression().fit(df_titanic_copy[["fare", "age"]], df_titanic_copy["survived"])
df_titanic_copy["survived_pred"] = reg.predict(df_titanic_copy[["fare", "age"]])
df_titanic_copy["survived_pred"] = df_titanic_copy["survived_pred"].round(6)
df_titanic_copy["survived"] = df_titanic_copy["survived"].round(6)
```

**3. Calculate AIC, BIC, and R2 scores with VerticaPy:**

```
import numpy as np
from verticapy.machine_learning.metrics import aic_score, bic_score, r2_score

vpy_aic = aic_score("survived", "survived_pred", titanic_recette, k=2)
vpy_bic = bic_score("survived", "survived_pred", titanic_recette, k=2)
vpy_R2score = np.round(r2_score("survived", "survived_pred", titanic_recette), 10)
```

**4. Calculate AIC, BIC, and R2 scores with scikit-learn:**

```
import numpy as np
import sklearn as sk
import sklearn.metrics  # make sure the metrics submodule is loaded

n = len(df_titanic_copy)
k = 3  # 2 variables + intercept
mse = sk.metrics.mean_squared_error(df_titanic_copy["survived"], df_titanic_copy["survived_pred"])

# Note: k * log(n) is the usual BIC penalty and 2 * k the usual AIC penalty.
py_aic = n * np.log(mse) + k * np.log(n)
py_bic = n * np.log(mse) + 2 * k
```

**Results obtained:**

- VerticaPy AIC score: -1863.62512466986
- VerticaPy BIC score: -1848.30522239793
- VerticaPy R2 score: 0.0813242788
- scikit-learn AIC score: -1488.3492662576539
- scikit-learn BIC score: -1503.0605080304076
- scikit-learn R2 score: 0.0783248919

The AIC, BIC, and R2 scores obtained with VerticaPy differ significantly from those calculated using scikit-learn.

With VerticaPy 0.12, the same test gives identical scores for VerticaPy and scikit-learn.

Please examine this issue and keep me informed of any updates or solutions.

Best regards,
Okan.K
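
For anyone comparing the two sets of scores in the post above, here is a minimal NumPy sketch of the textbook definitions: AIC = n·ln(MSE) + 2k, BIC = n·ln(MSE) + k·ln(n), and R2 = 1 − SS_res/SS_tot. This is only the standard Gaussian-likelihood form (up to an additive constant) and is not necessarily what VerticaPy implements in any given release. The helper names `aic_bic_from_predictions`, `r2_from_predictions`, and `implied_sample_size` are hypothetical, and the toy arrays are illustrative, not Titanic values.

```python
import numpy as np


def aic_bic_from_predictions(y_true, y_pred, k):
    """Textbook AIC and BIC from predictions (up to an additive constant).

    k is the number of estimated parameters. Whether it counts the
    intercept is a convention that must match on both sides of a comparison.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    mse = np.mean((y_true - y_pred) ** 2)
    aic = n * np.log(mse) + 2 * k          # AIC penalty: 2k
    bic = n * np.log(mse) + k * np.log(n)  # BIC penalty: k * ln(n)
    return aic, bic


def r2_from_predictions(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def implied_sample_size(value_with_log_penalty, value_with_2k_penalty, k):
    """Solve k*ln(n) - 2k = gap for n, given two scores sharing n*ln(MSE)."""
    gap = value_with_log_penalty - value_with_2k_penalty
    return float(np.exp((gap + 2 * k) / k))


# Toy data, purely illustrative.
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
y_hat = np.array([0.2, 0.7, 0.1, 0.6, 0.9, 0.4, 0.8, 0.3, 0.5, 0.7])
aic, bic = aic_bic_from_predictions(y, y_hat, k=3)

# The two scikit-learn-side values posted in the issue, with k = 3.
n_implied = implied_sample_size(-1488.3492662576539, -1503.0605080304076, k=3)
print(aic, bic, r2_from_predictions(y, y_hat), round(n_implied))
```

Because the two scikit-learn-side values in the post share the same n·ln(MSE) term, their gap depends only on n and k. With the posted numbers and k = 3, the implied n rounds to 996. It may be worth comparing that with the number of rows the VerticaPy metrics are evaluated on, since the pandas copy drops rows with a missing age or fare, and checking that k follows the same convention on both sides (the post uses k=2 for VerticaPy and k = 3 for scikit-learn).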