VerticaPy is a Python library that exposes scikit-learn-like functionality for conducting data science projects on data stored in Vertica, thus taking advantage of Vertica's speed and built-in analytics and machine learning capabilities.
Originally posted by **okankcb** April 22, 2024
Hello,
While conducting unit tests with the Titanic data on VerticaPy 1.0.1, I noticed a discrepancy between the AIC, BIC, and R2 scores calculated by VerticaPy and those calculated using scikit-learn.
**Here are the steps to reproduce the issue:**
**1. Use the Titanic data from VerticaPy:**
```python
from verticapy.datasets import load_titanic

# "schema" is assumed to be defined earlier with the target schema name.
titanic_recette = load_titanic(schema=schema, name='titanic_recette')
titanic_recette.eval("col_1", "1")                    # constant column of 1s
titanic_recette.cumsum(column="col_1", name="col_id")  # running sum -> row id
titanic_recette.drop(columns=["col_1"])
df_titanic = titanic_recette.to_pandas()
```
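For readers unfamiliar with the vDataFrame calls, the row-id construction above (a constant column of 1s, cumulatively summed, then dropped) corresponds to this pandas sketch with a hypothetical toy column:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]})
df["col_1"] = 1                        # constant column, as eval("col_1", "1")
df["col_id"] = df["col_1"].cumsum()    # running sum yields 1, 2, 3, ...
df = df.drop(columns=["col_1"])        # keep only the generated row id
print(df["col_id"].tolist())           # -> [1, 2, 3]
```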
**2. Train a linear regression model with VerticaPy and scikit-learn:**
```python
from verticapy import drop
from verticapy.machine_learning.vertica import LinearRegression

drop(schema + ".lr_titanic_recette22", method="model")
model = LinearRegression(name=schema + ".lr_titanic_recette22")
model.fit(titanic_recette, ["fare", "age"], "survived")
model.predict(titanic_recette, ["fare", "age"], "survived_pred")

import sklearn.linear_model as lr

# Drop the rows with missing predictors so both sides train on the same data.
df_titanic_copy = df_titanic.copy()
df_titanic_copy = df_titanic_copy.loc[df_titanic_copy["age"].notna()]
df_titanic_copy = df_titanic_copy.loc[df_titanic_copy["fare"].notna()]
reg = lr.LinearRegression().fit(df_titanic_copy[["fare", "age"]], df_titanic_copy["survived"])
df_titanic_copy["survived_pred"] = reg.predict(df_titanic_copy[["fare", "age"]])
df_titanic_copy["survived_pred"] = df_titanic_copy["survived_pred"].round(6)
df_titanic_copy["survived"] = df_titanic_copy["survived"].round(6)
```
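As a sanity check on the scikit-learn side, the fitted coefficients can be reproduced with a closed-form least-squares solve. The sketch below uses synthetic stand-in columns rather than the Titanic data, so it runs without a database connection:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))   # stand-ins for the "fare" and "age" columns
y = 0.2 + X @ np.array([0.1, -0.05]) + rng.normal(scale=0.1, size=200)

reg = LinearRegression().fit(X, y)

# Closed-form OLS with an explicit intercept column should match.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

assert np.allclose(beta[0], reg.intercept_)
assert np.allclose(beta[1:], reg.coef_)
```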
**3. Calculate AIC, BIC, and R2 scores with Verticapy:**
```python
import numpy as np
from verticapy.machine_learning.metrics import aic_score, bic_score, r2_score

vpy_aic = aic_score("survived", "survived_pred", titanic_recette, k=2)
vpy_bic = bic_score("survived", "survived_pred", titanic_recette, k=2)
vpy_R2score = np.round(r2_score("survived", "survived_pred", titanic_recette), 10)
```
**4. Calculate AIC, BIC, and R2 scores with Scikit-learn:**
```python
import numpy as np
from sklearn.metrics import mean_squared_error

n = len(df_titanic_copy)
k = 3  # 2 predictors + intercept
mse = mean_squared_error(df_titanic_copy["survived"], df_titanic_copy["survived_pred"])

# Gaussian-likelihood criteria (up to an additive constant):
py_aic = n * np.log(mse) + 2 * k
py_bic = n * np.log(mse) + k * np.log(n)
```
**Results obtained:**
VerticaPy AIC Score: -1863.62512466986
VerticaPy BIC Score: -1848.30522239793
VerticaPy R2 Score: 0.0813242788
Scikit-learn AIC Score: -1503.0605080304076
Scikit-learn BIC Score: -1488.3492662576539
Scikit-learn R2 Score: 0.0783248919
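For reference, the standard Gaussian-likelihood reductions of the criteria, up to an additive constant, are AIC = n·ln(MSE) + 2k and BIC = n·ln(MSE) + k·ln(n). The sketch below checks them on synthetic data, independently of the Titanic tables:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 3  # k counts the 2 predictors plus the intercept
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(scale=0.2, size=n)

# Ordinary least squares via a least-squares solve with an intercept column.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
mse = np.mean((y - A @ beta) ** 2)

# Gaussian-likelihood criteria, dropping the constant n * (1 + ln(2*pi)):
aic = n * np.log(mse) + 2 * k
bic = n * np.log(mse) + k * np.log(n)

# BIC penalizes parameters more heavily than AIC once ln(n) > 2 (n > ~7.4).
print(aic, bic)
```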
The AIC, BIC, and R2 scores obtained with Verticapy are significantly different from those calculated using scikit-learn.
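To help pin down which R2 implementation deviates, the metric can be recomputed from its definition: scikit-learn's `r2_score` follows R² = 1 − SS_res/SS_tot, with SS_tot taken around the mean of the observed values. A self-contained sketch with synthetic data:

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
y = rng.normal(size=300)
y_pred = y + rng.normal(scale=0.5, size=300)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
manual = 1 - ss_res / ss_tot

assert np.isclose(manual, r2_score(y, y_pred))
```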
With VerticaPy 0.12, the scores were identical between VerticaPy and scikit-learn.
Please examine this issue and keep me informed of any updates or solutions that may be provided.
Best regards,
Okan.K
Discussed in https://github.com/vertica/VerticaPy/discussions/1207