Open FutureGoose opened 11 months ago
On page 57, we encounter a similar problem where the model is being trained on both validation and training data:
7.5 Training the Number of Trees in the Forest
```python
from yellowbrick.model_selection import validation_curve

fig, ax = plt.subplots(figsize=(10, 4))
viz = validation_curve(
    xgb.XGBClassifier(random_state=42),
    x=pd.concat([X_train, X_test], axis='index'),  # test data leaks into tuning
    y=np.concatenate([y_train, y_test]),
    param_name='n_estimators', param_range=range(1, 100, 2),
    scoring='accuracy', cv=3,
    ax=ax)
```
```python
rf_xg29 = xgb.XGBRFClassifier(random_state=42, n_estimators=29)
rf_xg29.fit(X_train, y_train)
rf_xg29.score(X_test, y_test)
# 0.7480662983425415
```
EDIT: There's also a typo here: in `x=pd.concat([X_train, X_test], axis='index')`, the keyword argument should presumably be `X=`.
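For comparison, here's a minimal sketch of the leakage-free pattern I would expect: tune `n_estimators` via the validation curve on the training split only, and keep the test split held out for the final score. I'm using scikit-learn's `validation_curve` and `RandomForestClassifier` as stand-ins for the yellowbrick visualizer and `xgb.XGBRFClassifier`, and synthetic data via `make_classification`, so the names and numbers here are illustrative, not from the book:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, validation_curve

# Synthetic stand-in for the book's dataset.
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y)

param_range = range(1, 100, 20)
# Tune on the training split only -- the test split stays held out.
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=42),
    X_train, y_train,
    param_name='n_estimators', param_range=param_range,
    scoring='accuracy', cv=3)

# Pick the n_estimators with the best mean cross-validated score...
best_n = list(param_range)[np.argmax(val_scores.mean(axis=1))]
clf = RandomForestClassifier(random_state=42, n_estimators=best_n)
clf.fit(X_train, y_train)
# ...and only now touch the test set, for an honest generalization estimate.
print(clf.score(X_test, y_test))
```

The point is just the data flow: `validation_curve` never sees `X_test`, so the final `score` is computed on data the whole tuning procedure never touched.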
Hi Matt Harrison,
I'm thoroughly enjoying your book on XGBoost, but I noticed what might be data leakage during hyperparameter tuning. Specifically, on pages 46–49, it seems both training and test data are used for model fitting.
If this approach is intentional, could you please clarify the rationale? I'd greatly appreciate your insights.
Thank you again for the excellent book.
Example from p. 47 with my inline comments: