o19s / opensearch-hybrid-search-optimization

This repository is meant to optimize hybrid search settings for OpenSearch. It covers a grid search approach to identify a good parameter set and a model-based approach that dynamically identifies good settings for a query.

Iterate on model fitting and evaluation #1

Closed wrigleyDan closed 2 weeks ago

wrigleyDan commented 3 weeks ago

To try to improve model performance through the actions identified in #24.

wrigleyDan commented 3 weeks ago

Training models with an 80/20 split of 5,000 queries and trying out different feature combinations with cross-validation now shows a reduced spread across the 5 folds of the cross-validation runs:

Image

From minimum to maximum the difference does not exceed 0.01.

The best combinations now are either all features (random forest) or all but two features (all except f_2_query_length and f_4_has_special_char).
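As an illustration, here is a minimal sketch of how such a feature-combination sweep with 5-fold cross-validation could look; the DataFrame `train_df`, the target column name, and the restriction to the features named in this issue are assumptions:

```python
from itertools import combinations

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumed names: train_df holds the 80% training split, "target" the relevance label.
features = [
    "f_2_query_length", "f_4_has_special_char", "f_5_num_results",
    "f_6_max_title_score", "f_7_sum_title_scores", "f_9_avg_semantic_score",
]
cv = KFold(n_splits=5, shuffle=True, random_state=42)

spread = {}
for k in range(1, len(features) + 1):
    for combo in combinations(features, k):
        rmse = -cross_val_score(
            LinearRegression(),
            train_df[list(combo)], train_df["target"],
            scoring="neg_root_mean_squared_error", cv=cv,
        )
        # Track min/max/mean RMSE across the 5 folds for each combination.
        spread[combo] = (rmse.min(), rmse.max(), rmse.mean())

best = min(spread, key=lambda c: spread[c][2])
print(best, spread[best])
```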

Comparing the metrics on the 1,000 test queries for the dynamic, model-driven approach and the global optimization approach:

| Metric    | Baseline | Linear | Random Forest |
|-----------|----------|--------|---------------|
| DCG       | 5.96     | 6.04   | 6.03          |
| NDCG      | 0.27     | 0.27   | 0.27          |
| Precision | 0.30     | 0.31   | 0.31          |

The baseline here was the search pipeline configuration using L2 normalization, the arithmetic mean as the combination technique, a neural search weight ("neuralness") of 0.6, and a keyword search weight of 0.4.
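For reference, a sketch of what such a baseline pipeline definition looks like in OpenSearch; the host and pipeline name are placeholders, and the processor settings mirror the configuration described above:

```python
import requests

pipeline = {
    "description": "Baseline hybrid search pipeline: L2 norm, arithmetic mean, 0.4/0.6 weights",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "l2"},
                "combination": {
                    "technique": "arithmetic_mean",
                    # Weights follow the order of the sub-queries in the hybrid query,
                    # here assumed to be keyword (0.4) first, neural (0.6) second.
                    "parameters": {"weights": [0.4, 0.6]},
                },
            }
        }
    ],
}
requests.put("http://localhost:9200/_search/pipeline/hybrid-baseline", json=pipeline)
```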

All metrics of the dynamic approach are as good as or better than those of the best pipeline config resulting from the global hybrid search optimizer approach. The linear and the random forest model differ only minimally when looking at the aggregated metrics.

Looking at the metrics on the query level reveals different behavior for the two models. The following picture shows 25 queries that score better with the linear model:

Image

Given the gaps in the judgement data, we cannot reliably say whether the differences in metrics actually come from a difference in search result quality or just from a different number of judgements available for the result sets of the two models for each query. Manually looking at a few queries reveals that differences in DCG values typically come with differences in the number of judgements among the top 10 documents.
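To make the dependency on judgement coverage concrete, here is a minimal DCG@10 sketch in which unjudged documents contribute zero gain (the linear-gain formulation is an assumption; the issue does not show the exact variant used):

```python
import math

def dcg_at_10(ranked_doc_ids, judgements):
    """DCG@10 with linear gains; documents without a judgement count as gain 0."""
    return sum(
        judgements.get(doc_id, 0.0) / math.log2(rank + 2)  # rank is 0-based
        for rank, doc_id in enumerate(ranked_doc_ids[:10])
    )

# Two result lists of similar quality can get different DCG values simply because
# more of one list's documents happen to be judged.
```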

wrigleyDan commented 3 weeks ago

Regularization

Linear Model:

Random Forest:

RandomizedSearchCV details:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import KFold, RandomizedSearchCV

# cv and rmse_scorer are not part of the original snippet; plausible definitions:
rmse_scorer = make_scorer(
    lambda y_true, y_pred: mean_squared_error(y_true, y_pred) ** 0.5,
    greater_is_better=False,
)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    # Note: 'auto' has been removed for RandomForestRegressor in recent
    # scikit-learn versions; 'sqrt' and 'log2' remain valid choices.
    'max_features': ['auto', 'sqrt', 'log2']
}

model = RandomForestRegressor(random_state=42)
# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_dist,
    n_iter=10,  # Number of random parameter combinations to sample
    scoring=rmse_scorer,
    cv=cv,
    random_state=42,
    n_jobs=-1,
    verbose=1
)
```
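Fitting the search and reading out the winning configuration would then look roughly like this (`X_train` / `y_train` standing in for the 80% training split):

```python
random_search.fit(X_train, y_train)

print(random_search.best_params_)
best_rf = random_search.best_estimator_  # refit on the full training split by default
```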

Training the models with the best feature set identified, the best parameter set identified (random forest model), and regularization applied (linear model), we arrive at metrics very similar to those of the approach that just takes the best feature combination:

| Metric    | Baseline | Linear w/o regularization | Random Forest w/o regularization | Linear w/ regularization | Random Forest w/ regularization |
|-----------|----------|---------------------------|----------------------------------|--------------------------|---------------------------------|
| DCG       | 5.96     | 6.04                      | 6.03                             | 6.03                     | 6.02                            |
| NDCG      | 0.27     | 0.27                      | 0.27                             | 0.27                     | 0.27                            |
| Precision | 0.30     | 0.31                      | 0.31                             | 0.31                     | 0.31                            |
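The exact regularization applied to the linear model is not shown here; one plausible reading is swapping LinearRegression for a Ridge (L2-regularized) regression and tuning its strength, for example:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Hypothetical alpha grid; cv and rmse_scorer as defined for the random forest search.
ridge_search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring=rmse_scorer,
    cv=cv,
)
ridge_search.fit(X_train, y_train)
print(ridge_search.best_params_, -ridge_search.best_score_)
```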
wrigleyDan commented 3 weeks ago

Addition to Regularization

Regularization hurts when applied together with the linear regression model approach. Regularization helps when applied together with the random forest regression model approach. Follow-up: see whether this is also the case with smaller datasets (see #4).

Using smaller max_depth values in the hyperparameter tuning (which acts as regularization for the random forest) produced a different combination of parameters and features as the best one. For this parameter distribution:

```python
param_dist = {
    'n_estimators': [50, 100, 200],
    # Shallower depths (1, 2, 3) added compared to the earlier distribution.
    'max_depth': [1, 2, 3, 5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt', 'log2']
}
```

This was the best parameter set together with the feature combination that worked best according to the metrics:

Best Feature Combination: 'f_5_num_results', 'f_6_max_title_score', 'f_7_sum_title_scores', 'f_9_avg_semantic_score'
Best Parameter Set: {'n_estimators': 50, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_features': 'log2', 'max_depth': 5}

RMSE: 0.34714 (compared to 0.3468 above, a minor increase in RMSE, i.e. slightly worse)
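Retraining on that combination would look roughly as follows (`train_df` / `test_df` and the target column are assumptions; the hyperparameters are the best set above):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

best_features = [
    "f_5_num_results", "f_6_max_title_score",
    "f_7_sum_title_scores", "f_9_avg_semantic_score",
]
model = RandomForestRegressor(
    n_estimators=50, min_samples_split=5, min_samples_leaf=4,
    max_features="log2", max_depth=5, random_state=42,
)
model.fit(train_df[best_features], train_df["target"])
preds = model.predict(test_df[best_features])
rmse = mean_squared_error(test_df["target"], preds) ** 0.5
print(f"RMSE: {rmse:.5f}")
```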

wrigleyDan commented 3 weeks ago

Feature combinations

The 5 cross-validation runs of linear model trainings appear to result in more consistent RMSE scores than the random forest ones. The spread of RMSE scores is narrower:

Image

Image

Linear Model: looking at models with features of one type only

Looking at the distribution of RMSE scores within feature groups shows that models using only neural features appear to perform worst (green bar), while it is always a mix of feature types that performs best (purple and orange bars on the right). Additionally, removing the neural search result features does not appear to cost much performance: the orange and the blue bar look almost exactly the same.

Image

Definition of the legend in the top right corner:

Random Forest: looking at models with features of one type only

Training random forest models and visualizing the individual cross validation runs shows a similar picture:

Image

Looking at the means of the best feature combinations per model type (linear regression and random forest, without regularization applied):

Image

Image

The neural search result features are clearly the feature group with the least spread. This can be explained by only one feature combination being present (as we only have two different neural search result features). Averaging over the best combinations overall shows that they perform similarly. Both model types also perform similarly across the groups, with the possible exception of the query-only features, which seem to work better with random forests than with linear regression models. This leads us to the conclusion that it is ultimately more a matter of combining the best features than of choosing one model approach over the other. In the end, both approaches perform similarly well, while the random forest approach is considerably more complex and slower to train.
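A sketch of how such a per-group comparison can be summarized, assuming a results DataFrame `cv_results` with one row per cross-validation fold and feature combination, and columns `model_type`, `feature_group` and `rmse`:

```python
import pandas as pd

# Min/max/mean RMSE and spread per model type and feature group.
summary = (
    cv_results
    .groupby(["model_type", "feature_group"])["rmse"]
    .agg(["min", "max", "mean"])
    .assign(spread=lambda df: df["max"] - df["min"])
    .sort_values("mean")
)
print(summary)
```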

wrigleyDan commented 2 weeks ago

Numbers for the different runs were added as an update to the README of the repository: https://github.com/o19s/opensearch-hybrid-search-optimization