o19s / opensearch-hybrid-search-optimization

This repository is meant to optimize hybrid search settings for OpenSearch. It covers a grid search approach to identify a good parameter set and a model-based approach that dynamically identifies good settings for a query.

Iterate on model fitting and evaluation #1

Closed wrigleyDan closed 2 weeks ago

wrigleyDan commented 3 weeks ago

To try to improve model performance through the actions identified in #24.

wrigleyDan commented 3 weeks ago

Training models with an 80/20 split of 5,000 queries and trying out different feature combinations with cross-validation now shows a reduced spread across the 5 folds of the cross-validation runs:

Image

From minimum to maximum the difference does not exceed 0.01.

The best combinations now are either all features (random forest) or all but two features (all except f_2_query_length and f_4_has_special_char).
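As an illustration, here is a minimal sketch of how such a feature-combination sweep with 5-fold cross-validation could look; the DataFrame `train_df`, the target column name, and the restriction to the features named in this issue are assumptions:

```python
from itertools import combinations

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumed names: train_df holds the 80% training split, "target" the relevance label.
features = [
    "f_2_query_length", "f_4_has_special_char", "f_5_num_results",
    "f_6_max_title_score", "f_7_sum_title_scores", "f_9_avg_semantic_score",
]
cv = KFold(n_splits=5, shuffle=True, random_state=42)

spread = {}
for k in range(1, len(features) + 1):
    for combo in combinations(features, k):
        rmse = -cross_val_score(
            LinearRegression(),
            train_df[list(combo)], train_df["target"],
            scoring="neg_root_mean_squared_error", cv=cv,
        )
        # Track min/max/mean RMSE across the 5 folds for each combination.
        spread[combo] = (rmse.min(), rmse.max(), rmse.mean())

best = min(spread, key=lambda c: spread[c][2])
print(best, spread[best])
```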

Comparing the metrics on the 1,000 test queries for the dynamic, model-driven approach and the global optimization approach:

| Metric    | Baseline | Linear | Random Forest |
|-----------|----------|--------|---------------|
| DCG       | 5.96     | 6.04   | 6.03          |
| NDCG      | 0.27     | 0.27   | 0.27          |
| Precision | 0.30     | 0.31   | 0.31          |

The baseline here was the search pipeline configuration using L2 normalization, the arithmetic mean as the combination technique, a neural search weight ("neuralness") of 0.6, and a keyword search weight of 0.4.
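For reference, a sketch of what such a baseline pipeline definition looks like in OpenSearch; the host and pipeline name are placeholders, and the processor settings mirror the configuration described above:

```python
import requests

pipeline = {
    "description": "Baseline hybrid search pipeline: L2 norm, arithmetic mean, 0.4/0.6 weights",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "l2"},
                "combination": {
                    "technique": "arithmetic_mean",
                    # Weights follow the order of the sub-queries in the hybrid query,
                    # here assumed to be keyword (0.4) first, neural (0.6) second.
                    "parameters": {"weights": [0.4, 0.6]},
                },
            }
        }
    ],
}
requests.put("http://localhost:9200/_search/pipeline/hybrid-baseline", json=pipeline)
```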

All metrics of the dynamic approach are as good as or better than those of the best pipeline config resulting from the global hybrid search optimizer approach. The linear and the random forest model differ only minimally when looking at the aggregated metrics.

Looking at the metrics on the query level reveals different behavior for the two models. The following picture shows 25 queries that score better with the linear model:

Image

Given the gaps in the judgement data, we cannot reliably say whether the differences in metrics actually come from a difference in search result quality or just from a different number of judgements available for the result sets of the two models for each query. Manually looking at a few queries reveals that differences in DCG values typically come with differences in the number of judgements among the top 10 documents.
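To make the dependency on judgement coverage concrete, here is a minimal DCG@10 sketch in which unjudged documents contribute zero gain (the linear-gain formulation is an assumption; the issue does not show the exact variant used):

```python
import math

def dcg_at_10(ranked_doc_ids, judgements):
    """DCG@10 with linear gains; documents without a judgement count as gain 0."""
    return sum(
        judgements.get(doc_id, 0.0) / math.log2(rank + 2)  # rank is 0-based
        for rank, doc_id in enumerate(ranked_doc_ids[:10])
    )

# Two result lists of similar quality can get different DCG values simply because
# more of one list's documents happen to be judged.
```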

wrigleyDan commented 3 weeks ago

Regularization

Linear Model:

Random Forest:

RandomizedSearchCV details:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import KFold, RandomizedSearchCV

# cv and rmse_scorer are not part of the original snippet; plausible definitions:
rmse_scorer = make_scorer(
    lambda y_true, y_pred: mean_squared_error(y_true, y_pred) ** 0.5,
    greater_is_better=False,
)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    # Note: 'auto' has been removed for RandomForestRegressor in recent
    # scikit-learn versions; 'sqrt' and 'log2' remain valid choices.
    'max_features': ['auto', 'sqrt', 'log2']
}

model = RandomForestRegressor(random_state=42)
# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_dist,
    n_iter=10,  # Number of random parameter combinations to sample
    scoring=rmse_scorer,
    cv=cv,
    random_state=42,
    n_jobs=-1,
    verbose=1
)
```
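Fitting the search and reading out the winning configuration would then look roughly like this (`X_train` / `y_train` standing in for the 80% training split):

```python
random_search.fit(X_train, y_train)

print(random_search.best_params_)
best_rf = random_search.best_estimator_  # refit on the full training split by default
```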

Training the models with the best feature set identified, the best parameter set identified (random forest model), and regularization applied (linear model), we arrive at metrics very similar to those of the approach that just takes the best feature combination:

| Metric    | Baseline | Linear w/o regularization | Random Forest w/o regularization | Linear w/ regularization | Random Forest w/ regularization |
|-----------|----------|---------------------------|----------------------------------|--------------------------|---------------------------------|
| DCG       | 5.96     | 6.04                      | 6.03                             | 6.03                     | 6.02                            |
| NDCG      | 0.27     | 0.27                      | 0.27                             | 0.27                     | 0.27                            |
| Precision | 0.30     | 0.31                      | 0.31                             | 0.31                     | 0.31                            |
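The exact regularization applied to the linear model is not shown here; one plausible reading is swapping LinearRegression for a Ridge (L2-regularized) regression and tuning its strength, for example:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Hypothetical alpha grid; cv and rmse_scorer as defined for the random forest search.
ridge_search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring=rmse_scorer,
    cv=cv,
)
ridge_search.fit(X_train, y_train)
print(ridge_search.best_params_, -ridge_search.best_score_)
```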
wrigleyDan commented 3 weeks ago

Addition to Regularization

Regularization hurts when applied together with the linear regression model approach. Regularization helps when applied together with the random forest regression model approach. Follow-up: see whether this is also the case with smaller datasets (see #4).

Using smaller max_depth values in the hyperparameter tuning (which acts as regularization for the random forest) produced a different combination of parameters and features as the best one. For this parameter distribution:

```python
param_dist = {
    'n_estimators': [50, 100, 200],
    # Shallower depths (1, 2, 3) added compared to the earlier distribution.
    'max_depth': [1, 2, 3, 5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt', 'log2']
}
```

This was the best parameter set together with the feature combination that worked best according to the metrics:

Best Feature Combination: 'f_5_num_results', 'f_6_max_title_score', 'f_7_sum_title_scores', 'f_9_avg_semantic_score'
Best Parameter Set: {'n_estimators': 50, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_features': 'log2', 'max_depth': 5}

RMSE: 0.34714 (compared to 0.3468 above, a minor increase in RMSE, i.e. slightly worse)
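Retraining on that combination would look roughly as follows (`train_df` / `test_df` and the target column are assumptions; the hyperparameters are the best set above):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

best_features = [
    "f_5_num_results", "f_6_max_title_score",
    "f_7_sum_title_scores", "f_9_avg_semantic_score",
]
model = RandomForestRegressor(
    n_estimators=50, min_samples_split=5, min_samples_leaf=4,
    max_features="log2", max_depth=5, random_state=42,
)
model.fit(train_df[best_features], train_df["target"])
preds = model.predict(test_df[best_features])
rmse = mean_squared_error(test_df["target"], preds) ** 0.5
print(f"RMSE: {rmse:.5f}")
```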

wrigleyDan commented 3 weeks ago

Feature combinations

The 5 cross-validation runs of linear model trainings appear to result in more consistent RMSE scores than the random forest ones. The spread of RMSE scores is narrower:

Image

Image

Linear Model: looking at models with features of one type only

Looking at the distribution of RMSE scores within feature groups shows that models using only neural features appear to perform worst (green bar), while it is always a mix of feature types that performs best (purple and orange bars on the right). Additionally, removing the neural search result features does not appear to cost much performance: the orange and the blue bar look almost exactly the same.

Image

Definition of the legend in the top right corner:

Random Forest: looking at models with features of one type only

Training random forest models and visualizing the individual cross validation runs shows a similar picture:

Image

Looking at the means of the best feature combinations per model type (linear regression and random forest, without regularization applied):

Image

Image

The neural search result features are clearly the feature group with the least spread. This can be explained by only one feature combination being present (as we only have two different neural search result features). Averaging over the best combinations overall shows that they perform similarly. Both model types also perform similarly across the groups, with the possible exception of the query-only features, which seem to work better with random forests than with linear regression models. This leads us to the conclusion that it is ultimately more a matter of combining the best features than of choosing one model approach over the other. In the end, both approaches perform similarly well, while the random forest approach is considerably more complex and slower to train.
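A sketch of how such a per-group comparison can be summarized, assuming a results DataFrame `cv_results` with one row per cross-validation fold and feature combination, and columns `model_type`, `feature_group` and `rmse`:

```python
import pandas as pd

# Min/max/mean RMSE and spread per model type and feature group.
summary = (
    cv_results
    .groupby(["model_type", "feature_group"])["rmse"]
    .agg(["min", "max", "mean"])
    .assign(spread=lambda df: df["max"] - df["min"])
    .sort_values("mean")
)
print(summary)
```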

wrigleyDan commented 2 weeks ago

Numbers for the different runs were added as an update to the README of the repository: https://github.com/o19s/opensearch-hybrid-search-optimization