qdrant / vector-db-benchmark

Framework for benchmarking vector search engines
https://qdrant.tech/benchmarks/
Apache License 2.0

Need a faster way to visualize the data #104

Open KShivendu opened 4 months ago

KShivendu commented 4 months ago

We have https://github.com/qdrant/vector-db-benchmark/blob/master/scripts/process-benchmarks.ipynb but it only prepares the data.

Web-based interactive graphs would be nice. Plotly or the Dash framework could be used for this.

Please use benchmarks.js as a reference. The filterBestPoints logic is important for avoiding clutter in the graph.

It should look like qdrant.tech/benchmarks
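To make the "best points" idea concrete, here is a minimal Python sketch of one way such a filter could work. It is an assumption about the intent, not a port of filterBestPoints from benchmarks.js: the function name `filter_best_points`, the `(x, y)` tuple layout, and the sample numbers are all hypothetical. The idea is to walk from the highest x (for example precision) downward and keep a point only when its y (for example RPS) strictly improves on the last kept point.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]  # (x, y), e.g. (precision, rps); layout is an assumption


def filter_best_points(points: Iterable[Point], lower_is_better: bool = False) -> List[Point]:
    """Keep only points whose y improves as x decreases (a simple frontier filter)."""
    # Sort by x descending; on equal x, put the better y first.
    ordered = sorted(points, key=lambda p: (-p[0], p[1] if lower_is_better else -p[1]))
    best: List[Point] = []
    for x, y in ordered:
        if not best:
            best.append((x, y))
            continue
        last_y = best[-1][1]
        improved = y < last_y if lower_is_better else y > last_y
        if improved:
            best.append((x, y))
    return best


# Hypothetical sample data: (precision, rps) pairs for a single engine.
sample = [(0.99, 120.0), (0.95, 300.0), (0.97, 100.0), (0.90, 280.0), (0.80, 900.0)]
frontier = filter_best_points(sample)
print(frontier)
```

Points such as (0.97, 100.0) are dropped because another configuration is both more precise and faster, which is what keeps the plot uncluttered.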

aprabhak2 commented 2 months ago

@KShivendu, I was trying to run this benchmark for qdrant-rps-m16-ef128-glove-100-angular and have the JSON files below in the results folder. When I run the ipynb notebook, cell 17 gives the following error. Any help would be appreciated.

(vector-db-bench) [aprabh2]$ ls results
qdrant-rps-m-16-ef-128-glove-100-angular-search-0-2024-04-15-15-22-45.json
qdrant-rps-m-16-ef-128-glove-100-angular-search-1-2024-04-15-15-23-04.json
qdrant-rps-m-16-ef-128-glove-100-angular-search-2-2024-04-15-15-23-25.json
qdrant-rps-m-16-ef-128-glove-100-angular-search-3-2024-04-15-15-23-51.json
qdrant-rps-m-16-ef-128-glove-100-angular-search-4-2024-04-15-15-24-08.json
qdrant-rps-m-16-ef-128-glove-100-angular-search-5-2024-04-15-15-24-25.json
qdrant-rps-m-16-ef-128-glove-100-angular-search-6-2024-04-15-15-24-41.json
qdrant-rps-m-16-ef-128-glove-100-angular-search-7-2024-04-15-15-24-58.json
qdrant-rps-m-16-ef-128-glove-100-angular-upload-2024-04-15-15-22-26.json

Cell 17:

_search = search_df.reset_index()
_upload = upload_df.reset_index()

joined_df = _search.merge(_upload, on=["engine", "m", "ef", "dataset"], how="left", suffixes=("_search", "_upload"))
print(len(joined_df))
joined_df

ERROR:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_1302491/1721113676.py in ?()
----> 1 _search = search_df.reset_index()
      2 _upload = upload_df.reset_index()
      3 
      4 joined_df = _search.merge(_upload, on=["engine", "m", "ef", "dataset"], how="left", suffixes=("_search", "_upload"))

/fastdata/01/aprabh2/anaconda3/envs/vector-db-bench/lib/python3.11/site-packages/pandas/util/_decorators.py in ?(*args, **kwargs)
    307                     msg.format(arguments=arguments),
    308                     FutureWarning,
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)

/fastdata/01/aprabh2/anaconda3/envs/vector-db-bench/lib/python3.11/site-packages/pandas/core/frame.py in ?(self, level, drop, inplace, col_level, col_fill)
   5844                     level_values = algorithms.take(
   5845                         level_values, lab, allow_fill=True, fill_value=lev._na_value
   5846                     )
   5847 
-> 5848                 new_obj.insert(0, name, level_values)
   5849 
   5850         new_obj.index = new_index
   5851         if not inplace:

/fastdata/01/aprabh2/anaconda3/envs/vector-db-bench/lib/python3.11/site-packages/pandas/core/frame.py in ?(self, loc, column, value, allow_duplicates)
   4439                 "'self.flags.allows_duplicate_labels' is False."
   4440             )
   4441         if not allow_duplicates and column in self.columns:
   4442             # Should this be a different kind of error??
-> 4443             raise ValueError(f"cannot insert {column}, already exists")
   4444         if not isinstance(loc, int):
   4445             raise TypeError("loc must be int")
   4446 

ValueError: cannot insert dataset, already exists

KShivendu commented 2 months ago

@aprabhak2 We just merged https://github.com/qdrant/vector-db-benchmark/pull/125

It should be fixed now. Please try again and let us know if you run into any issues.
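For anyone who hits the same ValueError elsewhere: pandas raises it when `reset_index()` tries to turn an index level into a column whose name is already taken by an existing column. The sketch below reproduces the message with a tiny hypothetical stand-in for `search_df` and shows two generic workarounds. It is not a description of what the merged PR changed, and the frame contents are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for search_df: "dataset" is both the index name and a column.
search_df = pd.DataFrame(
    {"dataset": ["glove-100-angular"], "rps": [1234.5]},
    index=pd.Index(["glove-100-angular"], name="dataset"),
)

# reset_index() wants to insert the index as a "dataset" column, which already exists.
try:
    search_df.reset_index()
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Workaround A: discard the index instead of inserting it as a column.
fixed_a = search_df.reset_index(drop=True)

# Workaround B: drop the duplicate column first, then let the index become the column.
fixed_b = search_df.drop(columns=["dataset"]).reset_index()

print(list(fixed_a.columns), list(fixed_b.columns))
```

Which workaround is appropriate depends on whether the index or the column holds the values you want to keep.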

aprabhak2 commented 2 months ago

First attempt at adding a plot. This can be added as a new cell at the end of the notebook: https://github.com/qdrant/vector-db-benchmark/blob/master/scripts/process-benchmarks.ipynb

import json
import matplotlib.pyplot as plt

# Load the processed benchmark results (a list of result dicts).
with open('results.json') as json_data:
    all_data = json.load(json_data)

xaxis = "mean_precisions"
yaxis = "rps"
dataset_name = "glove-100-angular"
parallel = 100.0
lower_is_better = False

# Group (x, y) points by engine, keeping only the selected dataset and parallelism.
engine_name_to_xy = {}
for curr_data in all_data:
    if curr_data['dataset_name'] != dataset_name or curr_data['parallel'] != parallel:
        continue
    engine_name_to_xy.setdefault(curr_data['engine_name'], []).append((curr_data[xaxis], curr_data[yaxis]))

def check_better(x, y, lower_is_better):
    # True when x is better than y; a tie counts as better only when higher is better.
    return x < y if lower_is_better else x >= y

for engine_name, curr_xy_pts in engine_name_to_xy.items():
    # Walk from the highest x to the lowest and keep a point only if
    # check_better says its y beats the last kept y.
    curr_xy_pts.sort(key=lambda tup: tup[0], reverse=True)
    curr_x_pts = []
    curr_y_pts = []
    for idx, (x, y) in enumerate(curr_xy_pts):
        if idx == 0 or check_better(y, curr_y_pts[-1], lower_is_better):
            curr_x_pts.append(x)
            curr_y_pts.append(y)
    plt.plot(curr_x_pts, curr_y_pts, label=engine_name, marker='o')

plt.legend(loc="upper right")
plt.xlabel(xaxis)
plt.ylabel(yaxis)
plt.show()

KShivendu commented 1 month ago

@filipecosta90 would you be interested in picking this up? :)

If so, it would be nice to do it with Plotly so that the graphs are interactive.
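As a rough starting point, an interactive version could look like the sketch below. It assumes the per-engine points have already been reduced to the ones worth plotting, and the engine names, sample values, and helper names (`to_traces`, `build_figure`) are hypothetical. Plotly is imported lazily inside `build_figure` so the data preparation can run without it installed.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (precision, rps); layout is an assumption


def to_traces(points_by_engine: Dict[str, List[Point]]) -> List[dict]:
    """Convert per-engine (x, y) points into plain dicts, one per line on the chart."""
    traces = []
    for engine, pts in sorted(points_by_engine.items()):
        ordered = sorted(pts)  # ascending x so each line is drawn left to right
        traces.append({
            "name": engine,
            "x": [p[0] for p in ordered],
            "y": [p[1] for p in ordered],
        })
    return traces


def build_figure(traces: List[dict], x_title: str = "mean_precisions", y_title: str = "rps"):
    """Build a Plotly figure with one hoverable line per engine."""
    import plotly.graph_objects as go  # lazy import; needs `pip install plotly`

    fig = go.Figure()
    for t in traces:
        fig.add_trace(go.Scatter(x=t["x"], y=t["y"], name=t["name"], mode="lines+markers"))
    fig.update_layout(xaxis_title=x_title, yaxis_title=y_title, hovermode="closest")
    return fig


# Hypothetical sample data for two engines.
sample = {
    "engine-a": [(0.99, 120.0), (0.95, 300.0)],
    "engine-b": [(0.98, 90.0), (0.90, 450.0)],
}
traces = to_traces(sample)
print(traces)

# In a notebook: build_figure(traces).show()
# As a standalone page: build_figure(traces).write_html("benchmarks.html")
```

Writing the figure to HTML gives a self-contained page with hover tooltips and legend toggling, which is close to the interaction on the published benchmarks page.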

filipecosta90 commented 1 month ago

Sure, let me pick this one up. I should be able to devote some time to it at the end of the week.