mikemccand / luceneutil

Various utility scripts for running Lucene performance tests
Apache License 2.0

Create a simple vector inspector tool #298

Open mikemccand opened 2 months ago

mikemccand commented 2 months ago

Too often, when generating .vec files from Cohere for benchmarking, I struggled to tell whether the written files were actually "correct".

E.g. early attempts were writing float64 instead of float32 and, horribly, if you run with a float64-encoded .vec file nothing really "goes wrong", except you get weird/bad recall. Each float64 is interpreted as two (strange) adjacent float32 values.
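To make the failure mode concrete, here is a minimal sketch (not part of luceneutil) that writes a few float64 values to bytes and reads them back as float32. The little-endian byte order and the helper name `reinterpret_float64_as_float32` are assumptions made for illustration.

```python
import numpy as np

def reinterpret_float64_as_float32(values):
    """Serialize values as little-endian float64, then read the bytes as float32.

    Hypothetical helper that mimics pointing a float32 reader at a
    float64-encoded file. Little-endian layout is an assumption here.
    """
    raw = np.asarray(values, dtype="<f8").tobytes()
    return np.frombuffer(raw, dtype="<f4")

vec = np.array([0.1, -0.25, 0.5], dtype=np.float64)
misread = reinterpret_float64_as_float32(vec)

# Twice as many values come back, and none of them match the originals.
print(len(vec), "float64 values read back as", len(misread), "float32 values")
print(misread)
```

Nothing raises an error here, which is why the mistake only shows up later as bad recall.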

It'd be nice to have a tool that could just give a bit of transparency about a .vec file, e.g. if its size in bytes isn't a multiple of the dimension times 4 (the size of a float32), something is wrong. Or if there are NaNs, something is wrong. Or if the vectors are not normalized to the unit sphere when you expected them to be, something is wrong.
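A rough sketch of those three checks might look like the following. It assumes the .vec file is a flat sequence of little-endian float32 values with no header; the function name `check_vec_file` and the `norm_tolerance` parameter are hypothetical, not an existing luceneutil API.

```python
import os
import numpy as np

def check_vec_file(path, dim, norm_tolerance=1e-3):
    """Return a list of human-readable problems found in a raw float32 .vec file.

    Assumes a headerless file of little-endian float32 values, dim per vector.
    """
    problems = []

    # Check 1: the file size must be a whole number of vectors.
    size = os.path.getsize(path)
    bytes_per_vector = dim * 4
    if size % bytes_per_vector != 0:
        problems.append(
            f"file size {size} is not a multiple of {bytes_per_vector} bytes per vector"
        )

    flat = np.fromfile(path, dtype="<f4")
    num_vectors = len(flat) // dim
    # Ignore any trailing partial vector so the reshape always succeeds.
    vectors = flat[: num_vectors * dim].reshape(num_vectors, dim)

    # Check 2: NaN values.
    nan_count = int(np.isnan(vectors).sum())
    if nan_count:
        problems.append(f"{nan_count} NaN values")

    # Check 3: vectors whose L2 norm is not close to 1.
    norms = np.linalg.norm(vectors, axis=1)
    off_norm = int((np.abs(norms - 1.0) > norm_tolerance).sum())
    if off_norm:
        problems.append(f"{off_norm} of {num_vectors} vectors are not unit length")

    return problems
```

Calling `check_vec_file(path, dim)` with the expected dimension returns an empty list when all three checks pass. The unit-norm check only makes sense when you expected normalized vectors, so a real tool would probably make it optional.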

Maybe the tool could also print out the actual float values for a few vectors and we might use our human eyes to look for any such "anomalies" ...
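For the eyeball check, something along these lines could work, again assuming a headerless file of little-endian float32 values. The names `read_first_vectors` and `print_preview` are made up for this sketch.

```python
import numpy as np

def read_first_vectors(path, dim, count=3):
    """Read up to `count` vectors of `dim` float32 values from the start of the file."""
    flat = np.fromfile(path, dtype="<f4", count=dim * count)
    num_vectors = len(flat) // dim
    # Drop a trailing partial vector, if any, before reshaping.
    return flat[: num_vectors * dim].reshape(num_vectors, dim)

def print_preview(path, dim, count=3, values_per_vector=8):
    """Print the leading values of the first few vectors for visual inspection."""
    for i, vector in enumerate(read_first_vectors(path, dim, count)):
        head = " ".join(f"{value:.6f}" for value in vector[:values_per_vector])
        print(f"vector {i}: {head} ...")
```

Only the first `dim * count` values are read, so this stays cheap even for very large files.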

mikemccand commented 2 months ago

The tool could also report some aggregate stats, like per-dimension variance, or whether all/some dimensions have negative values, etc.
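As a sketch of what those aggregate stats could look like, assuming the vectors are already loaded as a 2-D NumPy array of shape `(num_vectors, dim)` (the function name `per_dimension_stats` and the returned keys are hypothetical):

```python
import numpy as np

def per_dimension_stats(vectors):
    """Compute simple per-dimension statistics for a (num_vectors, dim) array."""
    vectors = np.asarray(vectors, dtype=np.float32)
    negative = vectors < 0
    return {
        # Variance of each dimension across all vectors.
        "variance": vectors.var(axis=0),
        # True for dimensions where at least one vector has a negative value.
        "any_negative": negative.any(axis=0),
        # True for dimensions where every vector has a negative value.
        "all_negative": negative.all(axis=0),
    }

stats = per_dimension_stats([[1.0, -1.0], [3.0, -1.0]])
print("variance per dimension:", stats["variance"])
print("dimensions with any negative value:", int(stats["any_negative"].sum()))
```

A dimension with zero variance, or one that is never negative when the others are, is the kind of anomaly this is meant to surface.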

msokolov commented 2 months ago

Capturing a tiny tool I have been using for posterity:

import sys
import numpy as np

def calculate_statistics(file):
    # Read the whole file as one flat array of float32 values;
    # vector boundaries are ignored, so the stats are over all values.
    np_array = np.fromfile(file, dtype=np.float32)
    percentiles = [1, 10, 50, 90, 99, 100]
    for percentile in percentiles:
        print(percentile, "Percentile = ", np.percentile(np_array, percentile))
    print("average: " + str(np.average(np_array)))
    print("stddev: " + str(np.std(np_array)))
    print("min .. max: " + str(np.min(np_array)) + " .. " + str(np.max(np_array)))

# The first command-line argument is the path to the .vec file.
with open(sys.argv[1], "rb") as inp:
    calculate_statistics(inp)

mikemccand commented 2 months ago

Awesome! Let's start with that! I'll go merge it :) Thanks @msokolov