Open mikemccand opened 2 months ago
The tool could also report some aggregate stats, like per-dimension variance, or whether all/some dimensions have negative values, etc.
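A rough sketch of those aggregate stats, assuming a raw little-endian `float32` layout and a known dimension count (the `per_dimension_stats` name and `dim` parameter are made up here, not part of any existing tool):

```python
import os
import tempfile

import numpy as np


def per_dimension_stats(path, dim):
    # Interpret the flat float32 file as a (num_vectors, dim) matrix.
    vecs = np.fromfile(path, dtype=np.float32).reshape(-1, dim)
    variance = vecs.var(axis=0)
    print("per-dimension variance: min", variance.min(), "max", variance.max())
    # Count dimensions that contain at least one negative value.
    neg_dims = int((vecs < 0).any(axis=0).sum())
    print(f"{neg_dims} of {dim} dimensions have negative values")
    return variance, neg_dims


# Demo on synthetic data: 3 vectors of dim 4; columns 0 and 3 are non-negative.
demo = np.array([[0.5, -1.0, 2.0, 0.0],
                 [0.5, 1.0, -2.0, 1.0],
                 [0.5, 0.0, 0.0, 2.0]], dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "demo.vec")
demo.tofile(path)
variance, neg_dims = per_dimension_stats(path, dim=4)
```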
Capturing a tiny tool I have been using for posterity:
```python
import sys

import numpy as np


def calculate_statistics(file):
    np_array = np.fromfile(file, dtype=np.float32)
    percentiles = [1, 10, 50, 90, 99, 100]
    for percentile in percentiles:
        print(percentile, "Percentile = ", np.percentile(np_array, percentile))
    print("average: " + str(np.average(np_array)))
    print("stddev: " + str(np.std(np_array)))
    print("min .. max: " + str(np.min(np_array)) + " .. " + str(np.max(np_array)))


with open(sys.argv[1], "rb") as inp:
    calculate_statistics(inp)
```
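For reference, one way to exercise this kind of tool on a synthetic file (the values and temp path here are just for illustration): write known `float32` values in the same raw, headerless layout, then read them back and spot-check the stats:

```python
import os
import tempfile

import numpy as np

# Write 101 known float32 values, evenly spaced over [-1, 1], as raw bytes.
path = os.path.join(tempfile.mkdtemp(), "sample.vec")
np.linspace(-1.0, 1.0, 101, dtype=np.float32).tofile(path)

# Read them back the same way the tool does and spot-check the stats.
with open(path, "rb") as inp:
    arr = np.fromfile(inp, dtype=np.float32)

print("count:", arr.size)                 # 101
print("median:", np.percentile(arr, 50))  # ~0.0
print("min .. max:", arr.min(), "..", arr.max())
```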
Awesome! Let's start with that! I'll go merge it :) Thanks @msokolov
Too often when trying to generate `.vec` files for benchmarking from Cohere I struggled with whether the written files were actually "correct". E.g. early attempts were writing `float64` instead of `float32` and, horribly, if you run with a `float64` encoded `.vec` file nothing really "goes wrong", except you get weird/bad recall. Each `float64` is interpreted as two (strange) adjacent `float32`.

It'd be nice to have a tool that could just give a bit of transparency about a `.vec` file, e.g. if its size doesn't evenly divide by the dimensions, something is wrong. Or if there are NaNs, something is wrong. Or if the vectors are not normalized to the unit sphere when you expected them to be, something is wrong.

Maybe the tool could also print out the actual float values for a few vectors and we might use our human eyes to look for any such "anomalies" ...