Open cleong110 opened 1 day ago
I previously implemented this by saving embeddings to a pgvector database and then doing an ORDER BY. https://github.com/pgvector/pgvector-python?tab=readme-ov-file#peewee.
There are two major challenges to address in using my implementation for a metric.
The second is solvable by simply loading the .npy files from step 1 directly and then calculating cosine distances.
The first I'm not sure what to do about, other than updating the setup scripts and README files.
https://github.com/cleong110/pose-evaluation/tree/signclip_metric has a basic implementation based on loading .npy files. It can calculate about 5k scores per second.
I have 83,116 .npy files saved off per SignCLIP model. math.comb(83116, 2) shows there are about 3.45 billion pairs, so at ~5k scores per second that's roughly 190 hours. I might want to fix a few things. Most notably:
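The load-the-.npy-files approach could be sketched roughly like this (the function names and the assumption that each file holds one (D,) embedding are illustrative, not the actual branch's code):

```python
import numpy as np

def load_embeddings(paths):
    """Load saved embeddings (.npy files, each assumed shape (D,)) into one (N, D) array."""
    return np.stack([np.load(p) for p in paths])

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance = 1 - cosine similarity between two 1-D vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Scoring every pair by calling cosine_distance in a Python loop is what caps throughput at a few thousand scores per second.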
I looked at the implementation. I think you should use a cache, similar to what I have for the CLIP metric; it will be faster to retrieve the embedding from memory than from disk every single time.
Also, I would argue that you should not implement score but instead implement score_all (again, same as my implementation). That way, you can parallelize all of the scoring with one large matrix multiplication (and it can run on the GPU, so it will be much faster).
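A score_all along these lines could be sketched as follows (the function name follows the suggestion above; the body is an assumed implementation, not the actual CLIP-metric code):

```python
import torch
import torch.nn.functional as F

def score_all(hyp_features: torch.Tensor, ref_features: torch.Tensor) -> torch.Tensor:
    """All-pairs cosine similarity via one matrix multiply.

    hyp_features: (N, D), ref_features: (M, D); returns an (N, M) matrix.
    Normalizing the rows first turns the matmul into cosine similarity,
    and the whole computation runs on GPU if the tensors live there.
    """
    hyp = F.normalize(hyp_features, dim=-1)
    ref = F.normalize(ref_features, dim=-1)
    return hyp @ ref.T
```

With cached embeddings already stacked into tensors, one call replaces N×M individual score calls.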
SignCLIP embeddings have D=768 dimensions. To calculate one cosine similarity you need D multiplications for the dot product, plus the two norm computations and one division, so roughly D FLOPs per distance to first order.
A modern GPU like the RTX 4090 supports (at 100% efficiency) 82.58 TFLOPS, which is 82,580,000,000,000 FLOPS, meaning it can do about 107,526,041,667 distance calculations per second.
If we have a dataset of 100,000 vectors and we want to calculate the any-to-any distances, we have 10,000,000,000 calculations to perform, which means that if written optimally it should finish in about 0.1 seconds.
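The back-of-envelope arithmetic above can be checked in a few lines (assuming the quoted peak-FLOPS figure and ~D FLOPs per distance):

```python
# Back-of-envelope check of the numbers above (100% efficiency assumed).
flops = 82.58e12           # quoted RTX 4090 peak throughput
D = 768                    # SignCLIP embedding dimension
dists_per_sec = flops / D  # distance calculations per second
pairs = 100_000 ** 2       # any-to-any over 100k vectors = 1e10 pairs
seconds = pairs / dists_per_sec
print(f"{dists_per_sec:.3e} distances/s, {seconds:.2f} s total")
# prints: 1.075e+11 distances/s, 0.09 s total
```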
And guess what! PyTorch has it built in:
# Broadcast (N, 1, D) against (1, M, D) to get the full N x M cosine similarity matrix
similarities = torch.nn.functional.cosine_similarity(
    hyp_features.unsqueeze(1), ref_features.unsqueeze(0), dim=-1
)
Or so says ChatGPT: https://chatgpt.com/share/67368683-4bb8-800e-8dd6-c37971a1b87c
Implement a metric for SignCLIP embedding distances.
As in: take two poses and a SignCLIP model, embed both, and calculate the cosine similarity.
Reference implementation with regular CLIP
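A minimal sketch of such a metric, with embed_pose standing in as a hypothetical placeholder for whatever function wraps the SignCLIP model (not a name from the reference implementation):

```python
import numpy as np

def signclip_metric(pose_a, pose_b, embed_pose) -> float:
    """Sketch: embed both poses with a SignCLIP model and return cosine similarity.

    embed_pose is a hypothetical callable mapping a pose to a 1-D embedding;
    the real wrapper around the SignCLIP model would go here.
    """
    a = np.asarray(embed_pose(pose_a), dtype=np.float64)
    b = np.asarray(embed_pose(pose_b), dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```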