sign-language-processing / pose-evaluation

Automatic Evaluation for Pose Files

Implement SignCLIP Distances metric #1

Open cleong110 opened 1 day ago

cleong110 commented 1 day ago

Implement a metric for SignCLIP embedding distances.

As in, take two poses and a SignCLIP model, embed them both, calculate cosine similarity.

Reference implementation with regular CLIP
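The metric shape described above can be sketched like this. Note that `embed_pose` and `model` are hypothetical stand-ins; SignCLIP's actual inference API differs:

```python
import numpy as np

def embed_pose(pose, model):
    """Hypothetical stand-in for SignCLIP inference; returns a 1-D embedding.

    The real SignCLIP API is different; this only makes the metric shape concrete.
    """
    return model(pose)

def signclip_similarity(hyp_pose, ref_pose, model):
    """Embed both poses with the same model and return their cosine similarity."""
    h = embed_pose(hyp_pose, model)
    r = embed_pose(ref_pose, model)
    return float(np.dot(h, r) / (np.linalg.norm(h) * np.linalg.norm(r)))
```

Identical poses should score 1.0; unrelated ones should score near 0.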

cleong110 commented 1 day ago

I previously implemented this by saving embeddings to a pgvector database and then doing an Order By. https://github.com/pgvector/pgvector-python?tab=readme-ov-file#peewee.

cleong110 commented 10 hours ago

There are two major challenges to address for using my implementation for a metric.

  1. Generating the SignCLIP embeddings. Right now the process is quite manual: I have to download the SignCLIP repo and modify some code before I can run it. Once that's done, I can save off the resulting vectors as .npy files.
  2. The reliance on a pgvector database. It's cumbersome to set up a whole PostgreSQL database every time.
cleong110 commented 9 hours ago

The second one is solvable by simply loading the .npy files from step 1 directly and then calculating cosine distances.
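That pgvector-free path could look something like this, assuming each .npy file holds one embedding (possibly saved with a leading batch dimension):

```python
import numpy as np

def load_embedding(npy_path):
    """Load a saved SignCLIP embedding, squeezing a possible (1, D) shape to (D,)."""
    return np.load(npy_path).squeeze()

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, 1 for orthogonal vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```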

For the first one, I'm not sure what to do other than updating the setup scripts and README files.

cleong110 commented 5 hours ago

https://github.com/cleong110/pose-evaluation/tree/signclip_metric has a basic implementation, based on loading in .npy files. It can calculate about 5k scores per second.

I have 83116 .npy files saved off per SignCLIP model. math.comb(83116, 2) shows that there are about 3.5 billion pairs, so at 5k scores per second that's roughly 190 hours. I might want to fix a few things. Most notably:
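The back-of-envelope numbers check out:

```python
import math

n_files = 83116
pairs = math.comb(n_files, 2)      # all unordered pairs of embeddings
scores_per_second = 5_000          # measured throughput of the basic implementation
hours = pairs / scores_per_second / 3600

print(pairs)   # 3,454,093,170 -> about 3.5 billion
print(hours)   # roughly 192 hours
```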

AmitMY commented 4 hours ago

I looked at the implementation. I think you should use a cache, similar to what I have for the CLIP metric. It will be faster to retrieve the embedding from memory than from disk every single time.

Also, I would argue that you should not implement score but instead implement score_all (again, same as my implementation). That way, you can parallelize all of the scoring with one large matrix multiplication (and it can happen on the GPU, so it will be much faster).
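A sketch of that score_all idea, assuming embeddings are stacked row-wise into matrices; shown here with NumPy, though the same normalize-then-matmul works in PyTorch on a GPU:

```python
import numpy as np

def score_all(hyp_embeddings: np.ndarray, ref_embeddings: np.ndarray) -> np.ndarray:
    """All-pairs cosine similarity: (N, D) x (M, D) -> (N, M).

    Normalize each row to unit length, then one matrix multiplication
    yields every hypothesis-reference similarity at once.
    """
    hyp = hyp_embeddings / np.linalg.norm(hyp_embeddings, axis=1, keepdims=True)
    ref = ref_embeddings / np.linalg.norm(ref_embeddings, axis=1, keepdims=True)
    return hyp @ ref.T
```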

AmitMY commented 4 hours ago

Quick calculation:

SignCLIP embeddings have D=768 dimensions. To calculate one cosine similarity you need D multiplications, plus the norm multiplications and one division.

A modern GPU like the RTX 4090 supports (at 100% efficiency) 82.58 TFLOPS, which is 82,580,000,000,000 FLOPS, meaning it can do about 107,526,041,667 distance calculations per second.

If we have a dataset of 100,000 vectors and we want to calculate any-to-any distances, we have 10,000,000,000 calculations to perform, which means that, written optimally, it should finish in about 0.1 seconds.
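Redoing the arithmetic:

```python
flops = 82.58e12                  # RTX 4090 peak throughput, at 100% efficiency
flops_per_distance = 768          # roughly one multiply per embedding dimension
distances_per_second = flops / flops_per_distance   # ~1.08e11

pairs = 100_000 ** 2              # any-to-any over 100k vectors
seconds = pairs / distances_per_second
print(seconds)                    # about 0.09 s
```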

AmitMY commented 4 hours ago

And guess what! PyTorch has it built in

# Compute the full cosine similarity matrix in one call (broadcasting over all pairs)
similarities = torch.nn.functional.cosine_similarity(hyp_features[:, None, :], ref_features[None, :, :], dim=-1)

or so says chatgpt https://chatgpt.com/share/67368683-4bb8-800e-8dd6-c37971a1b87c