Closed: MinaAlmasi closed this 2 weeks ago
If it's done already, it's absolutely fine to merge. A smaller model (e.g., one of the e5 models, or any of the good old sentence-transformers models) would also have been fine, since these are just baselines. If we ever need to re-extract embeddings, we can go for that :)
Alright, I'll make sure to note that we can use a smaller model (e.g., e5) as a baseline if we rerun. It is fairly easy to update (we just need to add a "passage: " prefix to each input, which is not needed with nvidia's model; see the sketch below) #75
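For reference, a minimal sketch of that prefix change, assuming an e5 variant like intfloat/e5-large-v2 (the specific e5 model is illustrative, not decided):

```python
# Sketch: e5 models expect a "passage: " (or "query: ") prefix on each input,
# unlike nvidia/NV-Embed-v2. Model choice here is an assumption for illustration.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")

texts = ["first completion ...", "second completion ..."]  # placeholder data
embeddings = model.encode(["passage: " + t for t in texts])
print(embeddings.shape)  # (2, 1024) for e5-large-v2
```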
Embeddings
Embedded the train, val, and test completions using SentenceTransformers for one of our baselines (#75). Results are up and can be seen in the current classify README on this branch (they will be live on the main branch after merging).
I had some considerations about the choice of encoder model. Maybe you would like to comment on these before I merge, @rbroc?
Model Considerations
The model nvidia/NV-Embed-v2 was chosen for embedding the data as it was the highest-ranked model on MTEB for classification, though it might have been a little overkill (the model has 7.85B parameters). I just thought it might be the easiest choice to justify/defend in a paper.
It did give some memory issues initially, but converting to FP16 (lower precision than FP32) and running with a relatively small batch size (16) worked fine and took under 2 hours for all ~84K rows on a single NVIDIA L40 GPU (AAU).
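A rough sketch of that setup with sentence-transformers (the actual script may differ; the `trust_remote_code` flag and the FP16 cast via `model_kwargs` are assumptions based on how NV-Embed-v2 is typically loaded):

```python
# Sketch: NV-Embed-v2 in FP16 with batch size 16, as described above.
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "nvidia/NV-Embed-v2",
    trust_remote_code=True,  # NV-Embed-v2 ships custom model code (assumption)
    model_kwargs={"torch_dtype": torch.float16},  # FP16 to fit on a single L40
)

completions = ["some completion text ..."]  # placeholder for the ~84K rows
embeddings = model.encode(
    completions,
    batch_size=16,  # small batch size to avoid running out of GPU memory
    show_progress_bar=True,
)
```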