rbroc / echo

A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text
https://cc.au.dk/en/clai/current-projects/a-scalable-and-explainable-approach-to-discriminating-between-human-and-artificially-generated-text

Baselines: add results for embeddings #80

Closed: MinaAlmasi closed this issue 2 weeks ago

MinaAlmasi commented 2 weeks ago

Embeddings

Embedded the train, val, and test completions using SentenceTransformers for one of our baselines (#75). Results are up and can be seen in the current classify README on this branch (they will be live on the main branch after merging).

I had some considerations regarding the choice of encoder. Maybe you would like to comment on them before I merge, @rbroc?

Model Considerations

The model nvidia/NV-Embed-v2 was chosen for embedding the data as it was the highest-ranked model on MTEB for classification, but it might have been a little overkill (the model has 7.85B parameters). I just thought it would be the easiest choice to justify/defend in a paper.

It did give some memory issues initially, but converting to FP16 (lower precision than FP32) and running with a relatively small batch size (16) worked fine and took under 2 hours for all ~84K rows on a single NVIDIA L40 GPU (AAU).
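
For reference, the gist of the encoding setup looks roughly like this (a minimal sketch; `completions` is a placeholder for the actual data, and the exact options may differ from the script in #75):

```python
import torch
from sentence_transformers import SentenceTransformer

# Load NV-Embed-v2 in FP16 to cut peak GPU memory roughly in half
# (the model has ~7.85B parameters, so FP32 is tight on a single L40).
# trust_remote_code is required because the model ships custom modeling code.
model = SentenceTransformer(
    "nvidia/NV-Embed-v2",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.float16},
)

# Placeholder for the actual train/val/test completions.
completions = ["first completion ...", "second completion ..."]

# A small batch size (16) keeps memory usage manageable.
embeddings = model.encode(completions, batch_size=16, show_progress_bar=True)
print(embeddings.shape)  # (n_texts, embedding_dim)
```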

rbroc commented 2 weeks ago

If it's done already, it's absolutely fine to merge. A smaller model (e.g., one of the e5 models, or any of the good old sentence-transformers models) would also have been fine, since these are just baselines. If we ever need to re-extract embeddings, we can go for that :)

MinaAlmasi commented 2 weeks ago

Alright, I'll make sure to note that we can use a smaller model (e.g., e5) for the baseline if we rerun. It is fairly easy to update (we just need to prepend a "passage: " prompt to each text, which is not needed with NVIDIA's model). #75
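
For reference, the switch would roughly amount to this (a sketch; `intfloat/e5-large-v2` is just one candidate among the e5 models):

```python
from sentence_transformers import SentenceTransformer

# A smaller e5 model as a drop-in replacement for NV-Embed-v2.
model = SentenceTransformer("intfloat/e5-large-v2")

# e5 models expect a task prefix; for document-style text it is "passage: ".
completions = ["first completion ...", "second completion ..."]
prefixed = [f"passage: {text}" for text in completions]

embeddings = model.encode(prefixed, batch_size=16, show_progress_bar=True)
```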