zidsi opened this issue 7 months ago
We used Mixtral to score the content of the clusters, you can find the prompt here: https://github.com/huggingface/text-clustering/blob/7815f8b37d91b75cf160ed3f0ec8550c0b58cabb/run_pipeline.py#L12
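Roughly, the scoring step looks like this (a simplified sketch, not the exact code from run_pipeline.py; the real prompt, sampling logic, and model call are in the file linked above):

```python
# Simplified sketch of the cluster-scoring step. The prompt text, number of
# samples, and client setup here are illustrative assumptions, not the repo's code.
from huggingface_hub import InferenceClient

client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")

def score_cluster(texts, n_samples=10):
    # Only a handful of representative samples per cluster are sent to the model,
    # so even very large clusters fit within the 32k context window.
    samples = texts[:n_samples]
    prompt = (
        "Here are a few examples from a cluster of web documents:\n\n"
        + "\n---\n".join(samples)
        + "\n\nGive the cluster a short topic label and rate its "
          "educational value from 1 to 10."
    )
    return client.text_generation(prompt, max_new_tokens=200)
```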
Thank you for the kind reply. So if I understand the pipeline correctly, you let Mixtral classify/score the clusters created via embeddings (based on n representative samples per cluster, since all samples wouldn't fit in the 32k context for the large clusters that could emerge in a 100k batch!?). Doesn't that mean you are effectively mapping/classifying the embedding space (one label per identified cluster), which in turn could be used to make such predictions directly from the embeddings? If the embedding model used is multilingual, such a "distilled" classifier would lower the barrier for many low-resource languages.
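To make the idea concrete, something along these lines (just a sketch of what I mean; the embedding model name and the way labels are stored are my assumptions, not anything from this repo):

```python
# Sketch of the "distilled" classifier idea: reuse the Mixtral-assigned cluster
# labels as training targets for a small classifier that works directly on
# (multilingual) embeddings, so new documents can be scored without the LLM.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def train_distilled_classifier(texts, labels):
    # texts: the documents that were clustered
    # labels: the Mixtral score/category propagated from each document's cluster
    X = embedder.encode(texts, show_progress_bar=True)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    return clf

def predict(clf, new_texts):
    # New documents, in any language the embedder covers, get a prediction
    # directly from their embeddings, with no Mixtral call.
    return clf.predict(embedder.encode(new_texts))
```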
Is the prompt used for educational content scoring part of this repo? Did you use Mixtral to score/classify the content, or was a dedicated classifier trained?