huggingface / cosmopedia


Educational scoring prompt? #13

Open zidsi opened 5 months ago

zidsi commented 5 months ago

Is the prompt used for educational content scoring part of this repo? Did you use Mixtral to score/classify the content, or was a dedicated classifier trained?

loubnabnl commented 5 months ago

We used Mixtral to score the content of the clusters; you can find the prompt here: https://github.com/huggingface/text-clustering/blob/7815f8b37d91b75cf160ed3f0ec8550c0b58cabb/run_pipeline.py#L12
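
For context, a minimal sketch of what that cluster-scoring step could look like when driven from Python is below. The model ID, prompt wording, sample count, and generation parameters here are illustrative assumptions, not the exact settings used in the text-clustering pipeline; the actual prompt is the one in the linked file.

```python
from huggingface_hub import InferenceClient

# Illustrative only: the model ID and parameters are assumptions, not the
# exact configuration used in the cosmopedia/text-clustering pipeline.
client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")

SCORING_PROMPT = """Here are a few example documents from one cluster of web text:

{examples}

Rate the educational value of this cluster on a scale of 1 to 10 and give it
a short topic label. Answer as: <score>. <topic>"""

def score_cluster(representative_docs, n_samples=5):
    # Only a handful of representative documents are sent, since a full
    # cluster would not fit in the 32k-token context window.
    examples = "\n\n".join(doc[:1000] for doc in representative_docs[:n_samples])
    prompt = SCORING_PROMPT.format(examples=examples)
    return client.text_generation(prompt, max_new_tokens=100, temperature=0.1)
```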

zidsi commented 5 months ago

Thank you for the kind reply. So if I understand the pipeline correctly, you let Mixtral classify/score the clusters created via embeddings, based on n representative samples per cluster (since all samples wouldn't fit in the 32k context for the large clusters that could emerge in a 100k batch!?). Doesn't that mean you are effectively mapping/classifying the embedding space (for each identified cluster), which in turn could be used to make such predictions directly from the embeddings? If the embedding model used is multilingual, such a "distilled" classifier would lower the barrier for many low-resource languages. A rough sketch of what I mean is below.
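
Purely as a sketch (model and variable names are assumed, not taken from the repo): once Mixtral has assigned a score to each cluster, that score could be propagated to the documents in the cluster and used as a label to train a lightweight classifier directly on the embeddings, which could then score new documents in any language the embedding model covers.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Assumptions: a multilingual embedding model, a list of documents, and a
# parallel list of Mixtral cluster scores inherited by each document.
embedder = SentenceTransformer("intfloat/multilingual-e5-large")

def train_distilled_scorer(documents, cluster_scores):
    """Fit a small classifier that predicts the Mixtral cluster score
    directly from document embeddings."""
    X = embedder.encode(documents, normalize_embeddings=True)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, cluster_scores)
    return clf

def score_new_documents(clf, new_documents):
    # Scores new documents without running Mixtral again, for any language
    # supported by the embedding model.
    X = embedder.encode(new_documents, normalize_embeddings=True)
    return clf.predict(X)
```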