modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0

Which language model is used to compute Chinese and English PPL? #127

Closed pkugyf closed 10 months ago

pkugyf commented 11 months ago

Before Asking

Search before asking

Question

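Operator: perplexity_filter. From the source code, the language model this operator uses is downloaded from https://huggingface.co/edugp/kenlm. However, the model's introduction page only says it was trained on data such as Wikipedia; it does not say which specific model was trained, and only mentions that one use case involved a Spanish BERT model. So my question is: which model is used to compute PPL for Chinese and English?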

Additional

No response

zhijianma commented 11 months ago

Hello, and thank you for using Data-Juicer. We mainly follow the pipeline that BigScience used to process OSCAR.
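For readers wondering how a PPL value can be derived from such a model: KenLM is a toolkit for n-gram language models, so the models at the URL in the question are presumably n-gram models rather than neural networks. The sketch below is an illustration only, not Data-Juicer's confirmed implementation. It assumes a scorer that returns the log10 probability of a line (as the `score` method of the `kenlm` Python package does) and turns those scores into a document-level perplexity. `document_perplexity` and `toy_uniform_scorer` are hypothetical names introduced here; the toy scorer stands in for a real model so the example runs without any download.

```python
from typing import Callable


def document_perplexity(text: str, score_line: Callable[[str], float]) -> float:
    """Compute perplexity from per-line log10 probabilities.

    `score_line` is assumed to return the total log10 probability of a
    line, including the end-of-sentence token.
    """
    total_log10_prob = 0.0
    total_tokens = 0
    for line in text.splitlines():
        total_log10_prob += score_line(line)
        # Count whitespace-separated tokens plus one end-of-sentence token.
        total_tokens += len(line.split()) + 1
    if total_tokens == 0:
        return 0.0
    # Perplexity is the inverse geometric mean of the token probabilities.
    return 10.0 ** (-total_log10_prob / total_tokens)


def toy_uniform_scorer(line: str) -> float:
    """Hypothetical stand-in for a real model: every token has probability 0.1."""
    return -1.0 * (len(line.split()) + 1)


if __name__ == "__main__":
    print(document_perplexity("a b c\nd e", toy_uniform_scorer))
```

With a real model, `score_line` could be the `score` method of a loaded `kenlm.Model`, applied to text tokenized the same way as the training data. That tokenization step is an assumption here and is omitted from the sketch.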

github-actions[bot] commented 10 months ago

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.

github-actions[bot] commented 10 months ago

Close this stale issue.