nebuly-ai / nebuly

The user analytics platform for LLMs
https://www.nebuly.com/
Apache License 2.0

[Chatllama] Add multiple sources for generating synthetic data #221

Open diegofiori opened 1 year ago

diegofiori commented 1 year ago

Description

Currently, chatllama supports synthetic data generation only with OpenAI’s davinci-003, both for conversations and for scores.

To avoid huge costs while generating data, we should support other API models (such as the cheaper gpt-3.5-turbo), other API providers, and local models (Flan T5 seems a good candidate).
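One possible shape for this is a small backend interface plus a registry keyed by source name, so the generation scripts no longer hard-code a single model. The sketch below is only an illustration under assumptions: `TextGenerator`, `EchoGenerator`, `GENERATORS`, and `build_generator` are hypothetical names that do not exist in chatllama, and the backends are stand-ins that make no API call and load no model.

```python
from typing import Callable, Dict, Protocol


class TextGenerator(Protocol):
    """Hypothetical interface: anything that turns a prompt into a completion."""

    def generate(self, prompt: str) -> str: ...


class EchoGenerator:
    """Stand-in backend used only for illustration.

    A real backend would wrap an API client or a local model behind
    the same generate() method.
    """

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, prompt: str) -> str:
        # Tag the prompt with the backend name instead of calling a model.
        return f"[{self.name}] {prompt}"


# Hypothetical registry mapping a source name to a factory for its backend.
GENERATORS: Dict[str, Callable[[], TextGenerator]] = {
    "davinci-003": lambda: EchoGenerator("davinci-003"),
    "gpt-3.5-turbo": lambda: EchoGenerator("gpt-3.5-turbo"),
    "flan-t5": lambda: EchoGenerator("flan-t5"),
}


def build_generator(source: str) -> TextGenerator:
    """Return the backend registered under `source`, or raise ValueError."""
    try:
        return GENERATORS[source]()
    except KeyError:
        raise ValueError(
            f"Unknown source {source!r}; choose from {sorted(GENERATORS)}"
        ) from None
```

With this layout, adding a provider or a local model would mean registering one more factory, and the scripts that generate conversations and scores could select the source from a config value or a CLI argument.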

Furthermore, to generate more diverse data, it could be beneficial to support multiple prompt templates during generation.
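As a rough sketch of the template idea, each sample could draw its prompt from a pool of templates using a seeded random generator, so runs stay reproducible. Everything here is an assumption made for illustration: the template texts, the `{topic}` placeholder, and the `build_prompts` helper are hypothetical and not part of chatllama.

```python
import random
from typing import List, Sequence

# Hypothetical templates; real ones would be tuned for conversations or scores.
TEMPLATES: Sequence[str] = (
    "Write a conversation between a user and an assistant about {topic}.",
    "Simulate a chat in which a user asks an assistant for help with {topic}.",
    "Imagine a user who is curious about {topic}. Write their dialogue with an assistant.",
)


def build_prompts(
    topics: Sequence[str],
    templates: Sequence[str] = TEMPLATES,
    seed: int = 0,
) -> List[str]:
    """Pick one template at random per topic and fill in the topic."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    return [rng.choice(templates).format(topic=topic) for topic in topics]


if __name__ == "__main__":
    for prompt in build_prompts(["cooking", "tax returns", "chess openings"]):
        print(prompt)
```

Using the same seed yields the same template choices, which makes a generated dataset easier to regenerate or debug.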

TODO

pengwei-iie commented 1 year ago

Hi, did you add support for HF models in dataset generation? It seems only OpenAI’s davinci-003 is used, at line 21 of generate_rewards.py.