Currently, chatllama supports synthetic data generation only via OpenAI's text-davinci-003, both for conversations and for scores.
To avoid large costs when generating data, we should support other API models (such as the cheaper gpt-3.5-turbo), other API providers, and local models (Flan-T5 seems a good candidate).
Furthermore, to generate more diverse data, it would be beneficial to support multiple prompt templates during generation.
TODO
[ ] Add support for gpt-3.5-turbo, handled separately from the LangChain-based models.
[ ] Add a preview of the cost associated with API models (i.e. n_words / 0.75 * API_cost_per_token, since one token is roughly 0.75 words) before proceeding with the labelling.
[ ] Modify the LangChain-based script to support multiple API models and providers.
[ ] Add support for Hugging Face models to perform the generation task.
[ ] Allow the user to specify multiple templates when generating synthetic data, customisable to their needs.
[ ] Provide multiple template examples for dataset generation.
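The cost-preview item above could be sketched roughly as follows. This is a minimal illustration of the n_words / 0.75 * API_cost_per_token formula, assuming a simple words-to-tokens heuristic (1 token ≈ 0.75 words); the per-1K-token prices below are placeholder example values, not real pricing, and should be read from the provider's pricing page in a real implementation.

```python
# Hypothetical per-1K-token prices in USD, for illustration only.
ILLUSTRATIVE_PRICE_PER_1K_TOKENS = {
    "text-davinci-003": 0.02,
    "gpt-3.5-turbo": 0.002,
}


def estimate_cost(text: str, model: str) -> float:
    """Estimate the API cost in USD of sending `text` to `model`.

    Uses the heuristic that one token is roughly 0.75 words,
    i.e. n_tokens ~= n_words / 0.75.
    """
    n_words = len(text.split())
    n_tokens = n_words / 0.75
    price_per_token = ILLUSTRATIVE_PRICE_PER_1K_TOKENS[model] / 1000
    return n_tokens * price_per_token


if __name__ == "__main__":
    sample = "word " * 750  # 750 words, roughly 1000 tokens
    print(f"Estimated cost: ${estimate_cost(sample, 'gpt-3.5-turbo'):.4f}")
```

Showing this estimate (plus a confirmation prompt) before the labelling loop starts would let users bail out of an unexpectedly expensive run.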
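The multiple-template items could look something like the sketch below: a pool of prompt templates sampled at generation time to diversify the synthetic data. The template strings and the `topic` field are hypothetical placeholders, not part of the current codebase.

```python
import random
from typing import Optional

# Hypothetical example templates; in practice these would be
# user-supplied and customisable.
TEMPLATES = [
    "Generate a conversation between a user and an assistant about {topic}.",
    "Write a dialogue where a user asks an assistant for help with {topic}.",
    "Produce a chat in which the assistant explains {topic} to a curious user.",
]


def build_prompt(topic: str, rng: Optional[random.Random] = None) -> str:
    """Pick one template at random and fill in the topic."""
    rng = rng or random.Random()
    return rng.choice(TEMPLATES).format(topic=topic)


if __name__ == "__main__":
    print(build_prompt("cooking"))
```

Sampling a different template per request should reduce the repetitiveness that a single fixed prompt tends to produce in the generated dataset.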