Currently, chatllama supports synthetic data generation only via OpenAI's text-davinci-003, both for conversations and for scores.
To avoid large costs when generating data, we should support other API models (such as the cheaper gpt-3.5-turbo), other API providers, and local models (Flan-T5 seems a good candidate).
Furthermore, to generate more diverse data, it would be beneficial to support multiple prompt templates during generation.
TODO
[ ] Add support for gpt-3.5-turbo, handled separately from the LangChain-based models.
[ ] Add a preview of the cost associated with API models (i.e. n_words / 0.75 * API_cost_per_token, since one token is roughly 0.75 words) before proceeding with the labelling.
[ ] Modify the LangChain-based script to support multiple API models and providers.
[ ] Add support for Hugging Face models to perform the generation task.
[ ] Allow the user to specify multiple templates when generating synthetic data, customisable to their needs.
[ ] Provide multiple template examples for dataset generation.
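The cost-preview item above could be sketched roughly as follows. This is a minimal illustration of the n_words / 0.75 * API_cost_per_token formula, assuming a simple words-to-tokens heuristic (1 token ≈ 0.75 words); the per-1K-token prices below are placeholder example values, not real pricing, and should be read from the provider's pricing page in a real implementation.

```python
# Hypothetical per-1K-token prices in USD, for illustration only.
ILLUSTRATIVE_PRICE_PER_1K_TOKENS = {
    "text-davinci-003": 0.02,
    "gpt-3.5-turbo": 0.002,
}


def estimate_cost(text: str, model: str) -> float:
    """Estimate the API cost in USD of sending `text` to `model`.

    Uses the heuristic that one token is roughly 0.75 words,
    i.e. n_tokens ~= n_words / 0.75.
    """
    n_words = len(text.split())
    n_tokens = n_words / 0.75
    price_per_token = ILLUSTRATIVE_PRICE_PER_1K_TOKENS[model] / 1000
    return n_tokens * price_per_token


if __name__ == "__main__":
    sample = "word " * 750  # 750 words, roughly 1000 tokens
    print(f"Estimated cost: ${estimate_cost(sample, 'gpt-3.5-turbo'):.4f}")
```

Showing this estimate (plus a confirmation prompt) before the labelling loop starts would let users bail out of an unexpectedly expensive run.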
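The multiple-template items could look something like the sketch below: a pool of prompt templates sampled at generation time to diversify the synthetic data. The template strings and the `topic` field are hypothetical placeholders, not part of the current codebase.

```python
import random
from typing import Optional

# Hypothetical example templates; in practice these would be
# user-supplied and customisable.
TEMPLATES = [
    "Generate a conversation between a user and an assistant about {topic}.",
    "Write a dialogue where a user asks an assistant for help with {topic}.",
    "Produce a chat in which the assistant explains {topic} to a curious user.",
]


def build_prompt(topic: str, rng: Optional[random.Random] = None) -> str:
    """Pick one template at random and fill in the topic."""
    rng = rng or random.Random()
    return rng.choice(TEMPLATES).format(topic=topic)


if __name__ == "__main__":
    print(build_prompt("cooking"))
```

Sampling a different template per request should reduce the repetitiveness that a single fixed prompt tends to produce in the generated dataset.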