nebuly-ai / optimate

A collection of libraries to optimise AI model performance

Apache License 2.0

8.37k stars, 643 forks

Description

The first major difficulty in training an AI assistant is obtaining a dataset that is rich enough and large enough to start training at all.

ChatLLaMA needs three different types of data:

- Instruction + human label, for supervised fine-tuning of the Agent
- Text example + human evaluation (score), for training the reward model
- Unlabeled instructions, to be used in RLHF
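To make the three data types concrete, here is a minimal sketch of how one record of each might be represented. The class and field names (`SFTExample`, `RewardExample`, `RLHFPrompt`, `label`, `score`) are assumptions made for illustration, not ChatLLaMA's actual schema.

```python
from dataclasses import dataclass


@dataclass
class SFTExample:
    """Instruction plus a human-written label, for supervised fine-tuning."""
    instruction: str
    label: str


@dataclass
class RewardExample:
    """Text example plus a human evaluation score, for the reward model."""
    text: str
    score: float


@dataclass
class RLHFPrompt:
    """Unlabeled instruction, used during RLHF."""
    instruction: str


# Hypothetical records, invented for this example.
sft = SFTExample(
    instruction="Summarise: the cat sat on the mat.",
    label="A cat sat on a mat.",
)
reward = RewardExample(text="A cat sat on a mat.", score=4.0)
rlhf = RLHFPrompt(instruction="Translate 'hello' into French.")
```

Keeping the three record types separate makes it explicit which fields each training stage expects: only the first two carry human annotations.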

In the case of a chatbot, the instruction should contain:

- The prompt for the model, describing the task it should perform
- Previous chat interactions
- The user command
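As a rough illustration of how these three parts could be combined into a single instruction string, consider the sketch below. The function name `build_instruction`, the `Speaker: text` layout, and the sample conversation are all assumptions; the issue does not prescribe a format.

```python
def build_instruction(prompt, history, user_command):
    """Join the task prompt, previous chat turns, and the new user command."""
    lines = [prompt]
    # Each history entry is a (speaker, text) pair, oldest turn first.
    for speaker, text in history:
        lines.append(f"{speaker}: {text}")
    lines.append(f"User: {user_command}")
    return "\n".join(lines)


# Hypothetical conversation, invented for this example.
instruction = build_instruction(
    prompt="You are a helpful assistant that answers questions about cooking.",
    history=[
        ("User", "How long do I boil an egg?"),
        ("Assistant", "About 8 minutes."),
    ],
    user_command="And for a soft yolk?",
)
print(instruction)
```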

Given a few examples from the user, we would like to generate synthetic data that is “aligned” with the user data.

TODO

- [ ] Implement a function that analyses user data and produces the dataset needed for the Agent training.
- [ ] Implement a data generator for the reward model that takes as input the “Rules” to be used in the scoring functions. Rules must be written in a single txt-like file.
- [ ] Integrate generated datasets with available open-source datasets.
- [ ] Write unit tests for the data-generation function.
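For the rules file mentioned in the second item, one possible reading is a plain-text file with one rule per line. The sketch below loads such a file; the one-rule-per-line layout, the `#` comment convention, and the name `load_rules` are assumptions, since the issue only says the rules live in a single txt-like file.

```python
import tempfile
from pathlib import Path


def load_rules(path):
    """Read one rule per line, skipping blank lines and '#' comment lines."""
    rules = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            rules.append(line)
    return rules


# Write a small example file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    rules_path = Path(tmp) / "rules.txt"
    rules_path.write_text(
        "# scoring rules\nBe polite.\n\nAnswer the question asked.\n",
        encoding="utf-8",
    )
    rules = load_rules(rules_path)

print(rules)
```

The returned list of rule strings could then be handed to whatever scoring function the reward-model data generator ends up using.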

```python
import pandas as pd


def analyze_user_data(user_data):
    # Define the columns of the dataset
    columns = ['age', 'gender', 'location', 'interests', 'purchase_history', 'intent']

    # Initialize an empty list to store the data
    dataset = []

    # Loop through each user in the data
    for user in user_data:
        # Extract relevant information from the user data
        # (a missing key raises KeyError)
        age = user['age']
        gender = user['gender']
        location = user['location']
        interests = user['interests']
        purchase_history = user['purchase_history']
        intent = user['intent']

        # Create a new row for the dataset
        row = [age, gender, location, interests, purchase_history, intent]

        # Append the row to the dataset
        dataset.append(row)

    # Return the dataset as a pandas DataFrame
    return pd.DataFrame(dataset, columns=columns)
```

The function I wrote takes a list of user records (dictionaries) as input and turns each user's information into one row of a dataset for agent training. The columns of the dataset are defined in the `columns` variable, and an empty list called `dataset` is initialized to store the rows.

The function loops through each user in the user_data list and extracts relevant information such as age, gender, location, interests, purchase history, and intent. A new row is created for each user, and the row is appended to the dataset list.

Finally, the function returns the dataset list as a pandas DataFrame with the columns defined in the columns variable.
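A quick way to check the behaviour described above is to call the function on a couple of hand-written records. The sketch below repeats a compact version of the function so that it runs on its own; the two sample users are made up purely for illustration, and the checks are a starting point rather than a full test suite.

```python
import pandas as pd

COLUMNS = ['age', 'gender', 'location', 'interests', 'purchase_history', 'intent']


def analyze_user_data(user_data):
    # Compact restatement of the function above: one row per user, fixed column order.
    rows = [[user[column] for column in COLUMNS] for user in user_data]
    return pd.DataFrame(rows, columns=COLUMNS)


# Hypothetical sample records, invented for this example.
sample_users = [
    {'age': 31, 'gender': 'f', 'location': 'Rome', 'interests': 'ml',
     'purchase_history': 'none', 'intent': 'learn'},
    {'age': 45, 'gender': 'm', 'location': 'Oslo', 'interests': 'nlp',
     'purchase_history': 'book', 'intent': 'buy'},
]

df = analyze_user_data(sample_users)

# One row per user, one column per field.
assert df.shape == (2, 6)
assert list(df.columns) == COLUMNS
print(df)
```

Checks like these could be moved into a `unittest.TestCase` to cover the unit-test item in the TODO list.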


Generate custom dataset from few user samples #219
