nebuly-ai / optimate

A collection of libraries to optimise AI model performances
https://www.nebuly.com/
Apache License 2.0
8.37k stars 643 forks

Generate custom dataset from few user samples #219

Open diegofiori opened 1 year ago

diegofiori commented 1 year ago

Description

The first major difficulty in training an AI assistant is obtaining a dataset that is rich enough and large enough to start training at all.

ChatLLaMA needs three different types of data:

In case of a ChatBot the Instruction should contain

Given a few examples from the user, we would like to generate synthetic data that is “aligned” with the user data.
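One possible way to approach this is few-shot prompting: place a handful of user samples in a prompt and ask a language model to continue with a new sample in the same style. The sketch below only builds the prompt and does not call any model. The function name `build_few_shot_prompt` and the sample keys `user_input` and `completion` are assumptions made for illustration; they are not taken from ChatLLaMA.

```python
import random


def build_few_shot_prompt(user_samples, n_shots=3, seed=0):
    """Build a prompt asking a language model for one new, similar sample.

    `user_samples` is a list of dicts with the hypothetical keys
    "user_input" and "completion" (an assumption, not a ChatLLaMA format).
    """
    # Pick a reproducible random subset of the user samples as the "shots".
    rng = random.Random(seed)
    shots = rng.sample(user_samples, k=min(n_shots, len(user_samples)))

    lines = ["Write one new example in the same style as the examples below.", ""]
    for i, sample in enumerate(shots, start=1):
        lines.append(f"Example {i}:")
        lines.append(f"User: {sample['user_input']}")
        lines.append(f"Assistant: {sample['completion']}")
        lines.append("")

    # Leave the last example open so the model completes it.
    lines.append(f"Example {len(shots) + 1}:")
    lines.append("User:")
    return "\n".join(lines)


# Invented samples, for illustration only.
samples = [
    {"user_input": "How do I reset my password?",
     "completion": "Open Settings and choose Reset password."},
    {"user_input": "Where can I see my invoices?",
     "completion": "Invoices are listed under Billing."},
]
prompt = build_few_shot_prompt(samples, n_shots=2)
print(prompt)
```

Sampling a different subset of shots for each call (by varying `seed`) is one simple way to diversify the generated data while keeping it close to the user's examples.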

TODO

robertmalisa commented 1 year ago

```python
import pandas as pd


def analyze_user_data(user_data):
    # Define the columns of the dataset
    columns = ['age', 'gender', 'location', 'interests', 'purchase_history', 'intent']

    # Initialize an empty list to store the data
    dataset = []

    # Loop through each user in the data
    for user in user_data:
        # Extract relevant information from the user data
        age = user['age']
        gender = user['gender']
        location = user['location']
        interests = user['interests']
        purchase_history = user['purchase_history']
        intent = user['intent']

        # Create a new row for the dataset
        row = [age, gender, location, interests, purchase_history, intent]

        # Append the row to the dataset
        dataset.append(row)

    # Return the dataset as a pandas DataFrame
    return pd.DataFrame(dataset, columns=columns)
```

This function I created takes a list of user data as input and analyzes each user's information to create a dataset for agent training. The columns of the dataset are defined in the columns variable, and an empty list called dataset is initialized to store the data.

The function loops through each user in the user_data list and extracts relevant information such as age, gender, location, interests, purchase history, and intent. A new row is created for each user, and the row is appended to the dataset list.

Finally, the function returns the dataset list as a pandas DataFrame with the columns defined in the columns variable.
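As a usage sketch, pandas can also build the same kind of table directly from a list of dictionaries, which may be a shorter equivalent of the row-by-row loop. The sample records below are invented for illustration, and the assumption is that each record is a dict using the six keys named in columns. One difference worth noting: with this constructor a missing key becomes a missing value (NaN) in that cell, whereas indexing with `user['age']` raises a `KeyError`.

```python
import pandas as pd

COLUMNS = ['age', 'gender', 'location', 'interests', 'purchase_history', 'intent']

# Hypothetical sample records; the values are invented for illustration only.
user_data = [
    {'age': 34, 'gender': 'female', 'location': 'Berlin',
     'interests': ['cycling', 'cooking'], 'purchase_history': ['helmet'],
     'intent': 'buy_bike_lights'},
    {'age': 27, 'gender': 'male', 'location': 'Lisbon',
     'interests': ['gaming'], 'purchase_history': [],
     'intent': 'compare_laptops'},
]

# Build the DataFrame straight from the list of dicts,
# keeping only the expected columns, in the expected order.
df = pd.DataFrame(user_data, columns=COLUMNS)

print(df.shape)  # (2, 6)
print(df[['age', 'location', 'intent']])
```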