nerdslab-club / cl_data

Curious learner sample datasets.
GNU Affero General Public License v3.0

Curious Learner Data

This repository contains the training and testing data for the Curious Learner generative model. Because Curious Learner is a generative next-word prediction model, a question or prompt may have several equally correct answers. The data is therefore built so that each time an example is generated, one answer is picked at random from the equally correct ones, and as many examples as needed can be generated.
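The idea of drawing one answer at random from several equally correct ones can be sketched as follows. This is a minimal illustration, not the repository's actual generator: the `EXAMPLE` template, its prompt and answer strings, and the `generate_examples` helper are all hypothetical names made up for this sketch. Only the `##division(4.5,2)` call syntax is taken from this README.

```python
import random

# Hypothetical template: one prompt with several equally correct answers.
# The wording of the prompt and answers is illustrative only.
EXAMPLE = {
    "prompt": "What is 4.5 divided by 2?",
    "answers": [
        "##division(4.5,2)",
        "The answer is ##division(4.5,2)",
        "4.5 divided by 2 is ##division(4.5,2)",
    ],
}


def generate_examples(template, count, seed=None):
    """Return `count` (prompt, answer) pairs, each with a randomly chosen answer."""
    rng = random.Random(seed)  # a seeded generator keeps runs reproducible
    return [
        (template["prompt"], rng.choice(template["answers"]))
        for _ in range(count)
    ]


samples = generate_examples(EXAMPLE, count=5, seed=0)
for prompt, answer in samples:
    print(prompt, "->", answer)
```

Because every answer in the list is treated as equally correct, the number of generated examples is limited only by `count`; passing a fixed `seed` makes a generated split repeatable.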


Project Overview

Building a generative LLM is a mammoth task that needs huge computational power and engineering effort. To demonstrate the model's credibility, we therefore had to scope the data on which the model is trained and tested. We identified four tasks that scope both the training and testing data and our computational needs:

  1. Function to Function Translation Dataset
  2. Function to Natural Language Translation Dataset
  3. Natural Language to Function Translation Dataset
  4. Natural Language & Function to Natural Language & Function Translation Dataset

Initially we also considered providing a generalized LLM that could use our 98 functions, but we quickly realized the complexity and scale of that task, and after detailed analysis we discarded the idea. For that idea we had created the following datasets:

  1. Masked Token Dataset
  2. Next Token Dataset

These two datasets are not used anywhere.

Folder Structure


The IO Parser takes a string and generates a list of IO parser tuples, one per token. The first item of each tuple is the token and the second item is its category map. For example,

input = "##division(4.5,2)"
output = [
   (
     "<function MathFunctions.division at 0x117b828c0>",
     {
        "type":"function",
        "subType":"float",
        "subSubType":"execute"
     }
   ),
   (
     4.5,
     {
        "type":"float",
        "subType":"default",
        "subSubType":"param_one"
     }
   ),
   (
     2,
     {
        "type":"integer",
        "subType":"default",
        "subSubType":"param_last"
     }
   )
]
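To make the shape of that output concrete, here is a rough sketch of a parser that reproduces it for the two-argument call shown above. It is not the repository's IO Parser: the `FUNCTIONS` registry, the stand-in `division` function, the regular expression, and the `io_parse` and `parse_literal` helpers are assumptions made for illustration. It also assumes that the function's `subType` is its return type, and that the first tuple item is the function object itself (the example above shows its printed form).

```python
import re


def division(a, b):
    # Stand-in for MathFunctions.division; the real implementation is not shown here.
    return a / b


# Hypothetical registry: function name -> (callable, assumed return-type label).
FUNCTIONS = {"division": (division, "float")}

# Matches the "##name(arg,arg)" form used in the example above.
CALL_PATTERN = re.compile(r"^##(\w+)\((.*)\)$")


def parse_literal(text):
    """Convert a numeric literal to (value, type label), trying int before float."""
    text = text.strip()
    try:
        return int(text), "integer"
    except ValueError:
        return float(text), "float"


def io_parse(source):
    """Return a list of (token, category map) tuples for a two-argument call."""
    match = CALL_PATTERN.match(source.strip())
    if match is None:
        raise ValueError(f"not a function call: {source!r}")
    name, raw_args = match.groups()
    func, return_type = FUNCTIONS[name]
    args = raw_args.split(",")
    if len(args) != 2:
        # Position labels for other arities are not shown in this README.
        raise ValueError("this sketch only covers the two-argument form")
    result = [
        (func, {"type": "function", "subType": return_type, "subSubType": "execute"})
    ]
    for raw, position in zip(args, ("param_one", "param_last")):
        value, type_name = parse_literal(raw)
        result.append(
            (value, {"type": type_name, "subType": "default", "subSubType": position})
        )
    return result


parsed = io_parse("##division(4.5,2)")
for token, category_map in parsed:
    print(token, category_map)
```

The sketch deliberately rejects anything other than two arguments, since only the `param_one` and `param_last` labels appear in the example; nested calls and non-numeric arguments are likewise out of scope here.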

License

This repository is licensed under the GNU Affero General Public License v3.0. See the LICENSE.md file for details.

Testing

demonstration.ipynb contains all the test code, with assertions on the expected values. Go through it to understand the code and the data conversion.