premAI-io / benchmarks

🕹️ Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models.
MIT License

Base class for code modularity and less redundancy #160

Closed Anindyadeep closed 5 months ago

Anindyadeep commented 5 months ago

Benchmark Base class

This PR solves issue #91, specifically adding the following features:

A Base class

A base benchmark class with the minimal set of granular functions required so that it can be extended to other classes (tested with AutoAWQ for now).
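A minimal sketch of what such a base class could look like. The class name `BaseBenchmark` and the method names `load_model`, `run_inference`, and `benchmark` are assumptions for illustration, not the exact API merged in this PR:

```python
import time
from abc import ABC, abstractmethod


class BaseBenchmark(ABC):
    """Hypothetical base benchmark class; shared timing logic lives
    here so each backend subclass only implements the two hooks."""

    def __init__(self, model_path: str, precision: str = "int4") -> None:
        self.model_path = model_path
        self.precision = precision
        self.latencies: list[float] = []

    @abstractmethod
    def load_model(self):
        """Backend-specific model loading (e.g. AutoAWQ)."""

    @abstractmethod
    def run_inference(self, prompt: str, max_tokens: int, temperature: float) -> str:
        """Backend-specific single-prompt generation."""

    def benchmark(self, prompt: str, max_tokens: int = 512, temperature: float = 0.1) -> str:
        # Time one generation and record the latency for reporting.
        start = time.perf_counter()
        output = self.run_inference(prompt, max_tokens, temperature)
        self.latencies.append(time.perf_counter() - start)
        return output
```

A new backend would then subclass `BaseBenchmark` and override only the two abstract hooks, inheriting the timing and bookkeeping for free.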

Addition of newer model

This PR also adds newer models like Llama-2 Chat and Mistral-Instruct and removes the old one.

Interface for empirical quality checks

This PR also changes the folder structure to the following layout:

Logs/
├── llama
│   └── AutoAWQ-2024-04-09 19:10:10.112947
│       ├── performance.log
│       └── quality_check.json
└── mistral
    └── AutoAWQ-2024-04-09 19:12:20.052184
        ├── performance.log
        └── quality_check.json

For each model-specific benchmark, a new folder is created under that model's name. It contains a performance.log file, which tracks memory consumption and latency, and a quality_check.json file containing a JSON object of the following format:

{
    "int4": [
        {
            "question": {
                "prompt": "I'm making pancakes for breakfast. I added a cup of flour, a teaspoon of salt, and a few tablespoons of sugar to a bowl. I stirred it together, then added a cup of milk, a beaten egg, and a few tablespoons of oil, and stirred until just mixed. Then I put 1/4 a cup on a hot frying pan, and flipped it when brown. But they're terrible! Why? List the main reason. Answer as much precise as possible with one sentence.",
                "max_tokens": 512,
                "temperature": 0.1,
                "expected": "baking soda is missing. a leavening agent should be added"
            },
            "max_token": 512,
            "temperature": 0.1,
            "actual": "The main reason for the terrible pancakes could be that the batter was not properly mixed, resulting in lumps and an uneven consistency.",
            "expected": "baking soda is missing. a leavening agent should be added"
        },
        {
            "question": {
                "prompt": "42 birds are sitting on a tree branch. A hunter passes, shoots one dead, and misses two. How many birds are left on the branch? Answer as much precise as possible with one sentence.",
                "max_tokens": 512,
                "temperature": 0.1,
                "expected": "0. all the birds flew away after the first shot"
            },
            "max_token": 512,
            "temperature": 0.1,
            "actual": "One bird is left on the branch.",
            "expected": "0. all the birds flew away after the first shot"
        }
    ]
}

The example above is from the AutoAWQ benchmark (which contains only int4).
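The folder layout and the quality_check.json file above could be produced by helpers along these lines. This is a sketch under assumptions: the function names `create_log_dir` and `write_quality_check`, and the shape of `results`, are illustrative, not the PR's actual code:

```python
import json
from datetime import datetime
from pathlib import Path


def create_log_dir(model_name: str, engine: str, root: str = "Logs") -> Path:
    # Build e.g. Logs/llama/AutoAWQ-2024-04-09 19:10:10.112947/
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
    log_dir = Path(root) / model_name / f"{engine}-{stamp}"
    log_dir.mkdir(parents=True, exist_ok=True)
    return log_dir


def write_quality_check(log_dir: Path, results: dict) -> None:
    # `results` maps a precision key ("int4", ...) to a list of
    # question/actual/expected records as in the example above.
    with open(log_dir / "quality_check.json", "w") as f:
        json.dump(results, f, indent=4)
```

Keeping one timestamped directory per run means repeated benchmarks of the same model never overwrite each other's performance.log or quality_check.json.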

CLI and Report making

This PR adds a utils module with a function to launch CLI arguments and a function to make a report from the benchmark results.
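The two utilities might look roughly like the sketch below. The flag names and the plain-text report format are assumptions for illustration, not necessarily what this PR implements:

```python
import argparse


def launch_cli() -> argparse.ArgumentParser:
    # Hypothetical benchmark flags; actual names may differ.
    parser = argparse.ArgumentParser(description="Run a benchmark")
    parser.add_argument("--prompt", type=str, required=True)
    parser.add_argument("--max_tokens", type=int, default=512)
    parser.add_argument("--repetitions", type=int, default=10)
    parser.add_argument("--temperature", type=float, default=0.1)
    return parser


def make_report(latencies: dict[str, list[float]]) -> str:
    # Summarise mean latency per precision into a small text table.
    lines = ["precision | mean latency (s)"]
    for precision, values in latencies.items():
        lines.append(f"{precision} | {sum(values) / len(values):.3f}")
    return "\n".join(lines)
```

Centralising argument parsing and report generation in utils keeps each backend's benchmark script down to the model-specific code.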