I'm not sure whether this feature belongs in this library or would need a completely separate one. I'm proposing a library for running LLM benchmarks, for example evaluating a model on HumanEval. Such a library would make evaluating LLMs much easier. I liked the dockerized approach used at https://github.com/NVlabs/verilog-eval to evaluate generated code safely. Evaluating the math and reasoning skills of LLMs could also be beneficial.
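
To make the idea more concrete, here is a rough sketch (not a proposed API, just an illustration) of how dockerized evaluation of a single HumanEval-style task could look. The names `EvalTask` and `run_in_docker` are hypothetical, and it assumes a local Docker installation plus the `python:3.11-slim` image:

```python
# Hypothetical sketch: run an untrusted model completion against its tests
# inside an isolated Docker container, similar in spirit to verilog-eval.
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class EvalTask:
    """One benchmark problem: a prompt, a model completion, and a test harness."""
    prompt: str
    completion: str
    test_code: str


def run_in_docker(task: EvalTask, timeout_s: int = 30) -> bool:
    """Execute prompt + completion + tests in a container; True if tests pass."""
    program = "\n".join([task.prompt, task.completion, task.test_code])
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program)
        try:
            result = subprocess.run(
                [
                    "docker", "run", "--rm",
                    "--network", "none",      # no network access for untrusted code
                    "-v", f"{tmp}:/work:ro",  # mount the script read-only
                    "python:3.11-slim",
                    "python", "/work/candidate.py",
                ],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # treat hangs/infinite loops as failures
        return result.returncode == 0
```

A library could then just map something like this over a benchmark dataset and report pass@k, while handling model prompting and sandboxing behind the scenes.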