Compare the following against each other:
gpt-3.5-turbo
gpt-4-turbo
(if I have the time to look into it) an open-source model, e.g. Llama or Mistral
GPT + RAG with some "general materials/chemistry papers"
a human baseline??
Perhaps in a bar chart where we can show: "GlossaGen is super good, it saves us time AND gives high-quality output, this is great". (The presentation could then be: short intro, short methods, some benchmarking charts, live demo. Possibly this is a moonshot and will take longer than 2 minutes.) A rough harness sketch follows below.
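A minimal sketch of what that benchmark harness could look like, assuming placeholder `generate_glossary()` and `score_glossary()` functions (neither is GlossaGen's actual API) and illustrative candidate names; it records time-to-create per candidate and plots the two bar charts:

```python
"""Sketch of a benchmark harness; candidate names, generate_glossary() and
score_glossary() are placeholders, not GlossaGen's actual API."""
import time
import matplotlib.pyplot as plt

# Candidate setups from the list above; the human baseline would be timed/scored manually.
CANDIDATES = ["gpt-3.5-turbo", "gpt-4-turbo", "mistral-7b", "gpt-4-turbo+RAG"]


def generate_glossary(candidate: str, paper_text: str) -> str:
    """Placeholder: call whichever backend the candidate name refers to."""
    raise NotImplementedError


def score_glossary(glossary: str, reference: str) -> float:
    """Placeholder: quality metric in [0, 1]; see the metric sketch further down."""
    raise NotImplementedError


def benchmark(paper_text: str, reference: str) -> dict[str, tuple[float, float]]:
    """Return {candidate: (seconds_to_create, quality_score)}."""
    results = {}
    for candidate in CANDIDATES:
        start = time.perf_counter()
        glossary = generate_glossary(candidate, paper_text)
        elapsed = time.perf_counter() - start
        results[candidate] = (elapsed, score_glossary(glossary, reference))
    return results


def plot(results: dict[str, tuple[float, float]]) -> None:
    """Two bar charts: time to create and quality score per candidate."""
    names = list(results)
    times = [results[n][0] for n in names]
    scores = [results[n][1] for n in names]
    fig, (ax_time, ax_score) = plt.subplots(1, 2, figsize=(10, 4))
    ax_time.bar(names, times)
    ax_time.set_ylabel("time to create (s)")
    ax_score.bar(names, scores)
    ax_score.set_ylabel("quality score")
    fig.autofmt_xdate(rotation=30)
    fig.tight_layout()
    fig.savefig("benchmark.png")
```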
For this, we would need some sort of "metric" for how good the model's output is. We could also take "time to create" into account. Basically a number that captures what you wrote about the Dentistry Zeolites output: "It's a convincing start and only takes 3 seconds to create, but it's flat-out wrong". How could we turn that into a measurable metric? One idea is sketched below.
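One way to make "convincing but flat-out wrong" measurable: have a judge model rate each generated definition against the source text, and combine that with term coverage against a human-written reference glossary. This is only a sketch; the judge model, prompt, and 50/50 weighting are assumptions open for discussion:

```python
"""Sketch of a quality metric: LLM-as-judge correctness plus term coverage.
Judge model, prompt wording, and weights are assumptions, not settled choices."""
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Given this excerpt from a paper:\n{context}\n\n"
    "Rate the factual correctness of this definition of '{term}':\n{definition}\n"
    "Answer with a single integer from 1 (flat-out wrong) to 5 (fully correct)."
)


def judge_definition(term: str, definition: str, context: str) -> int:
    """Ask a judge model (gpt-4-turbo here) for a 1-5 correctness rating."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, term=term, definition=definition)}],
    )
    # Naive parsing: assumes the judge really does answer with a bare integer.
    return int(response.choices[0].message.content.strip())


def score_glossary(generated: dict[str, str], reference_terms: set[str],
                   context: str) -> float:
    """Composite score in [0, 1]: 50% term coverage, 50% mean correctness."""
    coverage = len(set(generated) & reference_terms) / max(len(reference_terms), 1)
    if generated:
        correctness = sum(
            judge_definition(term, definition, context)
            for term, definition in generated.items()
        ) / (5 * len(generated))
    else:
        correctness = 0.0
    return 0.5 * coverage + 0.5 * correctness
```

Time to create would then be reported alongside this score (as in the harness above) rather than folded into one number, so the chart can show "fast AND good" separately.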