Compare the following against each other:
gpt-3.5-turbo
gpt-4-turbo
(if I have the time to look into it) an open-source model, e.g. Llama or Mistral
GPT + RAG with some "general materials/chemistry papers"
a human baseline??
Perhaps in a bar chart where we can show: "GlossaGen is super good, it saves us time AND gives high-quality output, this is great". (The presentation could then be: short intro, short methods, some benchmarking charts, live demo. Possibly this is a moonshot and will take longer than 2 minutes.) A rough harness sketch follows below.
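A minimal sketch of what that benchmark harness could look like, assuming placeholder `generate_glossary()` and `score_glossary()` functions (neither is GlossaGen's actual API) and illustrative candidate names; it records time-to-create per candidate and plots the two bar charts:

```python
"""Sketch of a benchmark harness; candidate names, generate_glossary() and
score_glossary() are placeholders, not GlossaGen's actual API."""
import time
import matplotlib.pyplot as plt

# Candidate setups from the list above; the human baseline would be timed/scored manually.
CANDIDATES = ["gpt-3.5-turbo", "gpt-4-turbo", "mistral-7b", "gpt-4-turbo+RAG"]


def generate_glossary(candidate: str, paper_text: str) -> str:
    """Placeholder: call whichever backend the candidate name refers to."""
    raise NotImplementedError


def score_glossary(glossary: str, reference: str) -> float:
    """Placeholder: quality metric in [0, 1]; see the metric sketch further down."""
    raise NotImplementedError


def benchmark(paper_text: str, reference: str) -> dict[str, tuple[float, float]]:
    """Return {candidate: (seconds_to_create, quality_score)}."""
    results = {}
    for candidate in CANDIDATES:
        start = time.perf_counter()
        glossary = generate_glossary(candidate, paper_text)
        elapsed = time.perf_counter() - start
        results[candidate] = (elapsed, score_glossary(glossary, reference))
    return results


def plot(results: dict[str, tuple[float, float]]) -> None:
    """Two bar charts: time to create and quality score per candidate."""
    names = list(results)
    times = [results[n][0] for n in names]
    scores = [results[n][1] for n in names]
    fig, (ax_time, ax_score) = plt.subplots(1, 2, figsize=(10, 4))
    ax_time.bar(names, times)
    ax_time.set_ylabel("time to create (s)")
    ax_score.bar(names, scores)
    ax_score.set_ylabel("quality score")
    fig.autofmt_xdate(rotation=30)
    fig.tight_layout()
    fig.savefig("benchmark.png")
```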
For this, we would need some sort of "metric" for how good the model's output is. We could also take "time to create" into account. Basically a number that captures what you wrote about the Dentistry Zeolites output: "It's a convincing start and only takes 3 seconds to create, but it's flat-out wrong". How could we turn that into a measurable metric? One idea is sketched below.
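One way to make "convincing but flat-out wrong" measurable: have a judge model rate each generated definition against the source text, and combine that with term coverage against a human-written reference glossary. This is only a sketch; the judge model, prompt, and 50/50 weighting are assumptions open for discussion:

```python
"""Sketch of a quality metric: LLM-as-judge correctness plus term coverage.
Judge model, prompt wording, and weights are assumptions, not settled choices."""
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Given this excerpt from a paper:\n{context}\n\n"
    "Rate the factual correctness of this definition of '{term}':\n{definition}\n"
    "Answer with a single integer from 1 (flat-out wrong) to 5 (fully correct)."
)


def judge_definition(term: str, definition: str, context: str) -> int:
    """Ask a judge model (gpt-4-turbo here) for a 1-5 correctness rating."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, term=term, definition=definition)}],
    )
    # Naive parsing: assumes the judge really does answer with a bare integer.
    return int(response.choices[0].message.content.strip())


def score_glossary(generated: dict[str, str], reference_terms: set[str],
                   context: str) -> float:
    """Composite score in [0, 1]: 50% term coverage, 50% mean correctness."""
    coverage = len(set(generated) & reference_terms) / max(len(reference_terms), 1)
    if generated:
        correctness = sum(
            judge_definition(term, definition, context)
            for term, definition in generated.items()
        ) / (5 * len(generated))
    else:
        correctness = 0.0
    return 0.5 * coverage + 0.5 * correctness
```

Time to create would then be reported alongside this score (as in the harness above) rather than folded into one number, so the chart can show "fast AND good" separately.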