haesleinhuepf / human-eval-bia

Benchmarking Large Language Models for Bio-Image Analysis Code Generation
MIT License

Benchmark against bigger open models #59

Open dcfidalgo opened 2 months ago

dcfidalgo commented 2 months ago

Hi @haesleinhuepf, I work at the Max Planck Computing and Data Facility and collaborate closely with @nscherf's group. I was wondering if you would be interested in running the benchmark against some bigger open models like "command-r" or "llama3-70B", which beat some versions of GPT4 in the chatbot arena. I could set up these models here on our HPC systems. Let me know what you think. Have a great day! David

haesleinhuepf commented 2 months ago

Hi David @dcfidalgo ,

great initiative! I'd love to learn more about larger LLMs, and also about the mid-sized models that I couldn't benchmark myself. Let me also get Jean-Karim @jkh1 in the loop, who had similar thoughts.

Is there anything we could do to make your life easier setting things up at MPCDF?

Thanks!

Best, Robert

jkh1 commented 2 months ago

@dcfidalgo I am planning to produce samples for the following:

I already have samples for some, others are still running. I am happy to run others if they fit (I am running on Kubernetes with 4 Tesla-P40 GPUs and 128 GB RAM).

EDIT: Forgot to say that I already have command-r-plus:104b_q4 installed. I wasn't planning to include it in the tests, but I can do so.
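Whether a model "fits" can be roughly estimated from its parameter count and quantization. The snippet below is only a back-of-the-envelope sketch, not how any particular runtime allocates memory. It assumes 24 GB per Tesla P40, exactly 4 bits per weight for a q4 model (real quantization formats use slightly more), and a hypothetical 20% overhead for context cache and activations. The function names are made up for illustration.

```python
def estimate_weight_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits_on_gpus(
    n_params_billion: float,
    bits_per_weight: float,
    n_gpus: int = 4,                 # setup described above: 4 GPUs
    gib_per_gpu: float = 24.0,       # Tesla P40 memory size
    overhead_fraction: float = 0.2,  # ASSUMPTION: headroom for cache/activations
) -> bool:
    """Return True if weights plus the assumed overhead fit in total GPU memory."""
    needed = estimate_weight_gib(n_params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= n_gpus * gib_per_gpu


if __name__ == "__main__":
    # Illustrative parameter counts and precisions only.
    for name, params, bits in [
        ("104B model, 4-bit", 104, 4),
        ("70B model, 4-bit", 70, 4),
        ("70B model, 16-bit", 70, 16),
    ]:
        size = estimate_weight_gib(params, bits)
        print(f"{name}: ~{size:.1f} GiB weights, fits: {fits_on_gpus(params, bits)}")
```

Under these assumptions, a 4-bit 104B model stays well within the combined GPU memory, while an unquantized 16-bit 70B model does not.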

dcfidalgo commented 2 months ago

@jkh1 Nice, these are the models I also had in mind. If you need computing power or further support from our side, let me know. I'm really curious about the results compared to the closed models :)

nscherf commented 2 months ago

Sounds exciting @jkh1 and @dcfidalgo !

haesleinhuepf commented 2 months ago

Hi all,

just out of curiosity, a colleague pointed me to FastChat. Is any of you using it?

jkh1 commented 2 months ago

No, I wasn't aware of it. It seems to be what's powering blablador. At first glance, it looks a bit light on documentation, and installing models looks more cumbersome than with ollama.