the-crypt-keeper / can-ai-code

Self-evaluating interview for AI coders
https://huggingface.co/spaces/mike-ravkine/can-ai-code-results
MIT License
524 stars 30 forks

Evaluate WizardLM-2 8x22B #191

Closed krzysiekpodk closed 4 months ago

krzysiekpodk commented 5 months ago

hey,

Not sure if you have seen: https://prollm.toqan.ai/leaderboard

In my opinion, the most interesting takeaway is that if you filter by "advanced" in the code-recent category (problems unseen in training), the only model that rivals proprietary models under that selection is WizardLM-2 8x22B.

It would be really interesting to see whether your benchmark also scores it that high.

the-crypt-keeper commented 4 months ago

@krzysiekpodk Evaluating a 160B model at FP16 is beyond my available resources as a hobbyist 😢 but I can blow some of this month's cloud GPU budget on trying the 4-bit quants - any preferences?
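A rough back-of-the-envelope calculation shows why FP16 is out of reach for a hobbyist while 4-bit quants are plausible on rented cloud GPUs. This is a sketch: the ~141B figure is the commonly cited total parameter count for Mixtral-style 8x22B models, and it ignores KV cache and activation memory, so real requirements are somewhat higher.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate GiB needed just to hold the weights
    (ignores KV cache, activations, and framework overhead)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

# Compare FP16 vs 4-bit for both the ~141B and the ~160B figure above.
for params in (141, 160):
    fp16 = weight_gib(params, 16)
    q4 = weight_gib(params, 4)
    print(f"{params}B params: FP16 ~ {fp16:.0f} GiB, 4-bit ~ {q4:.0f} GiB")
```

At FP16 the weights alone are roughly 260-300 GiB (multiple 80 GB cards), while a 4-bit quant drops that to roughly 65-75 GiB, which fits on a single rented A100/H100-class node.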

the-crypt-keeper commented 4 months ago

@krzysiekpodk Results are somewhat mixed:

(screenshot: benchmark results table)

The original Mixtral-Instruct-8x22B performs quite well no matter the quant: AWQ, GPTQ, and EXL2 are all within spitting distance of each other.

WizardLM-2 8x22B AWQ appears to be broken and scores poorly. I could not find a GPTQ. EXL2 performs well.

krzysiekpodk commented 4 months ago

Thank you!! This is interesting: it looks like good scores are still not consistent across different benchmarks for open-source models. Maybe it's time for a leaderboard of leaderboards? :D