Thanks for your interest in our work, but I am not quite sure what you mean.
If you are asking for an evaluation that does not use an LLM as the judge: the Open LLM Leaderboard is one that does not rely on an LLM judge.
If you are asking about using non-API models as the judge: I think you can directly replace the API call with normal local inference on the non-API model. However, I don't think using non-API models as the judge is widely accepted, because of their relatively weak capability.
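To make the "replace the API call with local inference" idea concrete, here is a minimal sketch, not the code from our repo. It assumes the judge only needs a function that maps a prompt string to a reply string, so the backend can be swapped freely. The names `JUDGE_TEMPLATE`, `parse_rating`, `judge`, and `make_local_generate` are hypothetical, the prompt is a placeholder rather than our actual judge prompt, and the local backend assumes the Hugging Face `transformers` package is installed.

```python
import re
from typing import Callable, Optional

# Hypothetical judge prompt; the real judge prompt is not reproduced here.
JUDGE_TEMPLATE = (
    "Rate the following answer on a scale of 1 to 10.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply in the form 'Rating: <number>'."
)


def parse_rating(text: str) -> Optional[int]:
    """Extract the first 'Rating: N' from the judge's reply, or None if absent."""
    match = re.search(r"Rating:\s*(\d+)", text)
    return int(match.group(1)) if match else None


def judge(question: str, answer: str, generate: Callable[[str], str]) -> Optional[int]:
    """Score an answer with any backend that maps a prompt to a reply string."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    return parse_rating(generate(prompt))


def make_local_generate(model_name: str) -> Callable[[str], str]:
    """Build a generate() backed by a local Hugging Face model (assumed setup)."""
    # Imported lazily so the rest of the sketch works without transformers.
    from transformers import pipeline

    pipe = pipeline("text-generation", model=model_name)

    def generate(prompt: str) -> str:
        # return_full_text=False keeps only the newly generated continuation.
        out = pipe(prompt, max_new_tokens=32, return_full_text=False)
        return out[0]["generated_text"]

    return generate
```

With this shape, something like `judge(question, answer, make_local_generate("gpt2-xl"))` would run the judge on a local model instead of an API. Note that a weaker model may not follow the reply format, in which case `parse_rating` returns `None`, which is one practical symptom of the capability concern mentioned above.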
Is there an evaluation of non-API models, such as LLaMA 7B, GPT xl, etc.?