symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/
MIT License

Unable to run benchmark tasks on Windows due to incorrect directory creation syntax #101

Closed mkovelamudi closed 6 months ago

mkovelamudi commented 6 months ago

Unable to run benchmark tasks using the "eval-dev-quality evaluate" command on Windows because ":" is used in the name of the directory being created. See the attached screenshot of the error.

Steps to reproduce

  1. Install Go, Git
  2. Clone the repository and install the application.
  3. Run "eval-dev-quality evaluate --model=openrouter/meta-llama/llama-3-70b-instruct" command.

System Configuration: Windows 10 Pro

Findings: In the evaluate.go file (line 67), the following format is used for directory creation: command.ResultPath = strings.ReplaceAll(command.ResultPath, "%datetime%", evaluationTimestamp.Format("2006-01-02-15:04:05"))

I believe ":" is not allowed in directory names on Windows, so creating the result directory fails.

[Image: screenshot of the error]

zimmski commented 6 months ago

Thanks @mkovelamudi for the bug report. We only test on Linux but there should be no reason that this doesn't work on Windows. Let me give it a try for a few minutes and get back to you. I do not have a Windows/MacOS machine, so working on non-Linux thingies takes a bit.

zimmski commented 6 months ago

Added the easier OS first: macOS support in https://github.com/symflower/eval-dev-quality/pull/102, which shows that we can port to other OSes. Now taking a look at supporting Windows.

zimmski commented 6 months ago

@mkovelamudi will continue tomorrow. I feared that Windows would be more work, and I was right. You can track the progress at https://github.com/symflower/eval-dev-quality/pull/103. Looks like the Symflower cache cannot be loaded for some reason. Need to dig in.

mkovelamudi commented 6 months ago

@zimmski please take your time. I was just trying it out, noticed the issue, and wanted to report it.

zimmski commented 6 months ago

@mkovelamudi that was quite a ride of Windows nonsense to figure out. Please give it a try and let me know how the eval runs on your machine!