mlc-ai / web-llm

High-performance In-browser LLM Inference Engine
https://webllm.mlc.ai
Apache License 2.0

Accuracy of Phi-3.5 Vision Models in Web-LLM is way off #586

Open atlury opened 2 weeks ago

atlury commented 2 weeks ago

I have been comparing the accuracy of the Phi-3.5 Vision model in Web-LLM (https://chat.webllm.ai/#/chat) against the NVIDIA NIM version available at https://build.nvidia.com/microsoft/phi-3_5-vision-instruct.

  1. The NVIDIA NIM results

(a) Image - square root image
Question - describe the image mathematically and solve it
Answer - The image shows a hand-drawn representation of a square root symbol with the number 25 underneath it. Mathematically, this can be represented as √25. The square root of 25 is 5, so the solution to the mathematical expression is 5.

(b) Image - x+y=4 image
Question - describe the image mathematically and solve it
Answer - The image shows a simple algebraic equation and two separate equations. The main equation is 'x + y = 4', which is a linear equation with two variables, x and y. The first sub-equation is 'x = 2', which gives the value of x. The second sub-equation is 'y = ?', which is asking for the value of y. To solve for y, we substitute the value of x into the main equation: 2 + y = 4. By subtracting 2 from both sides, we find that y = 2. Therefore, the solution to the system of equations is x = 2 and y = 2.

(c) Image - x^2+y^2=100 image
Question - describe the image mathematically and solve it
Answer - The image contains a handwritten equation and a question. The equation is:
x^2 + y^2 = 100
The question asks for the value of y when x is given as 5. To solve for y, we substitute x with 5 in the equation:
(5)^2 + y^2 = 100
25 + y^2 = 100
Now, we solve for y^2:
y^2 = 100 - 25
y^2 = 75
Taking the square root of both sides to solve for y:
y = ±√75
y = ±√(25 * 3)
y = ±5√3
Therefore, the value of y is ±5√3.

  2. The Web-LLM (fp32) results from https://chat.webllm.ai/#/chat

(a) Image - square root image
Question - describe the image mathematically and solve it
Answer - The image provided does not contain a clear mathematical express to interpret or solve it......

(b) Image - x+y=4 image
Question - describe the image mathematically and solve it
Answer - The image provided seems to contain a hand-drawn image of a mathematical expression......

(c) Image - x^2+y^2=100 image
Question - describe the image mathematically and solve it
Answer -

The Web-LLM fp32 Phi-3.5 Vision model doesn't give the same results and doesn't solve the problems. Any ideas?

Attached images sq x y xplusysq
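For reference, the three baseline answers quoted above can be checked numerically. This is only a small sanity check of the arithmetic in the NIM responses; the variable names are made up for illustration, and it says nothing about how either model reads the images.

```python
import math

# (a) sqrt(25)
root = math.sqrt(25)

# (b) x + y = 4 with x = 2
x_b = 2
y_b = 4 - x_b

# (c) x^2 + y^2 = 100 with x = 5, so y = ±sqrt(75) = ±5*sqrt(3)
x_c = 5
y_c = math.sqrt(100 - x_c ** 2)

print(root)                                  # 5.0
print(y_b)                                   # 2
print(math.isclose(y_c, 5 * math.sqrt(3)))   # True
```

All three baseline answers hold up, so the NIM output is a reasonable reference point for the comparison.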

CharlieFRuan commented 2 weeks ago

Thanks! Will take a look this week. Note, though, that your baseline seems to be an unquantized version of phi3.5-vision, while WebLLM uses a 4-bit quantized one (hence the q4 in the code name).
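To give a rough feel for what 4-bit quantization does to weights, here is a minimal sketch that round-trips random values through symmetric per-group integer quantization and measures the error. It is an illustration under stated assumptions, not the scheme WebLLM or MLC actually uses: the group size, the symmetric scaling, and the Gaussian "weights" are all made up for the example.

```python
import random

def quantize_dequantize(weights, bits=4, group_size=32):
    """Round-trip weights through symmetric per-group integer quantization.

    Hypothetical scheme for illustration only; real engines differ in detail.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed values
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # One scale per group, chosen so the largest magnitude maps to qmax.
        # Fall back to 1.0 when the whole group is zero.
        scale = max(abs(w) for w in group) / qmax or 1.0
        for w in group:
            q = max(-qmax, min(qmax, round(w / scale)))
            restored.append(q * scale)
    return restored

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # made-up weights
err4 = mean_abs_error(weights, quantize_dequantize(weights, bits=4))
err8 = mean_abs_error(weights, quantize_dequantize(weights, bits=8))
print(f"4-bit mean abs error: {err4:.6f}")
print(f"8-bit mean abs error: {err8:.6f}")
```

The per-weight error at 4 bits is noticeably larger than at 8 bits, but how much that affects end-task accuracy depends on the model and the task, so a sketch like this cannot by itself explain the gap reported in this issue.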

atlury commented 2 weeks ago

Yes thanks @CharlieFRuan.

I understand it's quantized, but would that cause such a large difference? I will wait for your results.

TomLucidor commented 1 week ago

Seconding this. Are there established results or articles that favor either Q4 and Q5, or higher-bit quantization like Q6 and Q8?

atlury commented 1 week ago

@TomLucidor I haven't seen many accuracy benchmarks comparing quantized and unquantized models, especially for the vision models above. I plan to try out the OpenVINO version.

I have played with the temperature and top-p parameters, and there isn't much improvement in output quality.
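One possible reason is that temperature and top-p only reshape the token distribution the model has already produced; they do not change which token it ranks highest. The sketch below shows this with made-up logits. It is a generic illustration of temperature scaling and nucleus (top-p) filtering, not WebLLM's sampler, and `sampling_distribution` is a hypothetical helper.

```python
import math

def sampling_distribution(logits, temperature=1.0, top_p=1.0):
    """Return token probabilities after temperature scaling and top-p filtering."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of highest-probability tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the kept tokens; everything else gets zero probability.
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

logits = [2.0, 1.0, 0.5, -1.0]  # made-up values for four candidate tokens
for t, p in [(1.0, 1.0), (0.5, 1.0), (1.0, 0.8)]:
    dist = sampling_distribution(logits, temperature=t, top_p=p)
    print(f"temperature={t}, top_p={p}:", [round(x, 3) for x in dist])
```

In every setting the first token stays the most likely one. If the model's top candidates are already wrong, for example because the image was misread, tuning these two parameters would not be expected to recover the right answer.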