grorge123 opened 3 weeks ago
Would love to see a performance comparison of the same model on llama.cpp on Intel CPUs.
I ran a simple test on an i7-12700K. Model: llama-2-7b-chat.Q4_0. Number of input tokens: 12.
Does that mean it is actually slower than llama.cpp?
Yes, the current result shows Neural Speed is slower than llama.cpp. In addition, running Neural Speed directly also cannot match llama.cpp's runtime performance.
@grorge123 Could you check how many CPU cores are used in both benchmarks? According to the documentation, Neural Speed should perform better on Intel CPUs than other runtimes.
Neural Speed uses all of the CPU cores (20).
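To rule out a thread-count mismatch between the two benchmarks, both runtimes should be pinned to the same number of cores. A minimal Rust sketch to report what the machine exposes (the `OMP_NUM_THREADS` remark is an assumption about Neural Speed's OpenMP build, not something confirmed in this thread):

```rust
use std::thread;

fn main() {
    // Number of logical cores the OS exposes to this process; on an
    // i7-12700K this is 20 (8 P-cores with hyper-threading + 4 E-cores).
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("logical cores available: {}", cores);

    // For a fair comparison, launch llama.cpp with `-t <cores>` and,
    // assuming Neural Speed uses an OpenMP backend, set
    // OMP_NUM_THREADS=<cores> in the environment before its benchmark.
}
```

Comparing the reported value against what each benchmark actually uses would confirm whether both runs saturate the same core count.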
I updated the Neural Speed version to 1.0. The new result is: Load time: 4989 ms, Compute time: 96263 ms.
I tested on another computer with an i7-10700; the other variables are the same.
In this case Neural Speed has better performance than llama.cpp, but I have no idea why the i7-10700 performs better than the i7-12700K.
Hello, I am a code review bot on flows.network. Here are my reviews of code commits in this PR.
Overall Summary: The GitHub Pull Request titled "Add neural speed example" introduces new functionality efficiently by adding new files, updating dependencies, implementing logic for using the Neural Speed plugin, and updating relevant documentation. However, there are several potential issues and areas for improvement identified in the individual summaries:
Potential Issues and Errors:
Most Important Findings:
Details
Commit 612c2c396653f0911f3ded717016627f41a9b51a
Key Changes:
New files are added under the wasmedge-neuralspeed directory.
Potential Problems:
The use of /usr/local/bin and /usr/local/lib in the installation instructions may not be suitable for all systems and could lead to errors on different setups.
Overall, the patch introduces new functionality efficiently but could be improved in terms of error handling and documentation.
Commit 14cb1941e1cde1feecdd7b70f5bd4cca6503e125
Key Changes:
Potential Problems:
The method call 'context.fini_single()' has been replaced with 'graph.unload()', but there is no direct explanation provided as to why this change was made. It's important to ensure that this change doesn't introduce any new issues or break existing functionality.
It seems like the comment at the end of the patch file is incomplete (line: graph.unload().expect("Failed to free resource");). It is advisable to provide a meaningful comment explaining the rationale behind this change. Overall, it's important to review the impact of replacing 'context.fini_single()' with 'graph.unload()' to ensure that it aligns with the project's design and functionality. The changes should also be properly documented for better understanding by other contributors.
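Since unload() must now be called explicitly, one defensive pattern worth considering is an RAII guard so the graph is freed even on early returns or panics. This is only a sketch with a stand-in Graph type, not the real wasmedge-wasi-nn API:

```rust
use std::cell::Cell;
use std::rc::Rc;

// Stand-in for a plugin-backed graph handle; the real type would come
// from the wasmedge-wasi-nn crate and talk to the Neural Speed plugin.
struct Graph {
    freed: Rc<Cell<bool>>,
}

impl Graph {
    fn unload(&mut self) -> Result<(), String> {
        self.freed.set(true); // release plugin resources here
        Ok(())
    }
}

// RAII guard: Drop guarantees unload() runs, so a forgotten explicit
// call cannot leak the graph.
struct GraphGuard(Graph);

impl Drop for GraphGuard {
    fn drop(&mut self) {
        self.0.unload().expect("Failed to free resource");
    }
}

fn main() {
    let freed = Rc::new(Cell::new(false));
    {
        let _guard = GraphGuard(Graph { freed: freed.clone() });
        // ... run inference while the guard is alive ...
    } // guard dropped here: unload() runs automatically
    assert!(freed.get(), "graph should be unloaded after the guard drops");
    println!("graph unloaded: {}", freed.get());
}
```

Whether this fits depends on the project's design; the key point is that cleanup tied to Drop cannot be skipped by an early return path.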
Commit 75d677266bbf80900b0038f65de3261908334cca
Key Changes:
Potential Problems: