opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
156 stars 123 forks source link

[Enhancement] Introduce C++ formatter in JNI. #2250

Open 0ctopus13prime opened 2 weeks ago

0ctopus13prime commented 2 weeks ago

Description

Currently, we don’t have a formatting tool enabled for C++ code, and as a result, the codebase has a mix of different coding styles. In contrast, spotless is already enabled for Java, preventing builds if code does not follow the established format. I propose adopting a similar approach for our C++ (JNI) code.

Establishing a consistent code style for C++ in our repository would enhance readability and facilitate future collaboration. Allowing multiple styles to coexist can lead to the "Broken Windows Effect," where visible inconsistencies invite further deviations, ultimately contributing to a less organized and harder-to-maintain codebase.

Tools

After researching several options, I propose setting up clang-format, which is compatible across major platforms—Linux, macOS, and Windows. The tool is also easy to integrate with popular IDEs like CLion, VSCode, and others, making clang-format a straightforward choice for our needs.

Note : We can also consider to use spotless - https://github.com/diffplug/spotless

Code Style

The key principle is to adopt a unified and widely accepted code style that maintainers prefer. Naturally, auto-generated JNI method signatures will remain unchanged, even after introducing the formatter, as the method naming pattern is a contract that the JVM relies on to locate the methods it calls.

I want to put it on a vote:

Required Changes

General idea is to make Makefile generated by CMake to call clang-format to format C++ codes. I prefer this approach since I like a system to take care of choirs by itself, but open to other approaches.

Or, alternatively we can make it reject to build if it found there are misformatted files similar to what we are doing in Java. Once it's rejected, user must run formatting manually before building.

Expected Hassle

Since it requires clang-format, we need to overhaul all the pipeline, CI infra had it already to avoid program not found exception during building.

So I'm proposing 2 stages of introduction:

1st phase : Introduce formatting, but run it only if the program exists.

find_program(CLANG_FORMAT_EXECUTABLE clang-format)
if(CLANG_FORMAT_EXECUTABLE)
   ...
endif()

2nd phase : After make sure that pipelines, CI and other major third party won't have issue with formatting, we can enforce the formatting.

Merge conflicts

Since it is expected to bring changes most files, all changes before formatting in dev branch will likely hit merge conflicts.

jmazanec15 commented 2 weeks ago

I want to put it on a vote:

I typically refer to google's style guide. But I dont have a strong preference. Do you have one you like in particularly?

Since it requires clang-format, we need to overhaul all the pipeline, CI infra had it already to avoid program not found exception during building.

It seems spotless supports it: https://github.com/diffplug/spotless. Would we be able to leverage gradle to do this so we dont have to worry about dependencies?

Since it is expected to bring changes most files, all changes before formatting in dev branch will likely hit merge conflicts.

This is unfortunate but a one time cost. I think its okay.

0ctopus13prime commented 2 weeks ago

I typically refer to google's style guide. But I dont have a strong preference. Do you have one you like in particularly?

I really don't care haha. But, the google style is the most comfortable for me.

0ctopus13prime commented 2 weeks ago

It seems spotless supports it: https://github.com/diffplug/spotless. Would we be able to leverage gradle to do this so we dont have to worry about dependencies?

Nice!! then we can decouple formatting from CMake entirely ONLY IF it can format as much as clang-format. I think it would be easier to modify gradle file than doing it in CMake. Hoping that spotless can support as much as clang-format does. I will attach the comparisons between two.

0ctopus13prime commented 2 weeks ago

Looks like spotless internally calls clang-format? - https://github.com/diffplug/spotless/blob/main/plugin-gradle/README.md#cc

spotless {
  cpp {
    target 'src/native/**' // you have to set the target manually

    clangFormat()  // has its own section below
    eclipseCdt()   // has its own section below

    licenseHeader '/* (C) $YEAR */' // or licenseHeaderFile
  }
}
0ctopus13prime commented 2 weeks ago

I think we can also consider eclipse CDT formatting with Google code style : https://github.com/google/styleguide/blob/gh-pages/eclipse-cpp-google-style.xml We can import the above XML into IDE (Clion or VSCode), and the good thing is that the formatter downloading will be taken care of by gradle, we don't actually need to care much about the library dependency. Also we don't need to modify CMake which is bit trickier to edit compared to gradle.

In build.gradle file

spotless {
  cpp {
    target('jni/include/**', 'jni/src/**')
    eclipseCdt('11.6').configFile('/tmp/eclipse-cpp-google-style.xml')
  }
}

Result:

-jlong knn_jni::commons::storeVectorData(knn_jni::JNIUtilInterface *jniUtil, JNIEnv *env, jlong memoryAddressJ,
-                                        jobjectArray dataJ, jlong initialCapacityJ, jboolean appendJ) {
-    std::vector<float> *vect;
-    if ((long) memoryAddressJ == 0) {

+jlong knn_jni::commons::storeVectorData(knn_jni::JNIUtilInterface *jniUtil,
+                                        JNIEnv *env, jlong memoryAddressJ,
+                                        jobjectArray dataJ,
+                                        jlong initialCapacityJ,
+                                        jboolean appendJ) {
+  std::vector<float> *vect;
+  if ((long) memoryAddressJ == 0) {
jmazanec15 commented 2 weeks ago

@0ctopus13prime did you get a poc working? If so Im good with google guide and eclipse

0ctopus13prime commented 2 weeks ago

Sure, let me raise a PR shortly. I'm planning to start working on this after bumping up JNI C++ version to 17.

0ctopus13prime commented 2 weeks ago

@jmazanec15 Hey Jack, everything seems to be working fine, just doing a last check.

FAISS uses snake_case for method names (NMSLIB uses camelCase, but that's going to be deprecated in the future anyway). The Google C++ style guide uses PascalCase for method names (e.g., Analyze(int foo)), but do you think we should align with FAISS and switch to snake_case (e.g., range_search(idx_t n, ...))? We could still follow other parts of the Google C++ style, just changing the method name style to snake_case.

// Google C++ code style

class MyClass {
 public:
  void Analyze(const std::string &text);
  void Analyze(const char *text, size_t textlen);
};
// Faiss
struct IndexHNSW : Index {
    typedef HNSW::storage_idx_t storage_idx_t;
    void range_search(
            idx_t n,
            const float* x,
            float radius,
            RangeSearchResult* result,
            const SearchParameters* params = nullptr) const override;
jmazanec15 commented 2 weeks ago

In general, I think PascalCase is fine. I dont think we need to align with any libraries. We just need to make sure our conventions are consistent.

0ctopus13prime commented 2 weeks ago

@jmazanec15 Sounds good, once C++ 17 update PR is merged will raise a new PR for this shortly

0ctopus13prime commented 1 week ago

Enabled C++17 in JNI, will continue working on this.

0ctopus13prime commented 1 week ago

Eclipse CDT mostly works but it's not perfect. Seems struggling when parsing template <. It keeps recognizing it as a comparison sign.

std::unique < XXX
  > xxx = ...

Expected
std::unique<XXX> xxx = ...

I'm inclining toward clang-format + clang-tidy options which rely on clang lexer, providing more accurate linting and static analysis.
Will evaluate how easy to automate installing them in either build.gradle or CMAKE. They are very standard in C++ world, it should be straightforward to install them in most popular platforms - Mac, Linux and Windows

0ctopus13prime commented 1 week ago

After running clang-tidy with Google code style on our JNI codes, I got around 330 warnings (excluding faiss and nmslib). I’ll clean these up in a PR soon. Also, it looks pretty manageable to make a cross-platform install script. I’ll test it on the major platforms first then include it in the PR once it’s ready.

jmazanec15 commented 1 week ago

sounds good! So, would gradle not auto-configure dependency?

0ctopus13prime commented 1 week ago

@jmazanec15 Not for clang-format. It just calls it to format the code, but I will not download it unfortunately. :stuck_out_tongue: But it does for eclipse CDT. Another bad luck is that it can't really do the formatting the template code... Will share a comprehensive PR having an install script for clang-format + tidy as well as the result code after applied formatting shortly!!