languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

speed up spelling rules on single words (as they're invoked in IntelliJ) #10883

Closed donnerpeter closed 2 months ago

donnerpeter commented 2 months ago

Quickly discard antipatterns if they're longer than the sentence. In spelling rules, antipatterns always have more than one token, so such a check speeds up getSentenceWithImmunization considerably.

This reduces the running time reported by testSeparateCorrectWordPerformance from 12-20 seconds to 2.7 seconds on my machine.
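The idea can be sketched roughly as follows. This is a simplified illustration, not LanguageTool's actual code: `ImmunizationSketch`, its nested `AntiPattern` class and `applyAntiPatterns` are hypothetical names, and tokens are plain strings rather than analyzed tokens.

```java
import java.util.List;

final class ImmunizationSketch {

  /** Hypothetical stand-in for a multi-token antipattern (not a LanguageTool class). */
  static final class AntiPattern {
    final int minTokenCount; // fewest tokens this pattern needs in order to match

    AntiPattern(int minTokenCount) {
      this.minTokenCount = minTokenCount;
    }

    /** Cheap pre-check: a pattern longer than the sentence can never match it. */
    boolean canBeIgnoredFor(int sentenceTokenCount) {
      return sentenceTokenCount < minTokenCount;
    }

    /** Placeholder for the expensive token-by-token matching. */
    void match(String[] tokens) {
      // intentionally empty in this sketch
    }
  }

  /** Runs only the antipatterns that could fit into the sentence; returns how many ran. */
  static int applyAntiPatterns(String[] tokens, List<AntiPattern> antiPatterns) {
    int executed = 0;
    for (AntiPattern antiPattern : antiPatterns) {
      if (antiPattern.canBeIgnoredFor(tokens.length)) {
        continue; // discarded without doing any matching work
      }
      antiPattern.match(tokens);
      executed++;
    }
    return executed;
  }

  public static void main(String[] args) {
    List<AntiPattern> antiPatterns = List.of(new AntiPattern(2), new AntiPattern(3));
    System.out.println(applyAntiPatterns(new String[] {"Haus"}, antiPatterns));               // prints 0
    System.out.println(applyAntiPatterns(new String[] {"zu", "Hause", "sein"}, antiPatterns)); // prints 2
  }
}
```

For a single-word input, every antipattern that needs two or more tokens is skipped by the pre-check, which is the case the IntelliJ integration hits most often.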


coderabbitai[bot] commented 2 months ago

Walkthrough

The pull request introduces several enhancements across different classes in the languagetool-core module. Key changes include the addition of a method to count non-whitespace tokens in the AnalyzedSentence class, optimizations in the Rule class to improve performance when handling anti-patterns, and the introduction of a minimum token count requirement in the AbstractTokenBasedRule class. Additionally, a new performance test method is added to the HunspellRuleTest class to evaluate the efficiency of the Hunspell rule. Minor adjustments in import statements are also included.

Changes

| File | Change summary |
| --- | --- |
| `languagetool-core/src/main/java/org/languagetool/AnalyzedSentence.java` | Added `public int getNonWhitespaceTokenCount()` method to retrieve the count of non-whitespace tokens. |
| `languagetool-core/src/main/java/org/languagetool/rules/Rule.java` | Optimized `getSentenceWithImmunization` method by storing the `getAntiPatterns()` result in a local variable and modified the iteration method for accessing anti-patterns. |
| `languagetool-core/src/main/java/org/languagetool/rules/patterns/AbstractTokenBasedRule.java` | Introduced `private final byte minTokenCount` to enforce a minimum token requirement and updated the `canBeIgnoredFor` method to check against `minTokenCount`. |
| `languagetool-language-modules/de/src/test/java/org/languagetool/rules/spelling/hunspell/HunspellRuleTest.java` | Added `public void testSeparateCorrectWordPerformance()` for performance testing of the Hunspell rule, annotated with `@Ignore`. |


Recent review details

**Configuration used: CodeRabbit UI**
**Review profile: CHILL**

Commits: Files that changed from the base of the PR and between c5f2b9b1062925ddaf12a97f82c072f0a51d2e33 and a329580f4e2bab5eef1916d19dbf70888345006e.

Files selected for processing (4)

* languagetool-core/src/main/java/org/languagetool/AnalyzedSentence.java (1 hunks)
* languagetool-core/src/main/java/org/languagetool/rules/Rule.java (1 hunks)
* languagetool-core/src/main/java/org/languagetool/rules/patterns/AbstractTokenBasedRule.java (3 hunks)
* languagetool-language-modules/de/src/test/java/org/languagetool/rules/spelling/hunspell/HunspellRuleTest.java (3 hunks)

Additional comments not posted (10)
languagetool-core/src/main/java/org/languagetool/rules/patterns/AbstractTokenBasedRule.java (6)
`42-43`: **LGTM!** The new field `minTokenCount` is properly declared as `private` and `final`. Using `byte` type is a good choice for storing small integer values to optimize memory usage.

---

`51-51`: **LGTM!** The initialization logic for `minTokenCount` is updated correctly to check if the first pattern token can match the start of a sentence using the `canMatchSentenceStart` method. This is a valid optimization.

---

`54-56`: **LGTM!** Incrementing `minTokenCount` based on the `minOccurrence` of each pattern token is the correct approach. It ensures that `minTokenCount` reflects the minimum number of tokens required for the rule to match, which aligns with the purpose of introducing this field for optimization.

---

`83-85`: **LGTM!** Setting the final value of `minTokenCount` by taking the minimum of the calculated `minTokenCount` and `Byte.MAX_VALUE` is the correct approach. It ensures that the value does not exceed the maximum value that can be stored in a `byte`, handling the edge case properly.

---

`86-88`: **LGTM!** The new method `canMatchSentenceStart` is implemented correctly. It checks if a pattern token can match the start of a sentence based on the token's properties. Extracting this logic into a separate method improves code readability and reusability.

---

`96-96`: **LGTM!** The update to the `canBeIgnoredFor` method to check if the number of non-whitespace tokens in the sentence is less than `minTokenCount` is a valid optimization. It allows quickly discarding sentences that do not have enough tokens to match the rule, aligning with the purpose of introducing `minTokenCount`.

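A minimal sketch of the counting and clamping described in these comments, assuming each pattern token is reduced to its `minOccurrence` value. `MinTokenCountSketch` and its method signatures are hypothetical, and the sentence-start adjustment done via `canMatchSentenceStart` is left out because its exact rule is not spelled out here.

```java
final class MinTokenCountSketch {

  /** Sums the per-token minimum occurrences and saturates at Byte.MAX_VALUE. */
  static byte minTokenCount(int[] minOccurrences) {
    int total = 0; // accumulate in an int so the sum cannot wrap around like a byte would
    for (int minOccurrence : minOccurrences) {
      total += minOccurrence;
    }
    return (byte) Math.min(total, Byte.MAX_VALUE);
  }

  /** A sentence with fewer non-whitespace tokens than the minimum cannot match. */
  static boolean canBeIgnoredFor(int nonWhitespaceTokenCount, byte minTokenCount) {
    return nonWhitespaceTokenCount < minTokenCount;
  }

  public static void main(String[] args) {
    // Three pattern tokens, the middle one optional (minOccurrence 0).
    byte min = minTokenCount(new int[] {1, 0, 1});
    System.out.println(min);                           // prints 2
    System.out.println(canBeIgnoredFor(1, min));       // prints true
    System.out.println(canBeIgnoredFor(2, min));       // prints false
    System.out.println(minTokenCount(new int[] {100, 100})); // prints 127
  }
}
```

Clamping errs on the safe side: a rule that really needs more than 127 tokens is merely skipped less often than it could be, never skipped wrongly.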
languagetool-core/src/main/java/org/languagetool/AnalyzedSentence.java (1)
`166-172`: **LGTM!** The addition of the `getNonWhitespaceTokenCount()` method is a great optimization. By returning the length of the existing `nonBlankTokens` array, it provides an efficient way to retrieve the count of non-whitespace tokens without creating additional allocations. This approach avoids unnecessary memory usage and improves performance. The `@ApiStatus.Internal` annotation appropriately indicates that this method is intended for internal use within the LanguageTool framework.
languagetool-language-modules/de/src/test/java/org/languagetool/rules/spelling/hunspell/HunspellRuleTest.java (2)
Line range hint `23-36`: **LGTM!** The changes to the import statements are minor and do not affect the functionality of the test class. The static import of assertions is a good practice for readability.

---

`243-270`: **Approve the addition of the performance test method.** The new `testSeparateCorrectWordPerformance` method is a valuable addition to the test suite. It provides insights into the efficiency of the Hunspell rule in processing a set of predefined German words. The use of `@Ignore` annotation ensures that the test is not executed during regular test runs, as it is intended for internal performance testing. The method follows a clear structure:

1. Initialize the language tool and retrieve the Hunspell rule.
2. Analyze the predefined words.
3. Measure the time taken to run a loop that checks for rule matches.
4. Print the elapsed time to the console.

The test method does not introduce any functional changes to the existing code and serves as a useful benchmark for evaluating the performance of the Hunspell rule.

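The four steps above follow a common shape for a hand-rolled timing test. The sketch below shows that shape only: the dictionary lookup is a hypothetical stand-in for the Hunspell rule, and `SingleWordBenchmarkSketch`, `countMisspelled` and `timeMillis` are illustrative names that do not appear in the PR.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.TimeUnit;

final class SingleWordBenchmarkSketch {

  // Step 1: set up the checker once (here: a tiny stand-in dictionary).
  static final Set<String> DICTIONARY = Set.of("Haus", "Baum", "Wald");

  /** Stand-in for running the spelling rule on each single-word input. */
  static int countMisspelled(List<String> words) {
    int misspelled = 0;
    for (String word : words) {
      if (!DICTIONARY.contains(word)) {
        misspelled++;
      }
    }
    return misspelled;
  }

  /** Step 3: run the task repeatedly and return the elapsed wall-clock milliseconds. */
  static long timeMillis(Runnable task, int iterations) {
    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      task.run();
    }
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
  }

  public static void main(String[] args) {
    // Step 2: prepare the inputs outside the timed region.
    List<String> words = List.of("Haus", "Baum", "Wald", "Hauss");
    long elapsed = timeMillis(() -> countMisspelled(words), 100_000);
    // Step 4: report the measurement.
    System.out.println("Elapsed: " + elapsed + " ms");
  }
}
```

Numbers from a loop like this vary between runs and machines, which fits the 12-20 second range quoted in the PR description; a harness such as JMH would be the usual choice when stable figures matter.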
languagetool-core/src/main/java/org/languagetool/rules/Rule.java (1)
Line range hint `193-208`: **LGTM! The changes optimize the method without altering its core functionality.** The introduction of the `antiPatterns` local variable to store the result of `getAntiPatterns()` is a good optimization. It ensures that the method is called only once, reducing redundant calls and improving performance. The change from a for-each loop to a traditional for loop using an index is also intentional, as indicated by the suppression annotation `//noinspection ForLoopReplaceableByForEach`. This change maintains consistency with the surrounding code and may have performance benefits. The core logic of replacing the `immunizedSentence` remains intact, preserving the method's functionality.
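A small sketch of the loop shape described here, under the assumption that the anti-patterns are held in a random-access list. `CachedGetterLoopSketch`, the string-valued patterns and the `getterCalls` counter are hypothetical and exist only to make the effect visible.

```java
import java.util.ArrayList;
import java.util.List;

final class CachedGetterLoopSketch {

  static int getterCalls = 0; // instrumentation for this sketch only
  static final List<String> ANTI_PATTERNS = new ArrayList<>(List.of("a b", "c d e"));

  /** Stand-in for a getter that may be overridden or do non-trivial work. */
  static List<String> getAntiPatterns() {
    getterCalls++;
    return ANTI_PATTERNS;
  }

  /** Sums the pattern lengths, calling the getter exactly once. */
  static int totalLength() {
    List<String> antiPatterns = getAntiPatterns(); // fetched once, reused below
    int total = 0;
    //noinspection ForLoopReplaceableByForEach
    for (int i = 0; i < antiPatterns.size(); i++) { // indexed access, no Iterator object
      total += antiPatterns.get(i).length();
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println(totalLength()); // prints 8
    System.out.println(getterCalls);   // prints 1
  }
}
```

With an indexed loop the local variable matters: without it, the getter would be called again in every loop condition and every element access. Whether skipping the `Iterator` is measurable depends on the JIT, so the gain from that part is best treated as unverified.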