languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

[pt] Major fix in verbs/nouns detection in disambiguation.xml #10991

Closed marcoagpinto closed 2 weeks ago

marcoagpinto commented 2 weeks ago

Hello!

Another major enhancement, which also fixes the sentence “Ele a ama e ela o ama também.” (“He loves her and she loves him too.”): “ele” and “ela” were being detected as verbs, which caused the verb reading of “ama” to be removed incorrectly.


coderabbitai[bot] commented 2 weeks ago

Walkthrough

The changes in this pull request involve modifications to the disambiguation.xml file for Portuguese language rules in LanguageTool. A new rule has been added to the RARE_POS group to handle specific verb forms, along with updates to existing rules to refine patterns and exceptions. Additionally, a commented-out rule group related to sentence tokenization has been removed. These alterations aim to enhance the disambiguation logic for verb forms and improve the overall accuracy of the rules.

Changes

File Path: `languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml`

Change Summary:

- Added a new rule in the RARE_POS group for specific verb forms.
- Updated an existing rule with a new exception pattern.
- Removed a commented-out rule group related to sentence tokenization.
- Minor adjustments to existing rules for improved accuracy.


📜 Recent review details

**Configuration used: CodeRabbit UI**
**Review profile: CHILL**

📥 Commits

Reviewing files that changed from the base of the PR and between c0e7e56087f5a4163254dd93404c4d879e66dc1f and e3369c22a7579da91860c80496bceaee15ea5b41.

📒 Files selected for processing (1)

* `languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml` (2 hunks)

🔇 Additional comments (2)
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/disambiguation.xml (2)

`254-266`: **LGTM! The new rule enhances disambiguation for verb-noun ambiguity.**

The rule effectively handles cases where words could be interpreted as either verbs or nouns, particularly in the context of special pronouns. The comment indicates that this has been verified with ChatGPT 4 for 3217 results.

Let's verify the impact of this rule:

```shell
#!/bin/bash
# Description: Verify the new rule's pattern matches and its potential impact

# Search for cases where the rule might apply
rg -A 5 'postag="V.+" postag_regexp="yes".*postag="NC.+" postag_regexp="yes"' || echo "No matching cases found"

# Look for any existing rules that might conflict
rg -B 5 -A 5 'action="remove" postag="V\.\+"' || echo "No conflicting rules found"
```

---

`243-243`: **LGTM! The pattern helps prevent false positives in verb detection.**

The added pattern correctly excludes common parts of speech that could be mistaken for verbs, improving disambiguation accuracy.

Let's verify the impact of this change:
✅ Verification successful

The search results show that the pattern is consistent with similar disambiguation rules in Portuguese and other Romance languages (Spanish and Catalan). The pattern helps distinguish between verbs and other parts of speech, particularly in cases where words can have multiple interpretations.

Let me verify one more aspect to ensure the exceptions are properly aligned with the Portuguese language's tagging scheme:

---

The search results confirm that the pattern and its exceptions are consistent with Portuguese language's tagging scheme. The pattern `` is used in multiple disambiguation rules in the Portuguese module, and the part-of-speech tags (CS, RG, NC, AQ, CC, SPS, D, P) are actively used throughout the file for various disambiguation rules.

**The pattern correctly excludes non-verb parts of speech, improving disambiguation accuracy.**

- The pattern is well-aligned with Portuguese language's tagging scheme
- Similar patterns are used in other rules within the same file
- The exceptions cover all relevant non-verb categories that could be ambiguous
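To make the effect of such an exception pattern concrete, here is a toy Python model of the behaviour reported in the PR description. It is a sketch under several assumptions, not the rule from `disambiguation.xml`: the readings table, the tag strings, the three-token rule shape (verb-like token, one token in between, then a verb/noun-ambiguous token) and the `guard` parameter are all hypothetical. Only the category prefixes (CS, RG, NC, AQ, CC, SPS, D, P) and the example sentence come from the discussion above.

```python
import re

# Illustrative readings only. The tag strings are assumptions chosen to
# resemble the tag prefixes named in the review; they are not dictionary output.
READINGS = {
    "ele": ["PP3MS000", "VMSP3S0"],   # pronoun plus a spurious verb reading
    "ela": ["PP3FS000", "VMIP3S0"],   # pronoun plus a spurious verb reading
    "a": ["DA0FS0", "PP3FSA00", "SPS00"],
    "o": ["DA0MS0", "PP3MSA00"],
    "ama": ["VMIP3S0", "NCFS000"],    # "loves" (verb) or "nanny" (noun)
    "e": ["CC"],
    "também": ["RG"],
    "vi": ["VMIS1S0"],                # unambiguous verb
}

VERB = re.compile(r"V.+")
NOUN = re.compile(r"NC.+")
# Non-verb categories listed in the review: CS, RG, NC, AQ, CC, SPS, D, P.
NON_VERB = re.compile(r"(CS|RG|NC|AQ|CC|SPS|D|P).*")


def has(tags, pattern):
    """Return True if any reading in tags fully matches pattern."""
    return any(pattern.fullmatch(tag) for tag in tags)


def disambiguate(words, guard):
    """Apply a hypothetical 'verb + X + verb/noun -> drop verb' rule.

    With guard=False, any token that has a verb reading triggers the rule.
    With guard=True, a token that also has a non-verb reading is skipped,
    which models the exception pattern discussed above.
    """
    tokens = [list(READINGS[word.lower()]) for word in words]
    for i in range(len(tokens) - 2):
        first, third = tokens[i], tokens[i + 2]
        looks_like_verb = has(first, VERB)
        if guard and has(first, NON_VERB):
            looks_like_verb = False
        if looks_like_verb and has(third, VERB) and has(third, NOUN):
            # Keep only the non-verb readings of the ambiguous token.
            tokens[i + 2] = [tag for tag in third if not VERB.fullmatch(tag)]
    return tokens


if __name__ == "__main__":
    sentence = "Ele a ama e ela o ama também".split()
    for guard in (False, True):
        result = disambiguate(sentence, guard=guard)
        print(guard, list(zip(sentence, result)))
```

In this model, the unguarded rule strips the verb reading from both occurrences of "ama", because "ele" and "ela" each carry a verb reading. With the guard enabled, both occurrences keep their verb reading, while a sentence such as "Vi a ama" still resolves "ama" to a noun.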
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Check for potential cases where the new pattern might affect verb detection
# Look for cases where words are tagged with multiple POS including verbs and the excluded categories

# Search for words with multiple POS tags including verbs
rg -A 2 'postag="[^"]*V[^"]*,[^"]*(?:CS|RG|NC|AQ|CC|SPS|[DP])[^"]*"' || echo "No problematic cases found"
```

Length of output: 10432

---

Script:

```shell
#!/bin/bash
# Check the tagset used in Portuguese rules and dictionaries
rg -A 2 'postag="(?:CS|RG|NC|AQ|CC|SPS|[DP])[^"]*"' languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/ || echo "No matching patterns found"

# Also check for any existing similar patterns in Portuguese rules
rg -A 2 '
```
