guardian / typerighter

Even if you’re the right typer, couldn’t hurt to use Typerighter!
Apache License 2.0
276 stars 12 forks source link

Give each RuleMatch an explicit priority from Matchers, and order by priority when removing overlapping matches #446

Closed rhystmills closed 1 year ago

rhystmills commented 1 year ago

Currently, style guide (regex) rules are removed in favour of dictionary rules when they overlap.

For example - let's say the text "Brigitte Trogneux" appears in an article, and the following are true:

In this case, we would want "Bridgitte Trogneux" to appear with an orange 'Review' underline because of the style guide rule, but it would currently get a red 'Amend' from the dictionary matcher. This is because we are effectively prioritising by category id in reverse alphabetical order, which hasn't been a big issue until now because clashes between LanguageTool rules and style guide rules were not very common.

We want to make sure that Style Guide rules trump Dictionary rules.

Note: I've removed a test checking that rules are ordered by category, as it's no longer the case, and added a test elsewhere for overlapping rule removal where priority is specified.

What does this change?

This PR adds an explicit priority to CheckerRule instances which can be used to deal with ranges of text that match more than one matcher type. Priority is associated with the rule type - specifically:

LanguageTool rules are prioritised above Style Guide (regex) rules, which are prioritised above Dictionary rules.

A bit of an oddity - Dictionary rules appear as LtRule when they are used after being extracted from the LanguageTool instance that contains them (part of the Dictionary Matcher) so we identify them there via their unique ID.

How to test

  1. Run this project locally alongside Composer, according to the instructions in the repos.
  2. Make sure you have the following line in Composer's config in ~/.gu/flexible-composerbackend.properties in order to use local Typerighter Checker rather than CODE: typerighter.url=https://checker.typerighter.local.dev-gutools.co.uk
  3. Write "bridgitte is a grownup" on its own line and run a check. Does "bridgitte" get a red "Capitalization" category underline from the language tool "Amend" rule? Does "grownup" get an orange "review" underline from its regex "style guide" rule? If so, both rules are overriding collins "Amend" rules.
samanthagottlieb commented 1 year ago

I have run this locally and tested the phrase "bridgitte is a grownup”. "bridgitte" gets a red "Capitalization" category underline and "grownup" gets an orange "review" underline. However “is” also has a red ‘Collins Dictionary’ category underline, with suggested replacements e.g. “BS”, “CIS”, etc. I have found “is” in my local database, and have tested this phrase in CODE, where it doesn’t underline “is”, so I’m not sure what’s going on here! It may be a specific issue with my local db, if you’re not seeing this issue?

I also tested “Brigitte Trogneux”, expecting the full name to have a “Check this” underline, showing a ‘Review’ style guide rule. However, I’m getting a red ‘Collins Dictionary’ underline for “Bridgitte” and an orange Collins underline for “Trogneux” (compared to two red underlined in CODE), so it doesn’t seem to be picking up the style guide rule (I confirmed that this rule is in my local db). Are you seeing something different locally?

rhystmills commented 1 year ago

I have run this locally and tested the phrase "bridgitte is a grownup”. "bridgitte" gets a red "Capitalization" category underline and "grownup" gets an orange "review" underline. However “is” also has a red ‘Collins Dictionary’ category underline, with suggested replacements e.g. “BS”, “CIS”, etc. I have found “is” in my local database, and have tested this phrase in CODE, where it doesn’t underline “is”, so I’m not sure what’s going on here! It may be a specific issue with my local db, if you’re not seeing this issue?

I also tested “Brigitte Trogneux”, expecting the full name to have a “Check this” underline, showing a ‘Review’ style guide rule. However, I’m getting a red ‘Collins Dictionary’ underline for “Bridgitte” and an orange Collins underline for “Trogneux” (compared to two red underlined in CODE), so it doesn’t seem to be picking up the style guide rule (I confirmed that this rule is in my local db). Are you seeing something different locally?

I had this at one point today (on a different branch) but re-running the services resolved everything, think it might be to do with the checker service getting in an odd state, but haven't been able to replicate since. Might be worth testing again as I've made some more changes.

samanthagottlieb commented 1 year ago

Just tested again - bridgitte is a grownup is now giving expected results (capitalization for 'bridgitte' and check this for 'grownup'). However I’m now getting a red ‘Collins Dictionary’ underline for both 'Bridgitte' and for 'Trogneux'. I've double checked I'm running your branch and have pulled your latest changes.

rhystmills commented 1 year ago

I think that's expected behaviour for now, though we don't want it to be once NER (named entity recognition) is sharpened a bit. Bridgitte and Trogneux are not currently in the dictionary so we'd expect a red underline unless the NER catches them. Lower case capitalisation rule has higher priority than dictionary so you don't see that 'typo' highlighted until it's capitalised - that's actually an indication that priority is working as we'd like it to.

(edit: Samantha reminded me that 'Brigitte Trogneux' was a styleguide rule and should have shown up - I had inserted a d into 'Bridgitte' above so it wasn't being triggered)

samanthagottlieb commented 1 year ago

To capture the conversation on chat, 'Brigitte Trogneux' is being caught by the style guide rule, however I'd been testing it with a 'd', i.e. Bridgitte, hence it was being caught by the dictionary. All is working as expected! 🎉