ota-meshi / eslint-plugin-regexp

ESLint plugin for finding regex mistakes and style guide violations.
https://ota-meshi.github.io/eslint-plugin-regexp/
MIT License

Add `regexp/unicode-property` rule #722

Closed RunDevelopment closed 6 months ago

RunDevelopment commented 7 months ago

Fixes #720

This PR adds a new rule that allows users to enforce the naming of Unicode properties. It has 3 main features:

  1. Removing/adding gc=/General_Category= keys, e.g. \p{gc=L} -> \p{L}. These prefixes are unnecessary, because the values of the General_Category property can be accessed without the key.
  2. Enforcing long or short keys for General_Category/gc, Script/sc, and Script_Extensions/scx.
  3. Enforcing long or short names of values and binary properties. E.g. \p{L} -> \p{Letter} and \p{Hex} -> \p{Hex_Digit}.
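
As a rough illustration of features 1 and 3, the rewrites can be sketched as plain string transformations. This is not the rule's implementation: `removeGcKey`, `useLongNames`, and the tiny `LONG_NAMES` table are hypothetical names made up for this example, and a real lint rule would presumably work on a parsed regex rather than on raw pattern text.

```typescript
// Hypothetical, hand-picked alias table for illustration only.
// The actual rule needs the complete Unicode alias data.
const LONG_NAMES = new Map<string, string>([
    ["L", "Letter"],
    ["Lu", "Uppercase_Letter"],
    ["Hex", "Hex_Digit"],
]);

/** Feature 1: drop a `gc=` / `General_Category=` key from `\p{...}` and `\P{...}`. */
function removeGcKey(pattern: string): string {
    return pattern.replace(
        /\\([pP])\{(?:gc|General_Category)=([^}]+)\}/g,
        (_match: string, p: string, value: string) => `\\${p}{${value}}`,
    );
}

/** Feature 3: expand short names that appear without a key (values and binary properties). */
function useLongNames(pattern: string): string {
    return pattern.replace(
        /\\([pP])\{([^=}]+)\}/g,
        (match: string, p: string, name: string) => {
            const longName = LONG_NAMES.get(name);
            // Unknown names are left untouched.
            return longName === undefined ? match : `\\${p}{${longName}}`;
        },
    );
}

console.log(removeGcKey("\\p{gc=L}"));
console.log(useLongNames("\\p{L}\\P{Hex}"));
```

The sketch deliberately ignores escaped backslashes and character-class context, which is one reason a text-based approach is not suitable for a real rule.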

All of these features can be individually configured and turned off by the user. The regexp/unicode-property rule is not included in our recommended config, because it only enforces a specific style.

Default configuration

The default configuration is the following:

{
    "generalCategory": "never",
    "key": "ignore",
    "property": {
        "binary": "ignore",
        "generalCategory": "ignore",
        "script": "long"
    }
}

This means that, by default, the rule will (1) remove General_Category/gc keys (e.g. \p{gc=L} -> \p{L}) and (2) enforce long names for values of the Script and Script_Extensions properties (e.g. \p{sc=Kana} -> \p{sc=Katakana}).
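
For readers who want to opt in, a config along these lines should work. It is a sketch that assumes the options object is passed as the second element of the rule entry, as is conventional for ESLint rules, and it uses the legacy `.eslintrc`-style shape. The variable names `defaultOptions` and `eslintConfig` are only for illustration.

```typescript
// The default options from above, spelled out explicitly.
const defaultOptions = {
    generalCategory: "never", // remove gc= / General_Category= keys
    key: "ignore", // do not enforce long or short keys
    property: {
        binary: "ignore",
        generalCategory: "ignore",
        script: "long", // e.g. \p{sc=Kana} -> \p{sc=Katakana}
    },
};

// Assumed legacy-style config object; the rule is not in the recommended
// config, so it has to be enabled by hand.
const eslintConfig = {
    plugins: ["regexp"],
    rules: {
        "regexp/unicode-property": ["error", defaultOptions],
    },
};

console.log(JSON.stringify(eslintConfig, null, 4));
```

Passing the defaults explicitly is redundant, but it makes a convenient starting point for changing individual options.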

I chose a minimal configuration because I didn't want the rule to generate a lot of errors for people trying to adopt it. I think the 2 effects work well in any code base, no matter what style they usually prefer. (1) simply removes an unnecessary prefix to "simplify" the regex, and (2) prevents the use of the (IMO) horrible aliases for scripts.

Unicode data

To implement this rule, I needed the data for the mapping between aliases, so I had to choose between taking a dependency (e.g. @unicode/unicode-15.0.0) or including the relevant data in the source files of this project.

I chose against adding a dependency, because it was easy enough to get the data I needed and because most of @unicode/unicode-15.0.0 would be dead weight to us.

However, the data I included is used through an API (the AliasMap class), so we can easily switch to using a dependency without needing to change the regexp/unicode-property rule.
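
To make the idea of hiding the data behind an API concrete, here is a minimal sketch of what such a lookup could look like. `AliasMapSketch`, `toLong`, and `toShort` are hypothetical names for this example. The actual AliasMap class in this PR may have a different shape.

```typescript
/** Hypothetical stand-in for an alias lookup; not the PR's AliasMap. */
class AliasMapSketch {
    private readonly longByName = new Map<string, string>();
    private readonly shortByName = new Map<string, string>();

    constructor(pairs: readonly [string, string][]) {
        for (const [shortName, longName] of pairs) {
            // Both spellings resolve to the same short/long pair.
            for (const name of [shortName, longName]) {
                this.longByName.set(name, longName);
                this.shortByName.set(name, shortName);
            }
        }
    }

    /** Long name for either spelling, or undefined if the name is unknown. */
    toLong(name: string): string | undefined {
        return this.longByName.get(name);
    }

    /** Short name for either spelling, or undefined if the name is unknown. */
    toShort(name: string): string | undefined {
        return this.shortByName.get(name);
    }
}

// A few script aliases as sample data; the real data set is much larger.
const scripts = new AliasMapSketch([
    ["Kana", "Katakana"],
    ["Latn", "Latin"],
    ["Grek", "Greek"],
]);

console.log(scripts.toLong("Kana"), scripts.toShort("Latin"));
```

Because callers only see `toLong` and `toShort`, the pairs passed to the constructor could later come from a dependency instead of inlined data without touching the callers.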

changeset-bot[bot] commented 7 months ago

🦋 Changeset detected

Latest commit: f2ff74524a269532d7f83d9f9b66a8ace9de4edc

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

| Name                 | Type  |
| -------------------- | ----- |
| eslint-plugin-regexp | Minor |
