node-unicode / node-unicode-data

JavaScript-compatible Unicode data generator. Arrays of code points, arrays of symbols, and regular expressions for every Unicode version’s categories, scripts, blocks, and properties — neatly packaged into a separate npm package per Unicode version.
https://mths.be/node-unicode-data
MIT License

node-unicode-data

JavaScript-compatible Unicode data generator. Arrays of code points, arrays of symbols, and regular expressions for every Unicode version’s categories, scripts, script extensions, blocks, bidi data, and other properties — neatly packaged into a separate npm package per Unicode version.

Using the data in your scripts

To use the generated data, install one of the npm packages generated by this script. A separate package is available for each Unicode version. For example:

// Get an array of all code points with the `White_Space` property:
const codePoints = require('@unicode/unicode-6.3.0/Binary_Property/White_Space/code-points');
// Get an array of strings (containing one symbol each) in the `Lu` category:
const symbols = require('@unicode/unicode-6.3.0/General_Category/Uppercase_Letter/symbols');
// Get a regular expression that matches any symbol in the `Aegean Numbers` block:
const regex = require('@unicode/unicode-6.3.0/Block/Aegean_Numbers/regex');
// Get an array of all code points in the `Egyptian_Hieroglyphs` script:
const hieroglyphs = require('@unicode/unicode-6.3.0/Script/Egyptian_Hieroglyphs/code-points');
// Get the canonical category a given code point belongs to:
// (Note: U+0041 is LATIN CAPITAL LETTER A)
const category = require('@unicode/unicode-6.3.0/General_Category').get(0x41);
// Get an array of all code points with a given bidi class:
const lre = require('@unicode/unicode-6.3.0/Bidi_Class/Left_To_Right_Embedding/code-points');
// Get the directionality of a given code point:
const directionality = require('@unicode/unicode-6.3.0/Bidi_Class').get(0x41);
// What glyph is the mirror image of `«` (U+00AB)?
const mirrored = require('@unicode/unicode-6.3.0/Bidi_Mirroring_Glyph').get(0xAB);
// Get a regular expression that matches all opening brackets:
const openingBrackets = require('@unicode/unicode-6.3.0/Bidi_Paired_Bracket_Type/Open/regex');
// …you get the idea.
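
As a rough illustration of what can be done with one of these code point arrays, the sketch below builds a membership test from it. `makeMatcher` and `isAllWhiteSpace` are hypothetical helpers, not part of any generated package, and the short `whiteSpace` array is stand-in data so the snippet runs without installing anything.

```javascript
// Hypothetical helper (not part of any @unicode package).
// `codePoints` is an array of numbers, like the one exported by
// '@unicode/unicode-6.3.0/Binary_Property/White_Space/code-points'.
function makeMatcher(codePoints) {
  const set = new Set(codePoints);
  return (string) => {
    // `for...of` iterates by code point, not by UTF-16 code unit,
    // so astral symbols are handled as single units.
    for (const symbol of string) {
      if (!set.has(symbol.codePointAt(0))) {
        return false;
      }
    }
    return true;
  };
}

// Stand-in data (assumption: a tiny subset, for illustration only).
// In real use, this would be:
// const whiteSpace = require('@unicode/unicode-6.3.0/Binary_Property/White_Space/code-points');
const whiteSpace = [0x09, 0x0A, 0x20, 0xA0];

const isAllWhiteSpace = makeMatcher(whiteSpace);
console.log(isAllWhiteSpace(' \t\n')); // true
console.log(isAllWhiteSpace(' a '));   // false
```

Converting the array to a `Set` once keeps each lookup cheap, which matters for the larger properties that contain tens of thousands of code points.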

For more information, see the README for the package you’re interested in. This script generates one npm package per Unicode version, such as @unicode/unicode-6.3.0.

Note that these READMEs are also auto-generated by this script; they describe all the data that is available for that particular Unicode version. To get the list of available categories, scripts, script extensions, blocks, and properties for a given Unicode version programmatically, require the main module for that version:

> require('@unicode/unicode-6.3.0');
{
    'Binary_Property': [
        'Alphabetic', 'Any', 'ASCII', 'ASCII_Hex_Digit', 'Assigned', …
    ],
    'General_Category': [
        'Cased_Letter', 'Close_Punctuation', 'Connector_Punctuation', …
    ],
    'Script': [
        'Arabic', 'Armenian', 'Avestan', …
    ],
    'Script_Extensions': [
        'Arabic', 'Armenian', 'Avestan', …
    ],
    'Block': [
        'Aegean_Numbers', 'Alchemical_Symbols', …
    ],
    'Case_Folding': [
        'C', 'F', 'S', 'T'
    ],
    'Simple_Case_Mapping': [
        'Uppercase', 'Lowercase', 'Titlecase'
    ],
    'Special_Casing': [
        'Uppercase', 'Lowercase', 'Titlecase', …
    ],
    'Bidi_Class': [
        'Arabic_Letter', 'Arabic_Number', 'Boundary_Neutral', …
    ],
    'Bidi_Mirroring_Glyph': [],
    'Bidi_Paired_Bracket_Type': [
        'Close', 'None', 'Open'
    ]
}
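
One possible use of this index object is to enumerate module paths instead of hard-coding them. The sketch below assumes the `package/Property/Value/file` layout seen in the `require` calls earlier in this README. `listDataPaths` is a hypothetical helper, the `index` object is a hand-written excerpt standing in for the real main module, and not every property/value pair necessarily ships every file type, so check the package README before relying on a given path.

```javascript
// Hypothetical helper (not part of any @unicode package).
// Builds require()-able paths from an index object shaped like the
// one exported by the main module of a generated package.
function listDataPaths(packageName, index, file = 'code-points') {
  const paths = [];
  for (const [property, values] of Object.entries(index)) {
    // Properties with an empty list (e.g. Bidi_Mirroring_Glyph) have no
    // per-value subfolders and therefore contribute no paths here.
    for (const value of values) {
      paths.push(`${packageName}/${property}/${value}/${file}`);
    }
  }
  return paths;
}

// Stand-in excerpt of the index (assumption: copied by hand from the
// listing above). In real use: const index = require('@unicode/unicode-6.3.0');
const index = {
  Bidi_Mirroring_Glyph: [],
  Bidi_Paired_Bracket_Type: ['Close', 'None', 'Open'],
};

const paths = listDataPaths('@unicode/unicode-6.3.0', index, 'regex');
console.log(paths);
// [
//   '@unicode/unicode-6.3.0/Bidi_Paired_Bracket_Type/Close/regex',
//   '@unicode/unicode-6.3.0/Bidi_Paired_Bracket_Type/None/regex',
//   '@unicode/unicode-6.3.0/Bidi_Paired_Bracket_Type/Open/regex'
// ]
```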

For project maintainers

After cloning this repository, before doing anything else, run:

./clone-repos.sh

This clones all the generated repositories to your local output folder. You can then make changes to node-unicode-data, and use ./bootstrap.sh to commit and push changes to each of these repositories.

Generating the data

npm run download (re-)downloads the Unicode source files for all the Unicode versions defined in data/resources.js, saving them in the data folder.

npm run build generates data for all the Unicode versions defined in data/resources.js. This may take a few minutes… The regular expressions are generated using Regenerate.
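
To give a feel for the code-points-to-regex step, here is a deliberately simplified stand-in. It is not Regenerate's algorithm or API: it collapses consecutive code points into ranges and relies on the ES2015 `u` flag with `\u{…}` escapes, which is a choice made here for brevity and not something this project states it does. `toRanges`, `escapeCodePoint`, and `toRegExp` are hypothetical names.

```javascript
// Simplified sketch, NOT how Regenerate works internally.

// Collapse a list of code points into sorted [start, end] ranges.
function toRanges(codePoints) {
  const sorted = [...new Set(codePoints)].sort((a, b) => a - b);
  const ranges = [];
  for (const codePoint of sorted) {
    const last = ranges[ranges.length - 1];
    if (last && codePoint === last[1] + 1) {
      last[1] = codePoint; // extend the current range
    } else {
      ranges.push([codePoint, codePoint]); // start a new range
    }
  }
  return ranges;
}

// Format a code point as a `\u{…}` escape (requires the `u` flag).
const escapeCodePoint = (codePoint) =>
  `\\u{${codePoint.toString(16).toUpperCase()}}`;

// Build a character-class regular expression from the ranges.
function toRegExp(codePoints) {
  const body = toRanges(codePoints)
    .map(([start, end]) =>
      start === end
        ? escapeCodePoint(start)
        : `${escapeCodePoint(start)}-${escapeCodePoint(end)}`
    )
    .join('');
  return new RegExp(`[${body}]`, 'u');
}

const regex = toRegExp([0x41, 0x42, 0x43, 0x1F600]);
console.log(regex.source);           // [\u{41}-\u{43}\u{1F600}]
console.log(regex.test('B'));        // true
console.log(regex.test('\u{1F600}')); // true
console.log(regex.test('D'));        // false
```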

Testing

npm test generates the data for the oldest and the latest available Unicode versions. This is a good way to test changes to the generator scripts before running npm run build.

npm run cover generates the code coverage report.

Author

twitter/mathias
Mathias Bynens

License

This module is available under the MIT license.