microsoft / PhoneticMatching

A phonetic matching library. Includes text utilities to do string comparisons on phonemes (the sound of the string), as opposed to characters.
MIT License
152 stars 31 forks source link

Build status

Introduction

A phonetic matching library. Includes text utilities to do string comparisons on phonemes (the sound of the string), as opposed to characters.

Docs can be found at: https://microsoft.github.io/PhoneticMatching/

Supported API:

Supported Languages

Current pre-built binaries offered to save the trouble of compiling the source locally.

(Run node -p "process.versions.modules" to see which Node-ABI in use.)

Getting Started

This repository consists of TypeScript and native dependencies built with node-gyp. See package.json for various scripts for the development process.

For first time building remember to npm install

This repository uses git submodules. If paths are outdated or non-existent run git submodule update --init --recursive

Install

To install from NPM

npm install phoneticmatching

Usage

See the typings for more details.
Classes prefixed with En make certain assumptions that are specific to the English language.

import { EnPronouncer, EnPhoneticDistance, FuzzyMatcher, AcceleratedFuzzyMatcher, EnHybridDistance, StringDistance } from "phoneticmatching";

Speech The namespace containing the type interfaces of the library objects.

EnPronouncer Pronounces a string, as a General English speaker, into its IPA string or array of Phones format.

matchers module:

distance module:

nlp module:

Here are some example of how to import modules and classes:

import { EnContactMatcher, EnPlaceMatcher } from "phoneticmatching";
import * as Matchers from "phoneticmatching/lib/matchers";

Example

JavaScript

// Import core functionality from the library.
const { EnPhoneticDistance, FuzzyMatcher } = require("phoneticmatching");

// A distance metric over pronunciations.
const metric = new EnPhoneticDistance();

// The target list to match against.
const targets = [
    "Apple",
    "Banana",
    "Blackberry",
    "Blueberry",
    "Grapefruit",
    "Pineapple",
    "Raspberry",
    "Strawberry",
];

// Create the fuzzy matcher.
const matcher = new FuzzyMatcher(targets, metric);
// Find the nearest match.
const result = matcher.nearest("blu airy");
/* The result should be:
 * {
 *     // The object from the targets list.
 *     element: 'Blueberry',
 *     // The distance score the from distance function.
 *     distance: 0.041666666666666664
 * }
 */
console.log(result);

C#

using System;

// Import core functionality from the library.
using Microsoft.PhoneticMatching.Matchers.FuzzyMatcher.Normalized;

public class Program
{
    public static void Main(string[] args)
    {
        // The target list to match against.
        string[] targets = 
        {
            "Apple",
            "Banana",
            "Blackberry",
            "Blueberry",
            "Grapefruit",
            "Pineapple",
            "Raspberry",
            "Strawberry",
        };

        // Create the fuzzy matcher.
        var matcher = new EnPhoneticFuzzyMatcher<string>(targets);

        // Find the nearest match.
        var result = matcher.FindNearest("blu airy");

        /* The result should be:
         * {
         *     // The object from the targets list.
         *     element: 'Blueberry',
         *     // The distance score the from distance function.
         *     distance: 0.0416666666666667
         * }
         */
        Console.WriteLine("element : [{0}] - distance : [{1}]", result.Element, result.Distance);
    }
}

Build

TypeScript Transpiling

npm run tsc

Native Compiling

# X is the parallelization number, usually set to the number of cores of the machine.
# This cleans and rebuilds everything.
JOBS=X npm run rebuild
# For incremental builds.
JOBS=X npm run build

Test

# Requires native dependencies built, but TypeScript transpiling not required.
npm test

Docs

# Generate the doc files from the docstrings.
npm run build-docs

Release

# Builds everything, TypeScript & native & docs, as a release build.
npm run release

Deployment/Upload

Note that the .js library code and native dependencies will be deployed separately. Npm registries will be used for the .js code, node-pre-gyp will be used for prebuilt dependencies while falling back to building on the client.

# Pushes pack to npmjs.com or a private registry if a .npmrc exists.
npm publish
# Packages a ./build/stage/{version}/maluubaspeech-{node_abi}-{platform}-{arch}.tar.gz.
# See package.json:binary.host on where to put it.
npm run package

NuGet Publish

A .NET Core NuGet package is published for this project. The package is published by Microsoft. Hence, it must follow guidance at https://aka.ms/nuget and sign package content and package itself with an official Microsoft certificate. To ease signing and publishing process, we integrate ESRP signing to Azure DevOps build tasks. To publish a new version of the package, create a release for the latest build (Pipelines->Releases->PublishNuget->Create a release).

Contributors

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Reporting Security Issues

Security issues and bugs should be reported privately, via email, to the Microsoft Security Response Center (MSRC) at secure@microsoft.com. You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Further information, including the MSRC PGP key, can be found in the Security TechCenter.

License

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

See sources for licenses of dependencies.