stdlib-js / stdlib

✨ Standard library for JavaScript and Node.js. ✨
https://stdlib.io
Apache License 2.0

[RFC]: Improvements to @stdlib/nlp-expand-contractions #496

Open titanism opened 2 years ago

titanism commented 2 years ago

Description

We're writing because we found your library to be the best tested and fastest for expanding contractions. For context, we're working on https://spamscanner.net, where we expand contractions before passing text to tokenizers for spam classification.

To clarify, this is with regard to the generated codebase https://github.com/stdlib-js/nlp-expand-contractions, built from the source at https://github.com/stdlib-js/stdlib/tree/develop/lib/node_modules/%40stdlib/nlp/expand-contractions.

We noticed that your library is missing quite a few English contractions, and it could also benefit from contractions in other languages (perhaps behind an option).

While we can open a PR, we wanted to check what your thoughts are on this and what you would want the PR to look like (integration-wise; e.g., new options?).

Here is our current compiled list of research and findings:

github-actions[bot] commented 2 years ago

:tada: Welcome! :tada:

And thank you for opening your first issue! We will get back to you shortly. :runner: :dash:

titanism commented 2 years ago

Doing a review and will submit a PR to contractions.json with changes.

Caught some interesting bugs in the JSON, such as "what's": "what has/is" (which is obviously a bug).
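For what it's worth, one quick way to surface similar entries is to scan the mapping for expansions that contain a slash. This is only a rough sketch: it assumes the file is a flat object of contraction to expansion strings (as the "what's" entry suggests), and both the `sample` object and the `findAmbiguous` helper are hypothetical, not part of stdlib.

```javascript
'use strict';

// Hypothetical sample data; only the "what's" entry is quoted from the JSON.
var sample = {
    "what's": 'what has/is',
    "don't": 'do not',
    "can't": 'cannot'
};

// Returns the keys whose expansion packs several alternatives into one string.
function findAmbiguous( table ) {
    return Object.keys( table ).filter( function hasSlash( key ) {
        return table[ key ].indexOf( '/' ) !== -1;
    });
}

// Logs the keys with slash-separated expansions (here, only "what's").
console.log( findAmbiguous( sample ) );
```

Entries flagged this way would still need a manual decision on which single expansion to keep.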

The other question I wanted to raise is that we should probably handle the straight apostrophe (') and the curly apostrophe (’) interchangeably somehow.
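To illustrate the straight (') versus curly (’) apostrophe case, here is a minimal sketch of normalizing the input before any lookup. The `normalizeApostrophes` helper and the choice of code points are our assumptions, not an existing stdlib API; where the normalization should actually live is open for discussion.

```javascript
'use strict';

// Assumed set of apostrophe look-alikes: U+2018, U+2019, U+02BC, U+FF07.
var RE_APOSTROPHES = /[\u2018\u2019\u02BC\uFF07]/g;

// Replaces each look-alike with the ASCII apostrophe (U+0027).
function normalizeApostrophes( str ) {
    return str.replace( RE_APOSTROPHES, '\'' );
}

// The curly apostrophes in the input come back as straight ones.
console.log( normalizeApostrophes( 'I don\u2019t think it\u2019s spam' ) );
```

Running this before the contraction lookup would let a single set of keys in the JSON cover both spellings.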

kgryte commented 2 years ago

Re: missing contractions. Some of the entries in your list are already present in the contractions file. E.g., wouldn't've, mightn't've.

kgryte commented 2 years ago

@Planeshifter Is there a reason for the what has/is entry?

kgryte commented 2 years ago

Re: fancy apostrophe. That should be possible to handle in the @stdlib/nlp/tokenize package.

titanism commented 2 years ago

I'm about to submit a PR, one moment @kgryte

titanism commented 2 years ago

See https://github.com/stdlib-js/stdlib/pull/497

cc @kgryte

kgryte commented 2 years ago

@titanism One recent update: @Planeshifter added initial support for expanding acronyms (see https://github.com/stdlib-js/stdlib/tree/c624a5eb4bca8f4f3d45e01bcc4eeee41652e3ba/lib/node_modules/%40stdlib/nlp/expand-acronyms). This may help to avoid mixing contraction/acronym concerns.