spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License
11.51k stars 654 forks source link

new match2 plugin #765

Closed kelvinhammond closed 4 years ago

kelvinhammond commented 4 years ago

Hello,

I'm building my own match function, probably add it as an extension because it requires a dependency which will increase the code size.

I have so far figured out that I can use termList to get all the terms to match against and now I'm trying to save the groups. How do I set the groups for a matched term, I need to set multiple by name, number, and the total match as group 0? The code for the match function is a bit hard to follow.

Also turns out using a subset of termList with buildFrom to build a doc doesn't actually work as expected, it fails when I run text() on the new document.

spencermountain commented 4 years ago

hey @kelvinhammond that sounds really neat. I'd love to hear more about it. Yeah, the internal-api is a bit complicated. The groups concept especially so. You're right that buildFrom is the method I use to compose new Text objects, with a shared context and data.

the best way to get a feel for groups and buildfrom may be to place a console.log here. Groups is an object - I think numbers will also work as the keys, like you mentioned. yeah, if it is not set correctly, the linked-list will blow-up when you call .text(). Crawling that is the most expensive operation we have, so it's not possible to validate it at create. Can help further if you have more questions cheers

kelvinhammond commented 4 years ago

@spencermountain

So far I've:

  1. Used chevrotain as the parser for the regular expression. This is the library which I've imported which would increase build size.
  2. Implemented pike's vm similar to this but on words.
  3. Implemented groups with subgroup support
  4. Match zero or one, zero or more, and one or more
  5. Greedy / non greedy matching

I still need to implement:

When it comes to compromise currently in my scratch I'm doing:

let ast = null;
// parses the regex and set the ast.prog to a "program" which runs in the vm
ast = parse("remind (you|me|to)+");  
const doc = nlp('remind me to work on a better match');
// matches = { found: true, saved: [terms] }. this will change
const matches = pikevm(ast.prog, doc.list[0].terms());
console.log(matches?.saved?.map(x => x.text));

outputs:

[ 'remind', 'me', 'to' ]

with a non greedy +? modifier ast = parse("remind (you|me|to)+?"); produces [ 'remind', 'me' ]

What is the correct way to generate a document with groups should I create a plugin that adds match2 or something which matches words and returns a list of terms from the termList?

Are there any other features you can think of?

spencermountain commented 4 years ago

HEY COOL!! this is a very neat idea.

... is it fast?? Is that why you're building it? How much bigger is the build with chevrotain?

to be honest, if the tests pass, it works. That's as close there is to a formal spec or anything. There are a lot of weird edge-cases with greedy optionals that compromise doesn't handle properly. I'm sure you've already found them.


Okay, so to answer your question,

const doc = nlp('remind me to work on a better match')
let m = doc.match('remind [<me>me] to')
// get a flat list of Term objects
let terms = m.termList()
// [{Term}, {Term}, {Term}]
let groups = m.groups()
// {me: {Phrase} }

//okay, create a Phrase obj from our terms & groups:
let phrase = doc.list[0].buildFrom(terms[0].id, terms.length, groups)
// create a Text obj from our phrase
let doc2 = doc.buildFrom([phrase])
doc2.debug()

// ensure group is still there:
m.groups('me').debug()

cheers

spencermountain commented 4 years ago

ya, looking at that, I know that's a bit laborious, but that's how i'd do it.

also, if it could make your life easier, we could always make changes to the match syntax. If it meant formalizing it, as a proper dsl, I wouldn't mind dropping some features.

I've tried a few times to make some kind of abstraction for match-syntax - so you can write them as strings, but then if performance is important, turn it into a formal json object.

...If it's 1-to-many, we can optimize the match statement,

if it's many-to-1, we can optimize the document...

any thoughts or advice you have is welcomed

kelvinhammond commented 4 years ago

Updates:

  1. I haven't built compromise with my match function yet
  2. chevrotain can build a syntax diagram from the code I've created, I just need to figure it out.
  3. I've changed the syntax to be more like actual regex but it operates on terms instead.

Features:

Questions:

const reStr = 'remind (me)? to? (?P<what>.+?) (?P<when>#Date+)$';
const regex = nlp.regexp(reStr);
//const doc = nlp("remind me to wash the dishes in 5 minutes").match2(reStr);  // also works
const doc = nlp("remind me to wash the dishes in 5 minutes").match2(regex);
if(doc.found) {
  console.log(doc.text());
  console.log(doc.groups());
}

So far I've done this to run the matcher.

const m = doc.buildFrom(doc.list.map(phrase => {
    const { found, saved, groups } = pikevm(ast.prog, phrase.terms());
    const namedGroups = {};  // TODO: munge to correct format
    return found ?
      phrase.buildFrom(saved[0].id, saved?.length ?? 0, namedGroups)
      : null;
  }).filter(p => p !== null));

Attached is a screenshot of the syntax diagram. I haven't implemented negative groups yet, might do later. Screenshot from 2020-06-29 22-09-06 Screenshot from 2020-06-29 22-11-00

kelvinhammond commented 4 years ago

I have groups and group parsing done, now I just need to:

const m = doc.buildFrom(doc.list.map(phrase => {
    const { found, saved, groups } = pikevm(ast.prog, phrase.terms());
    console.log("found:", found);
    console.log("saved:", saved);
    console.log("groups:", groups);
    const namedGroups = Object.values(groups).reduce((arr, g) => ({
      ...arr,
      [parseInt(g.id)]: {
        group: g?.name ?? `${g.id}`,
        start: g.saved?.[0].id ?? 0,
        length: g.saved?.length ?? 0
      }
    }), {});
    console.log(namedGroups);
    return found ?
      phrase.buildFrom(saved[0].id, saved.length, namedGroups)
      : null;
  }).filter(p => p !== null));
kelvinhammond commented 4 years ago

@spencermountain What are your arguments against the GPL license?

I'm thinking of releasing this as a plugin. I could include it into this repo under plugins/match2 with its own plugins/match2/LICENSE file which would be GPL. Or I could make a separate repo.

I may be convinced to use the MIT license too though. What are your thoughts?

spencermountain commented 4 years ago

Holiday today in Canada, will give this proper attention as soon as I can. Thanks for the information.

On Tue., Jun. 30, 2020, 12:49 a.m. Kelvin, notifications@github.com wrote:

I have groups and group parsing done, now I just need to:

  • figure out where the code goes, library or add to compromise
  • clean things up a bit
  • create a regex class for compile usage
  • make a match function which takes compiled regex or string regex which it then compiles.
  • write some tests

const m = doc.buildFrom(doc.list.map(phrase => { const { found, saved, groups } = pikevm(ast.prog, phrase.terms()); console.log("found:", found); console.log("saved:", saved); console.log("groups:", groups); const namedGroups = Object.values(groups).reduce((arr, g) => ({ ...arr,

    group: g?.name ?? `${g.id}`,
    start: g.saved?.[0].id ?? 0,
    length: g.saved?.length ?? 0
  }
}), {});
console.log(namedGroups);
return found ?
  phrase.buildFrom(saved[0].id, saved.length, namedGroups)
  : null;

}).filter(p => p !== null));

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/spencermountain/compromise/issues/765#issuecomment-651532156, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADBSKINB5G3L22GCVQWH4TRZFVF7ANCNFSM4OJ3RYKQ .

kelvinhammond commented 4 years ago

@spencermountain plugin deployed here https://www.npmjs.com/package/compromise-match2 I could merge it into compromise

spencermountain commented 4 years ago

hey, so I finally got a chance to check this out in detail. This is very neat stuff. I have a bunch of questions, but they are minor, and I am very excited and impressed by this plugin.

1) do you have any sense about its performance, compared with regular match? I tried a few things and it seemed to be similar? I can help throw some bigger documents at it, if that would be helpful.

2) oof - I'm not the right person to ask about licensing details. That stuff goes out my brain when I blink. You're free to use anything you're comfortable with.

3) found 2 errors, I think:

// #1
nlp('so ralf and really eats the glue').match2('* [eats] the', 0)

#2 
nlp('spencer other').match2('!(cool|spencer)')

it was cool that it also found an invalid match statement in the tests!

4) the min build seems to be 130kb, which is cool - are there any quick-wins to shake-down chevrotrain?

5) it would be cool to support the AND syntax, which is often-requested

let m = nlp('june and july cool').match2('(#Date && july)')
m.text() // july

6) very-happy to put this in ./plugins, and list it in the docs. We'll need to add details to the docs.

7 (last) ) goodmatch? smartmatch? I welcome all slams to the original match function :)

thank you!!

kelvinhammond commented 4 years ago

@spencermountain

  1. I tested, the speed is about the same, code below, match2 is mostly meant to match things better and have a syntax closer to regex. Some examples are that you can match a long string of words, have subgroups, lookaheads, etc.
  2. I chose the GPL.
  3. [eat] is invalid it should be more like .* (eats) the. second one would be (?!cool|spencer) a. . matches anything, * modifier zero or moe, (eats) group which is captured and matches word eats. b. the (?! starts a negative lookahead assertion which does not capture or advance position in search c. What does the second arg do, return that group number / name? I can implement it.
  4. Not sure, I already ran rollup on it, not sure what can be removed that wouldn't be needed. I tried not to use too many functions. Any pointers?
  5. hmm, this can probably be done with a lookahead assertion like this (?=#Date) July. a. I found a bug in the not start of matching that I've fixed in 1.0.1 which is live now. The above lookahead should work with this now. Other matches not starting at the first two terms should now work too.

benchmark:

onst Benchmark = require('benchmark');

const nlp = require('compromise');
const { default: compromise_match2, NLPRegexP } = require('./dist/compromise-match2.js');

nlp.extend(compromise_match2);

const suite = new Benchmark.Suite;

let text = "Improved own provided blessing may peculiar domestic. Sight house has sex never. No visited raising gravity outward subject my cottage mr be. Hold do at tore in park feet near my case. Invitation at understood occasional sentiments insipidity inhabiting in. Off melancholy alteration principles old. Is do speedily kindness properly oh. Respect article painted cottage he is offices parlors.";
text = text.replace('.', '');
const tofind = 'sentiments insipidity inhabiting';
const regex = nlp.compileRegex(tofind);

suite
  .add('nlp match', () => {
    nlp(text).match(tofind);
  })  
  .add('match2', () => {
    nlp(text).match2(tofind);
  })  
  .add('compiled match2', () => {
    nlp(text).match2(regex);
  })  
  .on('cycle', function (event) {
    console.log(String(event.target));
  })  
  .on('complete', function (event) {
    console.log('Fastest is ' + this.filter('fastest').map('name'));
  })  
  .run();
spencermountain commented 4 years ago

cool. That sounds great. Thanks for clarifying that. It would help to have a list of syntax-differences between the two implementations. It's cool to have them different, that's no issue. We should try to be as clear as we can about these tiny differences - people are gonna get caught between them.

You're right that the 2nd param in .match() is a group name, so you can return just the group -

doc.match('before [want] after', 0).text()
// 'want'

I really struggled with the () vs [] syntax. I'm not sure it ended-up being clear.

( was originally used to support optional 'or' logic, so we used square-brackets for capture-groups. ... then regex started supporting named capture-groups. Now the situation is a bit of a mess, and it's almost-certainly too-late to release a big breaking change to the match syntax.

I can help track-down tiny-differences between the two implementations - does match2 support recursive matching? .... recursive groups? cheers!

kelvinhammond commented 4 years ago

The syntax I create is meant to match more closely to regular expression, python's re module has a good doc on the syntax.

I'll update the match2 to support a second param for returning the first group. It does support nested groups but similar to other regex engines if a group is matched more than one it only returns the last match for that group.

Patterns like this ((remind|remember) (me|you|.) to? do? (?P<what>.+) (?P<when>#Date+)) are supported which will return for remind me to reply to @spencermountain today :

I'll try to work on some more docs / examples. I have some of them outlined in the README but it could definitely be clearer and should provide examples.

kelvinhammond commented 4 years ago

@spencermountain Here I've release version 1.1.0 that fixes things, adds 100% code coverage, and a few fixes.

The tests, starting here can be used as a reference for now until I can provide better examples.

https://github.com/catalogm/compromise-match2/commit/70fef68eb67725eb0f006954c9efc99c25821a7c

kelvinhammond commented 4 years ago

@spencermountain I still need to write some better docs but I added support for methods using the @ (ex: @hasComma).

The doc.lists() fails when the items contain spaces, likely because the match function does not work well with spaces. What needs to happen to get compromise-match2 merged into this repo so that you or I can possibly update some of the functions to use match2?

There are a few issues:

spencermountain commented 4 years ago

Agh, that’s great. Happy to merge this in! Can you add me as an owner on npm, if I’m gonna help maintain it?

On Fri, Jul 24, 2020 at 2:12 PM Kelvin notifications@github.com wrote:

@spencermountain https://github.com/spencermountain I still need to write some better docs but I added support for methods using the @ (ex: @hasComma).

The doc.lists() fails when the items contain spaces, likely because the match function does not work well with spaces. What needs to happen to get compromise-match2 merged into this repo so that you or I can possibly update some of the functions to use match2?

There are a few issues:

  • LICENSE is GPLv3 which conflicts with the MIT license this repo uses.
  • Module size will increase by the added dependencies
  • match2 uses a different albeit closer to standard regex format
  • match2 does not support everything match does but these features can be added or there are different ways to do them in match2

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/spencermountain/compromise/issues/765#issuecomment-663667133, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADBSKPTHEAZBGO4VLUXIKDR5HFIJANCNFSM4OJ3RYKQ .

kelvinhammond commented 4 years ago

@spencermountain any responses to the above comment?

spencermountain commented 4 years ago

hey Kelvin, sorry for the delay. I'd be happy to merge this into the plugins directory, and help to maintain it. Thanks for doing this. I don't know anything about licensing, and am not worried about keeping it GPL. yes, we'll need to do some work documenting the changes between it, and the default matcher. Really comprehensive docs will go a long way, i'd be happy to help.

I tried match2 on the compromise tests and a few-dozen were failing. It would be awesome if it passed the tests, so we could say it was purely a more-strict version of the match syntax. That may soften-up some of the edge-cases between the two systems. That's up to you though.

Apologies if this was addressed before, but if I understand it, match2 does not support this still, right?

(start|(maybe one|maybe two)) that's a kinda recursion, i suppose. Just double-checking. cheers!

kelvinhammond commented 4 years ago

@spencermountain I'll make a pull request for compromise with the plugin.

(start|(maybe one|maybe two))

match2 supports this.

spencermountain commented 4 years ago

amazing! sounds great. Let me know if I can help.

I've been using observablehq for docs, which is a normal static markdown thing, but also lets people play with, and fork js code examples. People seem to like it.

can't wait to play with this some more.

spencermountain commented 4 years ago

hey @kelvinhammond how's it goin on this? If you want, I could copy+paste things into this repo. We'd lose the git-history, but I'd cite your name on the plugin. Otherwise I can add a link to your repo on this readme or something. Whatever you prefer! cheers

kelvinhammond commented 4 years ago

@spencermountain sorry, I've not had much time to get to it. We'll lose git history anyway if I copy it to your repo so I'm fine with that. I'll try to copy it in and create a PR for it later today. I still need to work on documenting things though.

kelvinhammond commented 4 years ago

@spencermountain see PR #796