Closed kelvinhammond closed 4 years ago
hey @kelvinhammond that sounds really neat. I'd love to hear more about it.
Yeah, the internal-api is a bit complicated. The groups concept especially so. You're right that buildFrom
is the method I use to compose new Text objects, with a shared context and data.
the best way to get a feel for groups and buildfrom may be to place a console.log here. Groups is an object - I think numbers will also work as the keys, like you mentioned.
yeah, if it is not set correctly, the linked-list will blow-up when you call .text()
. Crawling that is the most expensive operation we have, so it's not possible to validate it at create.
Can help further if you have more questions
cheers
@spencermountain
So far I've:
I still need to implement:
$
When it comes to compromise currently in my scratch I'm doing:
let ast = null;
// parses the regex and set the ast.prog to a "program" which runs in the vm
ast = parse("remind (you|me|to)+");
const doc = nlp('remind me to work on a better match');
// matches = { found: true, saved: [terms] }. this will change
const matches = pikevm(ast.prog, doc.list[0].terms());
console.log(matches?.saved?.map(x => x.text));
outputs:
[ 'remind', 'me', 'to' ]
with a non greedy +?
modifier ast = parse("remind (you|me|to)+?");
produces [ 'remind', 'me' ]
What is the correct way to generate a document with groups should I create a plugin that adds match2
or something which matches words and returns a list of terms from the termList
?
Are there any other features you can think of?
HEY COOL!! this is a very neat idea.
... is it fast?? Is that why you're building it? How much bigger is the build with chevrotain?
to be honest, if the tests pass, it works. That's as close there is to a formal spec or anything. There are a lot of weird edge-cases with greedy optionals that compromise doesn't handle properly. I'm sure you've already found them.
Okay, so to answer your question,
const doc = nlp('remind me to work on a better match')
let m = doc.match('remind [<me>me] to')
// get a flat list of Term objects
let terms = m.termList()
// [{Term}, {Term}, {Term}]
let groups = m.groups()
// {me: {Phrase} }
//okay, create a Phrase obj from our terms & groups:
let phrase = doc.list[0].buildFrom(terms[0].id, terms.length, groups)
// create a Text obj from our phrase
let doc2 = doc.buildFrom([phrase])
doc2.debug()
// ensure group is still there:
m.groups('me').debug()
cheers
ya, looking at that, I know that's a bit laborious, but that's how i'd do it.
also, if it could make your life easier, we could always make changes to the match syntax. If it meant formalizing it, as a proper dsl, I wouldn't mind dropping some features.
I've tried a few times to make some kind of abstraction for match-syntax - so you can write them as strings, but then if performance is important, turn it into a formal json object.
...If it's 1-to-many, we can optimize the match statement,
if it's many-to-1, we can optimize the document...
any thoughts or advice you have is welcomed
Updates:
Features:
Questions:
~soft~
support but I need to know more about how that works / how to implement a match for that on a single term given the soft
part, should probably change the syntax on this though.match2
any objections?const reStr = 'remind (me)? to? (?P<what>.+?) (?P<when>#Date+)$';
const regex = nlp.regexp(reStr);
//const doc = nlp("remind me to wash the dishes in 5 minutes").match2(reStr); // also works
const doc = nlp("remind me to wash the dishes in 5 minutes").match2(regex);
if(doc.found) {
console.log(doc.text());
console.log(doc.groups());
}
So far I've done this to run the matcher.
const m = doc.buildFrom(doc.list.map(phrase => {
const { found, saved, groups } = pikevm(ast.prog, phrase.terms());
const namedGroups = {}; // TODO: munge to correct format
return found ?
phrase.buildFrom(saved[0].id, saved?.length ?? 0, namedGroups)
: null;
}).filter(p => p !== null));
Attached is a screenshot of the syntax diagram. I haven't implemented negative groups yet, might do later.
I have groups and group parsing done, now I just need to:
compile
usageconst m = doc.buildFrom(doc.list.map(phrase => {
const { found, saved, groups } = pikevm(ast.prog, phrase.terms());
console.log("found:", found);
console.log("saved:", saved);
console.log("groups:", groups);
const namedGroups = Object.values(groups).reduce((arr, g) => ({
...arr,
[parseInt(g.id)]: {
group: g?.name ?? `${g.id}`,
start: g.saved?.[0].id ?? 0,
length: g.saved?.length ?? 0
}
}), {});
console.log(namedGroups);
return found ?
phrase.buildFrom(saved[0].id, saved.length, namedGroups)
: null;
}).filter(p => p !== null));
@spencermountain What are your arguments against the GPL license?
I'm thinking of releasing this as a plugin.
I could include it into this repo under plugins/match2
with its own plugins/match2/LICENSE
file which would be GPL.
Or I could make a separate repo.
I may be convinced to use the MIT license too though. What are your thoughts?
Holiday today in Canada, will give this proper attention as soon as I can. Thanks for the information.
On Tue., Jun. 30, 2020, 12:49 a.m. Kelvin, notifications@github.com wrote:
I have groups and group parsing done, now I just need to:
- figure out where the code goes, library or add to compromise
- clean things up a bit
- create a regex class for compile usage
- make a match function which takes compiled regex or string regex which it then compiles.
- write some tests
const m = doc.buildFrom(doc.list.map(phrase => { const { found, saved, groups } = pikevm(ast.prog, phrase.terms()); console.log("found:", found); console.log("saved:", saved); console.log("groups:", groups); const namedGroups = Object.values(groups).reduce((arr, g) => ({ ...arr,
group: g?.name ?? `${g.id}`, start: g.saved?.[0].id ?? 0, length: g.saved?.length ?? 0 } }), {}); console.log(namedGroups); return found ? phrase.buildFrom(saved[0].id, saved.length, namedGroups) : null;
}).filter(p => p !== null));
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/spencermountain/compromise/issues/765#issuecomment-651532156, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADBSKINB5G3L22GCVQWH4TRZFVF7ANCNFSM4OJ3RYKQ .
@spencermountain plugin deployed here https://www.npmjs.com/package/compromise-match2 I could merge it into compromise
hey, so I finally got a chance to check this out in detail. This is very neat stuff. I have a bunch of questions, but they are minor, and I am very excited and impressed by this plugin.
1) do you have any sense about its performance, compared with regular match? I tried a few things and it seemed to be similar? I can help throw some bigger documents at it, if that would be helpful.
2) oof - I'm not the right person to ask about licensing details. That stuff goes out my brain when I blink. You're free to use anything you're comfortable with.
3) found 2 errors, I think:
// #1
nlp('so ralf and really eats the glue').match2('* [eats] the', 0)
#2
nlp('spencer other').match2('!(cool|spencer)')
it was cool that it also found an invalid match statement in the tests!
4) the min build seems to be 130kb, which is cool - are there any quick-wins to shake-down chevrotrain?
5) it would be cool to support the AND syntax, which is often-requested
let m = nlp('june and july cool').match2('(#Date && july)')
m.text() // july
6) very-happy to put this in ./plugins, and list it in the docs. We'll need to add details to the docs.
7 (last) ) goodmatch
? smartmatch
? I welcome all slams to the original match function :)
thank you!!
@spencermountain
[eat]
is invalid it should be more like .* (eats) the
. second one would be (?!cool|spencer)
a. .
matches anything, *
modifier zero or moe, (eats)
group which is captured and matches word eats
.
b. the (?!
starts a negative lookahead assertion which does not capture or advance position in search
c. What does the second arg do, return that group number / name? I can implement it.(?=#Date) July
.
a. I found a bug in the not start of matching that I've fixed in 1.0.1 which is live now. The above lookahead should work with this now. Other matches not starting at the first two terms should now work too.benchmark:
onst Benchmark = require('benchmark');
const nlp = require('compromise');
const { default: compromise_match2, NLPRegexP } = require('./dist/compromise-match2.js');
nlp.extend(compromise_match2);
const suite = new Benchmark.Suite;
let text = "Improved own provided blessing may peculiar domestic. Sight house has sex never. No visited raising gravity outward subject my cottage mr be. Hold do at tore in park feet near my case. Invitation at understood occasional sentiments insipidity inhabiting in. Off melancholy alteration principles old. Is do speedily kindness properly oh. Respect article painted cottage he is offices parlors.";
text = text.replace('.', '');
const tofind = 'sentiments insipidity inhabiting';
const regex = nlp.compileRegex(tofind);
suite
.add('nlp match', () => {
nlp(text).match(tofind);
})
.add('match2', () => {
nlp(text).match2(tofind);
})
.add('compiled match2', () => {
nlp(text).match2(regex);
})
.on('cycle', function (event) {
console.log(String(event.target));
})
.on('complete', function (event) {
console.log('Fastest is ' + this.filter('fastest').map('name'));
})
.run();
cool. That sounds great. Thanks for clarifying that. It would help to have a list of syntax-differences between the two implementations. It's cool to have them different, that's no issue. We should try to be as clear as we can about these tiny differences - people are gonna get caught between them.
You're right that the 2nd param in .match()
is a group name, so you can return just the group -
doc.match('before [want] after', 0).text()
// 'want'
I really struggled with the ()
vs []
syntax. I'm not sure it ended-up being clear.
(
was originally used to support optional 'or' logic, so we used square-brackets for capture-groups.
... then regex started supporting named capture-groups. Now the situation is a bit of a mess, and it's almost-certainly too-late to release a big breaking change to the match syntax.
I can help track-down tiny-differences between the two implementations - does match2 support recursive matching? .... recursive groups? cheers!
The syntax I create is meant to match more closely to regular expression, python's re module has a good doc on the syntax.
I'll update the match2
to support a second param for returning the first group. It does support nested groups but similar to other regex engines if a group is matched more than one it only returns the last match for that group.
Patterns like this ((remind|remember) (me|you|.) to? do? (?P<what>.+) (?P<when>#Date+))
are supported which will return
for remind me to reply to @spencermountain today
:
remind me to reply to @spencermountain today
remind
me
reply to @spencermountain
today
I'll try to work on some more docs / examples. I have some of them outlined in the README but it could definitely be clearer and should provide examples.
@spencermountain Here I've release version 1.1.0 that fixes things, adds 100% code coverage, and a few fixes.
The tests, starting here can be used as a reference for now until I can provide better examples.
https://github.com/catalogm/compromise-match2/commit/70fef68eb67725eb0f006954c9efc99c25821a7c
@spencermountain I still need to write some better docs but I added support for methods using the @
(ex: @hasComma
).
The doc.lists()
fails when the items contain spaces, likely because the match
function does not work well with spaces. What needs to happen to get compromise-match2
merged into this repo so that you or I can possibly update some of the functions to use match2
?
There are a few issues:
match2
uses a different albeit closer to standard regex formatmatch2
does not support everything match
does but these features can be added or there are different ways to do them in match2
Agh, that’s great. Happy to merge this in! Can you add me as an owner on npm, if I’m gonna help maintain it?
On Fri, Jul 24, 2020 at 2:12 PM Kelvin notifications@github.com wrote:
@spencermountain https://github.com/spencermountain I still need to write some better docs but I added support for methods using the @ (ex: @hasComma).
The doc.lists() fails when the items contain spaces, likely because the match function does not work well with spaces. What needs to happen to get compromise-match2 merged into this repo so that you or I can possibly update some of the functions to use match2?
There are a few issues:
- LICENSE is GPLv3 which conflicts with the MIT license this repo uses.
- Module size will increase by the added dependencies
- match2 uses a different albeit closer to standard regex format
- match2 does not support everything match does but these features can be added or there are different ways to do them in match2
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/spencermountain/compromise/issues/765#issuecomment-663667133, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADBSKPTHEAZBGO4VLUXIKDR5HFIJANCNFSM4OJ3RYKQ .
@spencermountain any responses to the above comment?
hey Kelvin, sorry for the delay. I'd be happy to merge this into the plugins directory, and help to maintain it. Thanks for doing this. I don't know anything about licensing, and am not worried about keeping it GPL. yes, we'll need to do some work documenting the changes between it, and the default matcher. Really comprehensive docs will go a long way, i'd be happy to help.
I tried match2 on the compromise tests and a few-dozen were failing. It would be awesome if it passed the tests, so we could say it was purely a more-strict version of the match syntax. That may soften-up some of the edge-cases between the two systems. That's up to you though.
Apologies if this was addressed before, but if I understand it, match2 does not support this still, right?
(start|(maybe one|maybe two))
that's a kinda recursion, i suppose. Just double-checking.
cheers!
@spencermountain I'll make a pull request for compromise with the plugin.
(start|(maybe one|maybe two))
match2 supports this.
amazing! sounds great. Let me know if I can help.
I've been using observablehq for docs, which is a normal static markdown thing, but also lets people play with, and fork js code examples. People seem to like it.
can't wait to play with this some more.
hey @kelvinhammond how's it goin on this? If you want, I could copy+paste things into this repo. We'd lose the git-history, but I'd cite your name on the plugin. Otherwise I can add a link to your repo on this readme or something. Whatever you prefer! cheers
@spencermountain sorry, I've not had much time to get to it. We'll lose git history anyway if I copy it to your repo so I'm fine with that. I'll try to copy it in and create a PR for it later today. I still need to work on documenting things though.
@spencermountain see PR #796
Hello,
I'm building my own match function, probably add it as an extension because it requires a dependency which will increase the code size.
I have so far figured out that I can use
termList
to get all the terms to match against and now I'm trying to save the groups. How do I set the groups for a matched term, I need to set multiple by name, number, and the total match as group0
? The code for thematch
function is a bit hard to follow.Also turns out using a subset of
termList
withbuildFrom
to build a doc doesn't actually work as expected, it fails when I runtext()
on the new document.