spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License
11.31k stars 645 forks

Greedy tag matching and punctuation #1103

Open amorfee opened 3 months ago

amorfee commented 3 months ago

Hello,

I've come across an issue with greedy tag matching and comma separation. The comma is included as part of the match, so multiple comma-separated tags are combined into one match.

[Image: Screenshot 2024-04-09 at 11 08 39]

This is likely expected behaviour for a tag such as #Place, but is there a way to force the comma to act as a word separator? Using .normalize() doesn't seem to help; it just removes the comma from the match.

Thank you

spencermountain commented 3 months ago

hey, yeah good question. There are a few ways you could do this.

You could split by whatever, then filter them down:

let parts = doc.splitAfter('@hasComma');
parts = parts.if('#Place')

I sometimes do an aggressive split and then join em up, which is probably a weirder process:

let parts = doc.split('#Place')
parts = parts.joinIf('#Place && @hasComma', '#Place')

dunno! cheers

amorfee commented 3 months ago

Thank you, .splitAfter() seems to do what we need.