non-english parentheses tokenization

spencermountain / compromise

modest natural-language processing

http://compromise.cool

MIT License

11.47k stars 654 forks source link

non-english parentheses tokenization #896

Open kant01ne opened 2 years ago

kant01ne commented 2 years ago

compromise@13.11.4

Noticed the lib is removing some symbols unexpectedly when running the json methods.

Ex:

The same issue happens with foo[bar] or foo{bar}.

Expected behaviour:

nlp('foo{bar} foo').json() =>
[Object {]()
  text: "foo(bar) foo"
  terms: [Array(2) []()
  0: [Object {]()text: "foo(bar)", tags: Array(2), pre: "", post: " "}
  1: [Object {]()text: "foo", tags: Array(2), pre: "", post: ""}
]

spencermountain commented 2 years ago

hey @kant01ne - thanks for the good bug. yeah, the tokenizer attempts to put punctuation in the pre/post fields - but in this case, (if there is a matching bracket in the term text) i agree that it should retain the ending bracket. will add this on the list, for v14 cheers

kant01ne commented 2 years ago

Thanks 🙏 @spencermountain