spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License
11.49k stars 655 forks

Treating all dash-separated words as single words #1143

Closed retorquere closed 2 months ago

retorquere commented 2 months ago

Given the text

The multi-part  formulae-as-types notion of construction

compromise/one tokenizes multi-part as one term and formulae-as-types as 3. Can I make compromise tokenize all dash-separated words as it does multi-part? Or can I reconstruct that from the terms?

retorquere commented 2 months ago

Alternatively, is there a way to distinguish between Grain-Based (to be treated as one term) and formulae-as-types (sensibly treated as 3)?

spencermountain commented 2 months ago

hey Emiliano, you're right. this is opaque, and should be made a lot clearer. There are a handful of prefixes that the tokenizer keeps attached, so a hyphenated word starting with one of them stays a single term. The list is mostly just made-up, and you can clear it like this:

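import nlp from 'compromise/one' // the build mentioned above
// empty the prefix table so no hyphenated word is kept whole
nlp.world().model.one.prefixes = {}
nlp(`The multi-part  formulae-as-types notion of construction`).debug()
// [the, multi, part, formulae, as ...]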

I will try to add these to the docs now. cheers
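
For readers following along, here is a toy model of the prefix rule described above. It is not compromise's actual source: `splitHyphenated` is a hypothetical helper, and the table only mirrors the key-value shape of `model.one.prefixes`. It assumes `multi` is one of the built-in prefixes, as the behaviour reported in this thread suggests.

```javascript
// Toy model only: a hypothetical helper, not the library's tokenizer.
// The table mirrors the key-value shape of model.one.prefixes.
const prefixes = { multi: true } // assumption: 'multi' is a built-in prefix

function splitHyphenated(word, prefixTable) {
  const parts = word.split('-')
  // a single part means there was no hyphen to split on
  if (parts.length === 1) return [word]
  // a known prefix keeps the whole word together as one term
  const first = parts[0].toLowerCase()
  if (Object.prototype.hasOwnProperty.call(prefixTable, first)) {
    return [word]
  }
  // otherwise every hyphen-separated part becomes its own term
  return parts
}

console.log(splitHyphenated('multi-part', prefixes)) // [ 'multi-part' ]
console.log(splitHyphenated('formulae-as-types', prefixes)) // [ 'formulae', 'as', 'types' ]
console.log(splitHyphenated('multi-part', {})) // [ 'multi', 'part' ]
```

The last call shows why emptying the table splits everything: with no prefixes registered, nothing is kept whole.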

retorquere commented 2 months ago

And the reverse? If I want every hyphen-separated word treated as one unit?

spencermountain commented 2 months ago

you can add anything to the prefix model - it's just a key-value object:

// 'myprefix' is a placeholder for the first part of a hyphenated word
nlp.world().model.one.prefixes.myprefix = true

cheers

retorquere commented 2 months ago

The issue is I have no control over the input text, so the list of possible prefixes is infinite.
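
One possible workaround, sketched under assumptions: since the prefix model is just a key-value object, the prefixes could be collected from each input text before it is parsed, so the list never has to be known in advance. `collectHyphenPrefixes` below is a hypothetical helper, not part of compromise. Whether the tokenizer compares prefixes case-sensitively is not stated in this thread, so the sketch stores both the original and the lowercased form.

```javascript
// Hypothetical helper: gather the first part of every hyphenated word in a text.
function collectHyphenPrefixes(text) {
  const found = {}
  // runs of letters/digits joined by ASCII hyphens, e.g. "formulae-as-types"
  const hyphenated = text.match(/[\p{L}\p{N}]+(?:-[\p{L}\p{N}]+)+/gu) || []
  for (const word of hyphenated) {
    const first = word.split('-')[0]
    found[first] = true
    // casing rules are an unknown here, so keep a lowercased copy too
    found[first.toLowerCase()] = true
  }
  return found
}

const text = 'The multi-part  formulae-as-types notion of Grain-Based construction'
const extra = collectHyphenPrefixes(text)
console.log(Object.keys(extra)) // [ 'multi', 'formulae', 'Grain', 'grain' ]

// Untested assumption: merging these in before parsing keeps each word whole.
// Object.assign(nlp.world().model.one.prefixes, extra)
// nlp(text).debug()
```

Two caveats: the regex only handles the plain ASCII hyphen, and the object returned by `nlp.world()` is presumably shared, so prefixes added for one text would also apply to later ones.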