Closed retorquere closed 2 months ago
Alternately, is there a way to distinguish between Grain-Based (to be treated as one term) and formulae-as-types (sensibly treated as 3)?
hey Emiliano, you're right. this is opaque, and should be made a lot clearer. There are a handful of prefixes that the tokenizer treats as single-word. These are mostly just made-up, and you can kill them off like this:
nlp.world().model.one.prefixes = {}
nlp(`The multi-part formulae-as-types notion of construction`).debug()
// [the, multi, part, formulae, as ...]
I will try to add these to the docs now. cheers
And the reverse? What if I want each hyphen-separated word treated as one unit?
you can add anything to the prefix model - it's just a key-value object:
nlp.world().model.one.prefixes.myprefix = true
cheers
The issue is I have no control over the input text, so the list of possible prefixes is infinite.
Given the text
compromise/one tokenizes multi-part as one term and formulae-as-types as 3. Can I make compromise tokenize all dash-separated words as it does multi-part? Or can I reconstruct that from the terms?
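On reconstructing from the terms: a rough sketch of one way this might work, not a confirmed compromise feature. It assumes each term is a plain object with `text` and `post` fields, and that a term split at a hyphen keeps that hyphen in its `post` (worth checking against your own output first). `mergeHyphenated` is a hypothetical helper name and the sample terms are written by hand for illustration, not taken from the library.

```javascript
// Hypothetical helper: glue terms that were split on a hyphen back together.
// Assumption: a term split at a hyphen has exactly "-" as its `post` value,
// while ordinary terms have whitespace or punctuation there.
function mergeHyphenated(terms) {
  const merged = []
  let joinNext = false
  for (const term of terms) {
    if (joinNext && merged.length > 0) {
      // previous term ended in a bare hyphen, so this term continues the same word
      merged[merged.length - 1] += '-' + term.text
    } else {
      merged.push(term.text)
    }
    // only a bare hyphen counts; " - " with spaces is treated as a real separator
    joinNext = (term.post || '') === '-'
  }
  return merged
}

// Hand-written sample terms, shaped like the assumption above
const terms = [
  { text: 'the', post: ' ' },
  { text: 'multi-part', post: ' ' },
  { text: 'formulae', post: '-' },
  { text: 'as', post: '-' },
  { text: 'types', post: ' ' },
  { text: 'notion', post: '' },
]

console.log(mergeHyphenated(terms))
// [ 'the', 'multi-part', 'formulae-as-types', 'notion' ]
```

This leaves the tokenizer alone and only regroups afterwards, so it does not depend on knowing the prefixes in advance. The cost is that any tags on the split terms would still belong to the pieces, not to the merged word.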