spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License

Tagging mixed number as #Value #1085

Closed track0x1 closed 4 months ago

track0x1 commented 5 months ago

Mixed numbers are a common way to express a value, like ‘1-1/2 cups’, and are sometimes written without the hyphen separator: ‘1 1/2 cups’. When I used compromise v11 I was able to write a plugin with a regex that tagged these as #Value, but it doesn’t seem to work in the latest release. Because they’re so common, should they be tagged out of the box? My goal is to match all types of values (including mixed numbers) for capturing.
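For reference, here is a rough sketch of the kind of pattern I mean. It is plain JavaScript on the raw string, not a compromise plugin: `MIXED_NUMBER` and `findMixedNumbers` are names made up for illustration, and wiring the result into the tagger is left out because that part differs between versions.

```javascript
// Hypothetical helper, not part of compromise: finds mixed numbers in raw text.
// Matches a whole number, then a hyphen or whitespace, then a simple fraction.
const MIXED_NUMBER = /\b(\d+)(?:-|\s+)(\d+)\/(\d+)\b/g

function findMixedNumbers(text) {
  const found = []
  for (const m of text.matchAll(MIXED_NUMBER)) {
    const [raw, whole, num, den] = m
    // skip a zero denominator, e.g. '1 1/0'
    if (Number(den) === 0) continue
    found.push({ raw, value: Number(whole) + Number(num) / Number(den) })
  }
  return found
}

console.log(findMixedNumbers('add 1-1/2 cups flour and 2 1/4 cups milk'))
```

Note that the hyphenated form is ambiguous on its own: a pattern like this will also match text where ‘3-1/2’ was meant as a subtraction.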

spencermountain commented 5 months ago

hey Tom, yep - if I remember, we still do some of this number-range stuff out of the box, but shied away from the parts that resembled algebra or subtraction. This is a real doozie, and I agree it's a cool thing to opt-in to, and we should support any unambiguous 'and a half' stuff as much as we can.

You can see some of the fraction tests we pass, and the ones we avoid, here. PRs welcome if you can improve on it in any way.

ps i enjoyed your blog. cheers

track0x1 commented 5 months ago

@spencermountain Thank you Spencer! I just noticed something that looks like a bug. When 15-ounce is wrapped in parentheses, it's tagged as a single term and gets the wrong tags as a result.

> nlp('15-ounce (15-ounce)').debug()

  ┌─────────
  │ '15'       - Value, Cardinal, NumericValue, Hyphenated
  │ 'ounce'    - Noun, Unit, Singular, Hyphenated
  │ '15-ounce'  - Infinitive, Verb, PresentTense

sidebar: is there a way we can convert verbose number ranges (2 to 3) to hyphenated number ranges (2-3)? that would enable me to tap into the same #NumberRange tag for a match.

> nlp('2 to 3 people').debug()

  ┌─────────
  │ '2'        - Value, Cardinal, NumericValue
  │ 'to'       - Conjunction
  │ '3'        - Value, Cardinal, NumericValue
  │ 'people'   - Noun, Plural, Actor

> nlp('2-3 people').debug()

  ┌─────────
  │ '[2]'      - Value, Cardinal, NumericValue, NumberRange
  │ '[to]'     - Conjunction, NumberRange
  │ '[3]'      - Value, Cardinal, NumericValue, NumberRange
  │ 'people'   - Noun, Plural, Actor

edit: also happy to split these concerns into separate issues/discussions if you prefer

spencermountain commented 4 months ago

hey Tom, apologies for the delay. yeah, there's an ugly way:

let doc = nlp('2 to 3 people')
let { before, prep } = doc.match('[<before>#Value] [<prep>to] #Value').groups()
before.post('') // remove the whitespace after '2'
doc.match(prep).replaceWith('-').post('') // swap 'to' for '-', then remove the whitespace after it
console.log(doc.text()) // 2-3 people

in short, some of this is weird. You may benefit from using replace() with some term methods like @hasDash or @hasHyphen
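If the tags don't matter for this step, another option is to normalize the raw string before it ever reaches nlp(). This is only a sketch in plain JavaScript, not a compromise API: `hyphenateRanges` is a made-up name, and it only handles digit forms like '2 to 3', not spelled-out numbers like 'two to three'.

```javascript
// Hypothetical pre-processing step, not part of compromise:
// rewrite 'N to M' as 'N-M' in the raw string before parsing it.
function hyphenateRanges(text) {
  // a number (optionally with decimals), the word 'to', then another number
  return text.replace(/\b(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)\b/g, '$1-$2')
}

console.log(hyphenateRanges('2 to 3 people')) // 2-3 people
```

The trade-off is that a regex can't see context, so it rewrites every 'number to number' it finds, whereas the match() version above only touches terms the tagger has marked as #Value.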

This nlp('15-ounce (15-ounce)').debug() one is a doozie. Haven't got it yet, but will.

spencermountain commented 4 months ago

hey @track0x1 , this is fixed in 14.12.0:

let doc = nlp('10-ounce (12-ounce)')
doc.terms().length // 4

cheers

track0x1 commented 4 months ago

You're the best! Thank you