spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License

True Casing #1091

Closed ItIsSeven closed 4 months ago

ItIsSeven commented 4 months ago

Hello, I've spent a fair bit of time trying to work out how to true-case text with this library. I've tried normalizing and the casing methods, but can't figure out whether this is something the library supports, or whether it could be added to the API in the future.

I want to be able to input an all-uppercase sentence and have it reverted to its original case, where verbs/nouns/names/places etc. are correctly capitalized. I haven't seen any mention of this in the docs, discussions, or issues of this repo.

spencermountain commented 4 months ago

ooh, that's a very interesting idea. I would do something like this: https://runkit.com/spencermountain/65ca22696b87a50008841654

obviously that's not full APA style or anything, but it should get you started

cheers
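
(Editorial aside: the linked notebook is not reproduced here, so the following is not Spencer's code. It is only a minimal sketch of the general true-casing idea from the question: lowercase everything, re-capitalize known proper nouns, then capitalize sentence starts. The `trueCase` helper and its `properNouns` set are hypothetical; in practice that set would presumably come from a tagger such as compromise rather than being written by hand.)

```javascript
// Hypothetical helper, not part of compromise.
// `properNouns` is a caller-supplied Set of lowercase words to capitalize.
function trueCase(text, properNouns) {
  const lower = text.toLowerCase()

  // Capitalize known proper nouns and the pronoun "I" (including "I'm", "I'll").
  const withNouns = lower.replace(/[a-z']+/g, (w) => {
    if (w === 'i' || w.startsWith("i'") || properNouns.has(w)) {
      return w.charAt(0).toUpperCase() + w.slice(1)
    }
    return w
  })

  // Capitalize the first letter of the text and of each sentence after . ! or ?
  return withNouns.replace(
    /(^\s*|[.!?]\s+)([a-z])/g,
    (match, lead, ch) => lead + ch.toUpperCase()
  )
}

const result = trueCase(
  'THE GOVERNMENT OF CANADA IS AMAZING. I LIVE IN TORONTO.',
  new Set(['canada', 'toronto'])
)
console.log(result) // "The government of Canada is amazing. I live in Toronto."
```

A fixed word list is the weak point of this sketch: it cannot tell "Turkey" from "turkey", which is where context rules like the ones discussed further down come in.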

spencermountain commented 4 months ago

this is a cool project, and let me know if you'd like to turn it into a plugin, once it gets working. There are a bunch of edge-cases, i'm sure, that would need to get worked-out, but fun stuff.

MarketingPip commented 4 months ago

@ItIsSeven & @spencermountain - I made something a while back for doing this. Tho it's not battle-tested & needs LOTS more rules + Spencer probably has a better way of doing things, with more knowledge of the Compromise API.

But I'll open something up shortly & if either of you wanna pluck away at it, feel free.

MarketingPip commented 4 months ago

@spencermountain - found this via old issue Cap Rule Set.

Could be useful. (if you decide you wanna pick this up)

MarketingPip commented 4 months ago

@spencermountain - can you possibly give me an idea of proper usage for using groups like this?

.match("[government|president] of [#Country]")

Not sure how to use | with groups to simplify rules. We could then go through ORG words etc., do something like this, and then apply some context rules.

{pattern:"[government|president] of [#Country]", matches:2}

spencermountain commented 4 months ago

hey - sure no prob.

the OR logic uses () brackets, like foo (bar|baz)

capture groups use [] brackets, with an optional name, like doc.match('foo [<two>bar]', 'two')

these two features can be combined, so that you grab either 'bar' or 'baz', like so:

doc.match('foo [(bar|baz)]', 0)
//or
doc.match('foo [<two>(bar|baz)]', 'two')

happy to help if I can clear things up further. cheers

spencermountain commented 4 months ago

ps - yeah, that tagging file is really neat, isn't it?

NNP NN PREV2WD ESTATE seems like it would map to #ProperNoun #Noun estate in compromise jargon. I'm sure there's a lot we could learn from that dataset

MarketingPip commented 4 months ago

@spencermountain - could you give me a better idea of how to do this?

import nlp from "https://esm.sh/compromise"

function CapitalizeWords(text){
  let doc = nlp(text)
  let finalText = null;

  function applyRule(rule){
    const groups = doc.match('[(government|president)] of [#ProperNoun]').groups()
    for(let item in groups){
      finalText = finalText.replace(groups[item].text(),nlp(groups[item].text()).toTitleCase().text())
    }
    return finalText
  }

  function goThroughRules(){
    const rules = ['[(government|president)] of [#ProperNoun]']
    finalText = doc.text()

    for(let item in rules){
      finalText = applyRule(rules[item])
      console.log(finalText)
    }
    return finalText
  }

  return goThroughRules()
}
console.log(CapitalizeWords("The government of canada is amazing")) //

so we can easily do this?

console.log(CapitalizeWords("The government of canada is amazing and so is the president of america")) //

When currently using:

    const rules = ['[(government|president)] of [#ProperNoun]']

It doesn't work. (I assume we have to write them as single rules?) If so, could this be a feature request for the matcher?

Then we could easily build something based on the current org words etc.

Plus, as said, this should help big time with NLP: title-case first, then check for common nouns etc., and have tags for title-case words only. Example: House of commons.

spencermountain commented 4 months ago

sure, i'd do something like this:


let rules = [
  { match: 'house of [.]', group: 0 }
]

rules.forEach(obj => {
  let m = doc.match(obj.match, obj.group)
  if (m.found) {
    m.toTitleCase()
  }
})

cheers
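
(Editorial aside: the snippet above assumes a `doc` already exists. A minimal sketch of wrapping it as a reusable function follows. `applyTitleCaseRules` and the returned `fired` list are hypothetical additions for debugging a rule set, and the second rule is only an illustration built from the patterns discussed in this thread. The function uses only `.match(pattern, group)`, `.found`, and `.toTitleCase()`, as in the snippet above.)

```javascript
// `doc` is assumed to be a compromise document, e.g. the result of nlp(text).
// Each rule is { match: <pattern string>, group: <capture group to title-case> }.
// Returns the patterns that matched, which helps when debugging a rule set.
function applyTitleCaseRules(doc, rules) {
  const fired = []
  rules.forEach((rule) => {
    const m = doc.match(rule.match, rule.group)
    if (m.found) {
      m.toTitleCase() // relies on the match editing the parent doc, as in the snippet above
      fired.push(rule.match)
    }
  })
  return fired
}

const rules = [
  { match: 'house of [.]', group: 0 },
  { match: '[(government|president)] of #ProperNoun', group: 0 },
]

// Usage, assuming compromise has been loaded as `nlp`:
//   const doc = nlp('the house of commons met the president of canada')
//   applyTitleCaseRules(doc, rules)
//   console.log(doc.text())
```

Keeping the rules as plain data means new patterns can be added without touching the loop.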

MarketingPip commented 4 months ago

@spencermountain my bad - this is what I was looking for. Keeping this here for reference for me, you and @ItIsSeven


import nlp from "https://esm.sh/compromise"

function CapitalizeWords(text){
  let doc = nlp(text)
  let finalText = null;

  // Title-case every word captured by the [] groups of one rule
  function applyRule(rule){
    const groups = doc.match(rule).groups()
    for(let item in groups){
      // one json entry per match of this capture group
      const words = groups[item].json()

      for(let entry of words){
        const word = entry.text
        // a string pattern replaces only the first remaining occurrence
        finalText = finalText.replace(word,nlp(word).toTitleCase().text())
      }
    }
    return finalText
  }

  function goThroughRules(){
    const rules = ['[(government|president)] of [#ProperNoun]']
    finalText = doc.text()

    for(let item in rules){
      finalText = applyRule(rules[item])
      console.log(finalText)
    }
    return finalText
  }

  return goThroughRules()
}
console.log(CapitalizeWords("The government of canada is amazing and the president of canada but the president is not"))
// Outputs: "The Government of Canada is amazing and the President of Canada but the president is not"

Spencer - if you wanna go at this, I'm down. Now that I've finally figured this out, I've been training an AI to generate a rule set for it, to try and actually contribute something useful to this project instead of my sh*tty issues, which you probably wanna punch me in the face for lol!
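
(Editorial aside: one caveat about the CapitalizeWords snippet above. It rewrites the text with `String.prototype.replace` and a string pattern, which replaces only the first occurrence. That works for the example sentence because the matched "president" comes before the unmatched one, but the wrong word would be capitalized if the order were reversed. A minimal sketch of an offset-based alternative follows. `titleCaseWord` and `titleCaseAt` are hypothetical helpers, and the offset is found with `indexOf` purely for illustration. I believe compromise can report character offsets through its json output options, but treat that as an assumption to check against the docs.)

```javascript
// Uppercase the first character of a word (a simple stand-in for .toTitleCase()).
function titleCaseWord(word) {
  return word.charAt(0).toUpperCase() + word.slice(1)
}

// Title-case the word at a known character offset. The length does not change,
// so the offsets of any later matches stay valid.
function titleCaseAt(text, start, length) {
  const word = text.slice(start, start + length)
  return text.slice(0, start) + titleCaseWord(word) + text.slice(start + length)
}

const text = 'the president met the president of canada'

// A string pattern only replaces the FIRST occurrence, here the wrong one.
const naive = text.replace('president', titleCaseWord('president'))
console.log(naive) // "the President met the president of canada"

// Hypothetical offset: in practice it would come from the matcher,
// here it is located with indexOf for illustration only.
const start = text.indexOf('president of canada')
const targeted = titleCaseAt(text, start, 'president'.length)
console.log(targeted) // "the president met the President of canada"
```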