spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License
11.43k stars, 656 forks

Feature Request: Add data to a term #1066

Closed thegoatherder closed 7 months ago

thegoatherder commented 9 months ago

I'm looking for some kind of method on a View which will enable me to attach some data to words, the same way that I can attach tags. Does anything like this already exist?

Here's a rough sketch:

const doc = nlp('Simon says see you next September')
const match = doc.match('#Month')
const someObj = doSomethingComplexWithMyMatch(match)
match.tag('SomeTag')
// we add some custom data to our Compromise View instance
match.addData(someObj)

// ... somewhere else in the app

const match2 = doc.match('#SomeTag')
// recall the data somehow
const myData = match2.out('data')

I think this could be an extremely powerful feature for our project and I'm sure many others...!

Example Use Case:

spencermountain commented 9 months ago

ya, really cool idea. Agree that tags are limited as data, and have taken us really far - (probably too far). I also like the idea for storing captured metadata, like date metadata, within reach of compromise somehow.

Imagine if we could do something in match queries with the json like:

let doc=nlp('paul, john lennon and ringo starr')
doc.match('ringo starr').payload({roles:['drummer', 'singer'], hair:'long'})

//then later...
doc.match("and {roles:'drummer'}") //or something

Been stuck, forever, on this same dilemma - where to store information about groups of words. The good news is that they are just javascript objects, and we can stick stuff anywhere.

View objects are transient. Every method returns a new one, which would need to marshal any data payload around with every interaction. Old views would have stale payloads. I don't think it's the right place for this. Putting payloads in Term objects would also be the wrong place - 'ringo' and 'starr' would need dangling or duplicated data between them.
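To make the staleness concern concrete, here is a minimal sketch in plain JavaScript. It does not use compromise at all: `FakeView`, `first()` and `withPayload()` are hypothetical stand-ins for any object whose methods return new objects instead of mutating the original.

```javascript
// Hypothetical stand-in for a View: every method returns a NEW object.
// This is not compromise code, only an illustration of the problem.
class FakeView {
  constructor(words, payload = null) {
    this.words = words
    this.payload = payload
  }
  // returns a new view; the payload survives only because it is copied here
  first() {
    return new FakeView(this.words.slice(0, 1), this.payload)
  }
  // returns a new view carrying the data; the original view is untouched
  withPayload(data) {
    return new FakeView(this.words, data)
  }
}

const original = new FakeView(['ringo', 'starr'])
const tagged = original.withPayload({ roles: ['drummer'] })

// the old view never sees the data that was attached later
const staleView = original.payload
// derived views only keep the data if every method remembers to copy it
const carried = tagged.first().payload
```

Every method on such an object would have to forward the payload, and any view created before the payload was attached stays out of date.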

Open to it, just haven't got it clear yet.

thegoatherder commented 9 months ago

@spencermountain

Just throwing some ideas around in case they offer any inspiration... far from a solution...!

What if there was some new layer like compromise/four with a method like .commit() that could commit a View and store it separately in the document?

const someObj = {} // my payload
const view = nlp('See you next September').match('next #Month').commit() 
view.payload(someObj)

.commit() could hash the Term IDs to generate a deterministic ID for the View. The same match would then always resolve to the same commit, so a committed View could later be updated with new data if needed.

doc: {
  commits: {
     "somehash1": {
       terms: [],  // list of Terms
       payload: {} // the payload data
     }
  }
}
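A rough sketch of how such a commit store might behave, in plain JavaScript with no compromise dependency. Everything here is an assumption for illustration: `hashIds`, `commit` and `getPayload` are hypothetical names, the hash is a simple djb2-style string hash chosen only because it is deterministic, and the term IDs are copied from the .json() output quoted in this thread.

```javascript
// Hypothetical commit store, keyed by a deterministic hash of term IDs.
// None of these names are part of compromise.

// djb2-style string hash over the joined IDs, rendered as hex
function hashIds(ids) {
  const str = ids.join(',')
  let h = 5381
  for (let i = 0; i < str.length; i += 1) {
    h = ((h * 33) ^ str.charCodeAt(i)) >>> 0
  }
  return h.toString(16)
}

const commits = new Map()

// store (or overwrite) the payload for a given list of term IDs
function commit(termIds, payload) {
  const key = hashIds(termIds)
  commits.set(key, { terms: termIds, payload })
  return key
}

// recompute the key from the same term IDs to recall the payload
function getPayload(termIds) {
  const found = commits.get(hashIds(termIds))
  return found ? found.payload : undefined
}

// the same Term ('september') takes part in two independent commits
const nextMonth = ['next|00700002C', 'september|00800003V']
const monthOnly = ['september|00800003V']

commit(nextMonth, { a: 1 })
commit(monthOnly, { a: 2 })
```

Because the key depends only on the term IDs, committing the same match again overwrites its entry instead of adding a second one. A real implementation would also need a plan for hash collisions and for term IDs that change when the document is edited.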

This would allow Terms to hold different data in different contexts. For example, a match of next #Month and a match of #Month could both attach data to the Term September, but independently. A user could then:

const payload1 = { a: 1 } 
const payload2 = { a: 2 } 
const doc = nlp('See you next September')
doc.match('next #Month').commit().payload(payload1)
doc.match('#Month').commit().payload(payload2)

// ... later in the app
doc.match('next #Month').payload()  // Generate checksum for this match and use it to lookup payload1 data from the commit
doc.match('#Month').payload()  // Generate checksum for this match and use it to lookup payload2 data from the commit

The data could also be output by the .json() function:

doc.match('next #Month').json()
[
  {
    "text": "next september",
    "terms": [
      {
        "text": "next",
        "pre": "",
        "post": " ",
        "tags": [
          "Adjective"
        ],
        "normal": "next",
        "index": [
          0,
          2
        ],
        "id": "next|00700002C",
        "dirty": true,
        "chunk": "Noun"
      },
      {
        "text": "september",
        "pre": "",
        "post": "",
        "tags": [
          "Date",
          "Noun",
          "Month"
        ],
        "normal": "september",
        "index": [
          0,
          3
        ],
        "id": "september|00800003V",
        "chunk": "Noun",
        "dirty": true
      }
    ],
    "payload": {}   // <-- my payload
  }
]

I think, but am not sure, that this might also support your (excellent!) suggestion of a new match syntax based on payloads:

doc.match("and {roles:'drummer'}") //or something

The matcher could simply know that when it sees {roles:'drummer'}, it has to go and find all committed views that have that data, return their term IDs, and use those to complete the match, like: and ringo|00012ABC starr|0A11A00B
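The lookup step could be sketched like this, again in plain JavaScript and purely as an assumption about how it might work. `committed` mirrors the commits structure proposed earlier in this comment, `findTermIds` is a hypothetical helper, and the `john`/`lennon` IDs are made-up placeholders (only the ringo and starr IDs come from the thread).

```javascript
// Hypothetical committed views: each entry holds term IDs and a payload.
const committed = {
  somehash1: {
    terms: ['ringo|00012ABC', 'starr|0A11A00B'],
    payload: { roles: ['drummer', 'singer'], hair: 'long' },
  },
  somehash2: {
    // placeholder IDs, invented for this example
    terms: ['john|PLACEHOLDER1', 'lennon|PLACEHOLDER2'],
    payload: { roles: ['singer', 'guitarist'] },
  },
}

// return the term-ID lists of every commit whose payload has key === value,
// or whose payload[key] is an array that includes value
function findTermIds(store, key, value) {
  return Object.values(store)
    .filter(({ payload }) => {
      const field = payload[key]
      return Array.isArray(field) ? field.includes(value) : field === value
    })
    .map((entry) => entry.terms)
}

// what a matcher might resolve {roles:'drummer'} into
const drummers = findTermIds(committed, 'roles', 'drummer')
```

The matcher would then treat each returned ID list as a literal sequence of terms to match, in place of the {roles:'drummer'} token.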

spencermountain commented 7 months ago

check out the compromise-payload plugin ⚡