stanfordnlp / python-stanford-corenlp

Python interface to CoreNLP using a bidirectional server-client interface.
MIT License

Add support for tokensregex/semgrex/tregex #4

Closed dan-zheng closed 7 years ago

dan-zheng commented 7 years ago

Description

This branch adds basic support for tokensregex/semgrex/tregex. Users can perform these regex queries via methods exposed by the CoreNLPClient object.

For tokensregex and semgrex, users can enable a to_words flag that converts the output from the default sentence-separated format into a flat list of mentions.

(Note: to_words flattens only the top-level matches. For tokensregex queries with nested capture groups, the nested matches are left untouched inside each flattened match.)
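The flattening itself is easy to picture. A minimal sketch of the idea (not the PR's actual implementation; the function name is hypothetical), assuming the sentence-separated format shown in the demos below, where each sentence dict holds matches under stringified indices plus a length count:

```python
def flatten_matches(matches):
    """Flatten {'sentences': [...]} into a flat list of mention dicts.

    Nested capture groups inside each match are left untouched,
    mirroring the to_words behavior described above.
    """
    words = []
    for sent_index, sentence in enumerate(matches['sentences']):
        for match_index in range(sentence['length']):
            mention = dict(sentence[str(match_index)])
            mention['sent'] = sent_index  # remember which sentence it came from
            words.append(mention)
    return words

matches = {
    'sentences': [
        {'length': 0},
        {'0': {'text': 'Ross', 'begin': 1, 'end': 2}, 'length': 1},
    ]
}
print(flatten_matches(matches))
# → [{'text': 'Ross', 'begin': 1, 'end': 2, 'sent': 1}]
```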

Examples

tokensregex Demo

import json

import corenlp

annotators = 'tokenize ssplit ner depparse'.split()
client = corenlp.CoreNLPClient(annotators=annotators)

# Example pattern from: https://nlp.stanford.edu/software/tokensregex.shtml
text = 'Hello. Bob Ross was a famous painter. Goodbye.'
pattern = '([ner: PERSON]+) /was|is/ /an?/ []{0,3} /painter|artist/'
matches = client.tokensregex(text, pattern)
print(json.dumps(matches, indent=2))

Output:

{
  "sentences": [
    {
      "length": 0
    },
    {
      "0": {
        "text": "Ross was a famous painter",
        "begin": 1,
        "end": 6,
        "1": {
          "text": "Ross",
          "begin": 1,
          "end": 2
        }
      },
      "length": 1
    },
    {
      "length": 0
    }
  ]
}
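To make the shape of this output concrete, here is an illustrative walk over the structure above (plain post-processing, not part of the PR's API): each sentence stores its top-level matches under stringified indices plus a length count, with capture groups nested under their group numbers.

```python
# The tokensregex output shown above, as a Python dict.
result = {
    "sentences": [
        {"length": 0},
        {
            "0": {
                "text": "Ross was a famous painter",
                "begin": 1,
                "end": 6,
                "1": {"text": "Ross", "begin": 1, "end": 2},
            },
            "length": 1,
        },
        {"length": 0},
    ]
}

pairs = []
for sentence in result["sentences"]:
    for i in range(sentence["length"]):
        match = sentence[str(i)]
        # Capture group 1 holds the ([ner: PERSON]+) tokens.
        pairs.append((match["1"]["text"], match["text"]))
print(pairs)
# → [('Ross', 'Ross was a famous painter')]
```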

semgrex Demo

import json

import corenlp

annotators = 'tokenize ssplit depparse'.split()
client = corenlp.CoreNLPClient(annotators=annotators)

# '{} < {}' matches any node that has a parent in the dependency graph.
text = 'I ran.'
pattern = '{} < {}'
matches = client.semgrex(text, pattern, to_words=True)
print(json.dumps(matches, indent=2))

Output:

[
  {
    "text": ".",
    "begin": 2,
    "end": 3,
    "sent": 0
  },
  {
    "text": "I",
    "begin": 0,
    "end": 1,
    "sent": 0
  }
]

more on the way