nert-nlp / cgel

CGEL trees.
Creative Commons Attribution 4.0 International

move punctuation tokens to preceding token? #113

Closed by nschneid 2 months ago

nschneid commented 3 months ago

Policy in original CGELBank:

[image: original CGELBank punctuation policy]

It occurs to me that most common punctuation marks (commas, periods) are written orthographically with the previous word. It is less intuitive to put them in the subsequent-word node, and the tree visualization looks strange:

[image: tree visualization]

Nodes with punctuation are:

Suggested new policy: all :p annotations will group with the preceding word, except (a) punctuation tokens containing "(" or "[", along with any run of punctuation tokens immediately following one of those, and (b) sentence-initial punctuation.

Under the new policy, the above sentence would have:

Logically, quotes could be treated like open parens/brackets, but because " and ' are ambiguous between opening and closing uses, maybe we shouldn't go there.
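
A rough sketch of the suggested grouping rule, for illustration only. This is not the code in the PR. The function names, the flat token-list input, and the "no letters or digits means punctuation" test are all assumptions made for the example. Quotes get no special treatment here, so they fall under the default rule.

```python
def is_punct(token):
    """Hypothetical test: a token with no letters or digits counts as punctuation."""
    return not any(ch.isalnum() for ch in token)


def attach_punctuation(tokens):
    """Group each punctuation token with a word under the suggested policy.

    Returns one [pre_punct, word, post_punct] entry per word token.
    """
    nodes = []    # one [pre, word, post] entry per word seen so far
    pending = []  # punctuation waiting to group with the next word
    for tok in tokens:
        if not is_punct(tok):
            nodes.append([pending, tok, []])
            pending = []
        elif pending or not nodes or "(" in tok or "[" in tok:
            # Exception (a): an opening paren/bracket, or punctuation that
            # continues a run started by one.
            # Exception (b): sentence-initial punctuation.
            # Both group with the FOLLOWING word.
            pending.append(tok)
        else:
            # Default: group with the PRECEDING word.
            nodes[-1][2].append(tok)
    if pending and nodes:
        # Assumption: a trailing run with no following word falls back
        # to the last word.
        nodes[-1][2].extend(pending)
    return nodes


if __name__ == "__main__":
    example = ["Well", ",", "(", '"', "maybe", ")", "it", "works", "."]
    for pre, word, post in attach_punctuation(example):
        print(pre, word, post)
```

In this sketch the comma and period land on the preceding words, while "(" and the quote right after it land on "maybe".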

Thoughts? @BrettRey @bwaldon

nschneid commented 3 months ago

PR implementing this: #114

BrettRey commented 2 months ago

I have no objections.

nschneid commented 2 months ago

Updated guidelines—hope these are clearer:

[images: updated guidelines]