nlplab / brat

brat rapid annotation tool (brat) - for all your textual annotation needs
http://brat.nlplab.org

issues with sentence breaking #886

Closed andorm closed 12 years ago

andorm commented 12 years ago

I am annotating a corpus that has multiple instances of the string "U.S.", as in "U.S. Marshals", "U.S. Attorney", and "U.S. District Judge" (see the paragraph below), and brat breaks all of these up into separate sentences: it treats "U.S." as a sentence of its own. I am losing a lot of entities and relations because it does not let me annotate across sentence boundaries. Any ideas? Thanks.

Sample paragraph: "Mr. Stanford was literally left with only the suit he was wearing at the time of the SEC Agents and U.S. Marshals seizure of property ..." the lawsuit said. A spokeswoman for the U.S. Attorney in Houston declined to comment on the lawsuit, as did a spokesman for the U.S. Securities and Exchange Commission. U.S. District Judge David Hittner, who ruled that Stanford is incompetent to stand trial in his current mental state, has also declared the former billionaire indigent.

ghost commented 12 years ago

@andorm: In v1.3 there will be a variable to turn off the sentence splitter and only rely on the newlines in the source file. A work-around for v1.2 is described in issue #777. Most sentence splitters make mistakes, but admittedly the default one in brat is geared towards biomedical text and we should consider implementing a more general one.
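To illustrate why abbreviations such as "U.S." trip up a punctuation-based splitter, and why relying on newlines alone sidesteps the problem, here is a minimal sketch. It is not brat's actual splitter: the regular expression and the function names `naive_split` and `newline_split` are hypothetical and chosen only for illustration, and the sample text is shortened from the paragraph above.

```python
import re

# Hypothetical naive rule (NOT brat's real splitter): a sentence ends at
# '.', '!' or '?' when followed by whitespace and an uppercase letter.
_NAIVE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')


def naive_split(text):
    """Split on punctuation-based boundaries; abbreviations cause false breaks."""
    return [piece for piece in _NAIVE_BOUNDARY.split(text) if piece]


def newline_split(text):
    """Treat each non-empty line of the source text as one sentence."""
    return [line for line in text.split('\n') if line.strip()]


# Two sentences, one per line, both containing the abbreviation "U.S.".
text = (
    "A spokeswoman for the U.S. Attorney in Houston declined to comment.\n"
    "U.S. District Judge David Hittner has also declared him indigent."
)

naive = naive_split(text)      # breaks after every "U.S."
by_line = newline_split(text)  # keeps one sentence per source line

for sentence in naive:
    print("naive  :", sentence)
for sentence in by_line:
    print("newline:", sentence)
```

With newline-only splitting, the annotator controls sentence boundaries by placing one sentence per line in the source file, so no abbreviation list is needed.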

andorm commented 12 years ago

Thanks for the reference, I will try that.

spyysalo commented 12 years ago

Related to #786

ghost commented 12 years ago

Closing, see #786 instead.