stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0
9.67k stars, 2.7k forks

Sentence Splitting Issues #671

Closed jeffrschneider closed 6 years ago

jeffrschneider commented 6 years ago

Sentence splitting is failing on enumerated lists.

Test cases:
"Bob ate three things: 1. a pizza, 2. a pie and 3. a cookie."
"Bob ate three things: (1). a pizza, (2). a pie and (3). a cookie."

jeffrschneider commented 6 years ago

Another case where sentence splitting fails is addresses:

"Bob's address is 3715 1st Street apt. 122, Austin, Texas 32901"

Common abbreviations for addresses include:
Rt. rt. (route)
Apt. apt. (apartment)
Cv. cv. (cove)
Bldg. bldg. (building)
St. st. (street)
Blvd. blvd. (boulevard)
...

For more examples, see: https://wiki.acstechnologies.com/display/ACSDOC/Common+Approved+Address+Abbreviations

manning commented 6 years ago

At the end of the day, tokenization and sentence splitting in CoreNLP are currently rule-based, with the former being a finite automaton, so there are some limits to what can be done. Some of these things could really only be attempted by a machine learning model that uses more context.
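To make that limit concrete, here is a minimal sketch of a purely local splitting rule. It is not CoreNLP's tokenizer or sentence splitter; the class name `NaiveSplitter` and the rule itself (split after `.`, `!` or `?` when whitespace or end of input follows) are assumptions chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT how CoreNLP is implemented.
final class NaiveSplitter {

    // Local rule: end a sentence after '.', '!' or '?' when the next
    // character is whitespace or the input ends. No wider context is used.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean terminator = c == '.' || c == '!' || c == '?';
            boolean atBoundary = i + 1 == text.length()
                    || Character.isWhitespace(text.charAt(i + 1));
            if (terminator && atBoundary) {
                String sentence = text.substring(start, i + 1).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
                start = i + 1;
            }
        }
        // Keep any trailing text that has no final terminator.
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Bob ate three things: 1. a pizza, 2. a pie and 3. a cookie.";
        for (String s : split(text)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

On the first test case from this issue, this rule yields four fragments, because "1.", "2." and "3." look exactly like sentence ends when only the adjacent characters are inspected. Telling a list marker from a sentence-final number needs more context than a rule of this shape can see.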

So, I don't really see how the Apr 10 issue can be addressed within the basic framework that exists now, so we're punting on that one. But it is possible to add abbreviations and, on balance, to improve things considerably, though there too there are issues of precision and recall. I've added several new abbreviations, including Apt., Rt., and incl.
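The precision and recall trade-off of an abbreviation list can be sketched as follows. This is a simplified stand-in, not the actual CoreNLP change: the class `AbbreviationAwareSplitter`, its constructor, and the lowercase lookup are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is NOT the CoreNLP implementation.
final class AbbreviationAwareSplitter {
    private final Set<String> abbreviations; // lowercase, trailing period included

    AbbreviationAwareSplitter(Set<String> abbreviations) {
        this.abbreviations = abbreviations;
    }

    List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;      // start of the current sentence
        int tokenStart = 0; // start of the current whitespace-delimited token
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                tokenStart = i + 1;
                continue;
            }
            boolean terminator = c == '.' || c == '!' || c == '?';
            boolean atBoundary = i + 1 == text.length()
                    || Character.isWhitespace(text.charAt(i + 1));
            if (!terminator || !atBoundary) {
                continue;
            }
            // A period that closes a listed abbreviation never ends a sentence.
            String token = text.substring(tokenStart, i + 1).toLowerCase(Locale.ROOT);
            if (c == '.' && abbreviations.contains(token)) {
                continue;
            }
            sentences.add(text.substring(start, i + 1).trim());
            start = i + 1;
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        AbbreviationAwareSplitter splitter =
                new AbbreviationAwareSplitter(Set.of("apt.", "rt.", "incl."));
        // Kept whole: "apt." is protected by the list.
        System.out.println(splitter.split(
                "Bob's address is 3715 1st Street apt. 122, Austin, Texas 32901"));
        // Also kept whole, although here "apt." really does end a sentence.
        System.out.println(splitter.split("I looked at the apt. It was small."));
    }
}
```

With "apt." in the list, the address example stays in one sentence, but so does "I looked at the apt. It was small.", where a real boundary is now missed. Each added abbreviation removes some false splits at the cost of some missed ones.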

manning commented 6 years ago

So thanks for pointing out some that were missing!