jeffrschneider closed 6 years ago
Another case is failed sentence splitting on addresses:
"Bob's address is 3715 1st Street apt. 122, Austin, Texas 32901"
Common abbreviations for addresses include:
rt. Rt. (route)
Apt. apt. (apartment)
Cv. cv. (cove)
Bldg. bldg. (building)
St. st. (street)
Blvd. blvd. (boulevard)
...
For more examples, see: https://wiki.acstechnologies.com/display/ACSDOC/Common+Approved+Address+Abbreviations
Ultimately, tokenization and sentence splitting in CoreNLP are currently rule-based, and the tokenizer is a finite automaton, so there are limits to what can be done. Some of these cases could really only be attempted by a machine learning model that uses more context.
I don't really see how the Apr 10 issue can be addressed within the basic framework that exists now, so we're punting on that one. But it is possible to add abbreviations and, on balance, improve things considerably, though there too there are trade-offs between precision and recall. I've added several new abbreviations, including Apt., Rt., and incl.
So thanks for pointing out some that were missing!
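For anyone following along, here is a rough sketch of the precision/recall trade-off that comes with an abbreviation list. It is a toy whitespace-based splitter, not CoreNLP's tokenizer or sentence splitter, and the names `ABBREVIATIONS` and `split_sentences` are made up for illustration.

```python
# Toy rule-based splitter; an illustration only, not CoreNLP's implementation.
# ABBREVIATIONS is a hypothetical list, not the one shipped with CoreNLP.
ABBREVIATIONS = frozenset({"apt.", "rt.", "st.", "blvd.", "bldg.", "cv.", "incl."})


def split_sentences(text, abbreviations=frozenset()):
    """End a sentence after any token ending in . ! or ?
    unless that token (lowercased) is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that has no final punctuation
        sentences.append(" ".join(current))
    return sentences


address = "Bob's address is 3715 1st Street apt. 122, Austin, Texas 32901"

# Without the list, "apt." is taken as a sentence end (a false split).
print(split_sentences(address))

# With the list, the address stays in one sentence.
print(split_sentences(address, ABBREVIATIONS))

# The cost: "St." at a real sentence boundary is now never split.
ambiguous = "He lives on Main St. She lives nearby."
print(split_sentences(ambiguous, ABBREVIATIONS))
```

The last call shows why adding abbreviations is not free: a purely list-based rule fixes the address case but merges two real sentences whenever one of them ends in a listed abbreviation.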
Sentence splitting is failing on enumerated lists.
Test cases:
"Bob ate three things: 1. a pizza, 2. a pie and 3. a cookie."
"Bob ate three things: (1). a pizza, (2). a pie and (3). a cookie."
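To make the failure mode concrete, here is a small sketch using a toy splitter (again, not CoreNLP itself). It shows how a "period ends the sentence" rule breaks on list markers, and one possible guard. The `ENUMERATOR` pattern and the `skip_enumerators` flag are assumptions for illustration only.

```python
import re

# Hypothetical pattern for list markers such as "1." or "(1).";
# limited to one or two digits to reduce clashes with ordinary numbers.
ENUMERATOR = re.compile(r"\(?\d{1,2}\)?\.")


def split_sentences(text, skip_enumerators=False):
    """Toy splitter: end a sentence after a token ending in . ! or ?
    optionally ignoring tokens that look like list markers."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        is_marker = skip_enumerators and ENUMERATOR.fullmatch(token) is not None
        if token[-1] in ".!?" and not is_marker:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that has no final punctuation
        sentences.append(" ".join(current))
    return sentences


cases = [
    "Bob ate three things: 1. a pizza, 2. a pie and 3. a cookie.",
    "Bob ate three things: (1). a pizza, (2). a pie and (3). a cookie.",
]

for case in cases:
    # The naive rule produces four fragments per test case.
    print(len(split_sentences(case)), split_sentences(case))
    # With the guard, each test case stays a single sentence.
    print(len(split_sentences(case, skip_enumerators=True)))
```

The guard has its own recall cost: a sentence that really ends in a short number ("He turned 30. She left.") would no longer be split, so a practical rule would probably need more context than a single token.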