microsoft / BlingFire

A lightning fast Finite State machine and REgular expression manipulation library.
MIT License

Why are personal title abbreviations split? #81

Open chunggeonlee opened 3 years ago

chunggeonlee commented 3 years ago

When using BlingFire, sentences are not split correctly when a personal title (such as Mr. or Ms.) precedes a person's name.

For Example >>

(1) Original Text : On July 26, 2013, Michael G. Spinozzi, President of Sally Beauty Supply, notified the Company of his retirement from the Company with an anticipated effective date of November 8, 2013. Mr. Spinozzi has served as President of Sally Beauty Supply since 2006 and the Company is grateful for Mr. Spinozzis leadership and commitment to the success of the Company during his tenure as an officer of the Company.

Result:

:: On July 26, 2013, Michael G. Spinozzi, President of Sally Beauty Supply, notified the Company of his retirement from the Company with an anticipated effective date of November 8, 2013.

:: Mr.

:: Spinozzi has served as President of Sally Beauty Supply since 2006 and the Company is grateful for Mr. Spinozzis leadership and commitment to the success of the Company during his tenure as an officer of the Company.

(2) Original Text : The Company has promoted Claudia S. San Pedro, age 45, to Senior Vice President, Chief Financial Officer and Treasurer. Ms. Pedro served as Vice President of Investor Relations and Communications of the Company since January 2013 and as Vice President of Investor Relations from July 2010 until January 2013.

Result:

:: The Company has promoted Claudia S. San Pedro, age 45, to Senior Vice President, Chief Financial Officer and Treasurer.

:: Ms.

:: Pedro served as Vice President of Investor Relations and Communications of the Company since January 2013 and as Vice President of Investor Relations from July 2010 until January 2013.

My Code >>

import blingfire

# text_to_sentences returns the sentences separated by '\n', so split on it.
# The lambda must use its own argument x (the original referenced an undefined name, sentence).
fn = lambda x: blingfire.text_to_sentences(x).split('\n')

y = fn('Original Text : The Company has promoted Claudia S. San Pedro, age 45, to Senior Vice President, Chief Financial Officer and Treasurer. Ms. Pedro served as Vice President of Investor Relations and Communications of the Company since January 2013 and as Vice President of Investor Relations from July 2010 until January 2013.')

Is there any particular problem with this?

thanks

SergeiAlonichau commented 3 years ago

I will have it fixed.

Also feel free to fix it yourself if you have time. The current patterns for sentence breaking are here: https://github.com/microsoft/BlingFire/tree/master/ldbsrc/sbd .

Feel free to create an alternative model with a new directory name and patterns learned from text. If you want, you can try the pattern induction module I made for hyphenation (see the syllab.bin model compilation and fa_build_pats --help for details), use any other method, or just manually correct the existing patterns and recompile.
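
Until the patterns themselves are corrected, one possible stopgap is to post-process the output and re-join fragments that were cut right after a title. The following is only a minimal sketch under stated assumptions: merge_title_splits and the TITLES tuple are hypothetical names made up for illustration, not part of BlingFire, and the abbreviation list is an assumed starting point. It takes the list produced by splitting the text_to_sentences output on '\n'.

```python
# Hypothetical post-processing helper; NOT part of the BlingFire API.
# Assumed abbreviation list; extend it for your own data.
TITLES = ("Mr.", "Mrs.", "Ms.", "Dr.", "Prof.")


def ends_with_title(fragment):
    """Return True if the last whitespace-separated token is a known title."""
    words = fragment.split()
    return bool(words) and words[-1] in TITLES


def merge_title_splits(sentences):
    """Re-join a fragment with its predecessor when the predecessor ends in a title."""
    merged = []
    for sentence in sentences:
        if merged and ends_with_title(merged[-1]):
            # The previous fragment stopped right after "Mr.", "Ms.", etc.,
            # so treat the break as spurious and glue the two pieces together.
            merged[-1] = merged[-1] + " " + sentence
        else:
            merged.append(sentence)
    return merged


# Fragments as reported in example (1), shortened here for readability.
fragments = [
    "On July 26, 2013, Michael G. Spinozzi notified the Company of his retirement.",
    "Mr.",
    "Spinozzi has served as President of Sally Beauty Supply since 2006.",
]
repaired = merge_title_splits(fragments)
```

With the fragments above, the "Mr." piece is glued back onto the sentence that follows it, while the first sentence is left alone. The obvious limitation is that a sentence that genuinely ends with one of the listed abbreviations would be merged with the next one, so fixing the patterns remains the proper solution.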