microsoft / BlingFire

A lightning fast Finite State machine and REgular expression manipulation library.
MIT License

Word Tokenization - Unexpected Output #139

Open albertnanda opened 2 years ago

albertnanda commented 2 years ago

Is the following tokenization expected?

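import blingfire
import spacy
nlp = spacy.load("en_core_web_sm")  # assumption: the issue does not say which spaCy model was loaded
text = '''Mr. G. B. Shaw, known at his insistence simply as Bernard Shaw, was an Irish playwright.'''
print(blingfire.text_to_words(text).split())  # BlingFire
print(list(nlp(text)))  # spaCy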

['Mr', '.', 'G', '.', 'B', '.', 'Shaw', ',', 'known', 'at', 'his', 'insistence', 'simply', 'as', 'Bernard', 'Shaw', ',', 'was', 'an', 'Irish', 'playwright', '.']
[Mr., G., B., Shaw, ,, known, at, his, insistence, simply, as, Bernard, Shaw, ,, was, an, Irish, playwright, .]

The dot (.) in Mr. and G. should not be treated as a distinct token; each abbreviation, including its dot, should be a single token (Mr., G.).
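
Until the tokenizer handles this itself, one possible workaround is to post-process the token list and re-attach the dot to abbreviations and initials. The sketch below is not part of the BlingFire API: the function name `merge_abbreviation_periods` and the `ABBREVIATIONS` set are hypothetical, and the rule (a known title, or a single uppercase letter, followed by a `.` token) is an assumption that only covers cases like the one reported here. It works on the token list copied from the BlingFire output above, so it runs without BlingFire installed.

```python
# Hypothetical post-processing helper; NOT a BlingFire function.
# Illustrative, non-exhaustive set of titles that take a trailing dot.
ABBREVIATIONS = {"Mr", "Mrs", "Ms", "Dr", "Prof"}


def merge_abbreviation_periods(tokens, abbreviations=ABBREVIATIONS):
    """Re-attach a '.' token to a preceding abbreviation or uppercase initial."""
    merged = []
    for token in tokens:
        previous = merged[-1] if merged else None
        is_abbreviation = previous in abbreviations
        is_initial = (
            previous is not None
            and len(previous) == 1
            and previous.isalpha()
            and previous.isupper()
        )
        if token == "." and (is_abbreviation or is_initial):
            # Glue the dot onto the previous token instead of emitting it alone.
            merged[-1] = previous + "."
        else:
            merged.append(token)
    return merged


# Token list copied from the BlingFire output reported in this issue.
tokens = ['Mr', '.', 'G', '.', 'B', '.', 'Shaw', ',', 'known', 'at', 'his',
          'insistence', 'simply', 'as', 'Bernard', 'Shaw', ',', 'was', 'an',
          'Irish', 'playwright', '.']
result = merge_abbreviation_periods(tokens)
print(result)
```

With a real BlingFire install, the input would presumably be `blingfire.text_to_words(text).split()` instead of the hard-coded list. The heuristic is deliberately narrow: a sentence that ends in a single capital letter (for example "Plan B.") would also have its final dot merged, so the rule may need tightening for other text.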