nreimers / truecaser

Language independent truecaser in Python.
Apache License 2.0
161 stars, 40 forks

on CRF truecasing #8

Closed Brucewuzhang closed 4 years ago

Brucewuzhang commented 4 years ago

Hi, thanks for making this repo. I want to ask whether a CRF model is better than this n-gram model, because I saw that Stanford NLP implemented a CRF truecaser.

nreimers commented 4 years ago

Hi @Brucewuzhang, I am not sure you would really need a CRF.

In general, casing is not that complex; it is quite an easy task. For most lowercased words, there is only one correct casing, e.g. the correct casing for whatsapp is always WhatsApp, and for facebook it is always Facebook.
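To illustrate the "one correct casing per word" point, here is a minimal sketch of a lookup that maps each lowercased word to its most frequent casing. It is not the repository's implementation; the function names `build_casing_table` and `truecase` and the tiny corpus are hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict


def build_casing_table(sentences):
    """Map each lowercased word to its most frequently observed casing."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Skip the first token: sentence-initial capitalization says
        # little about the word's usual casing.
        for token in sentence.split()[1:]:
            counts[token.lower()][token] += 1
    return {lower: c.most_common(1)[0][0] for lower, c in counts.items()}


def truecase(text, table):
    """Replace every known word with its most frequent casing."""
    return " ".join(table.get(tok, tok) for tok in text.lower().split())


# Hypothetical toy corpus, for illustration only.
corpus = [
    "I sent it over WhatsApp and Facebook yesterday",
    "She uses WhatsApp more than Facebook",
]
table = build_casing_table(corpus)
print(truecase("he prefers whatsapp over facebook", table))
# prints "he prefers WhatsApp over Facebook"
```

A table like this handles words with a single dominant casing, but it always returns the same answer for an ambiguous word such as apple.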

For some words, there might be two casings, like apple and Apple. But a CRF would not capture this difference.

A CRF would help if the casing of a word depended on the casing of the neighboring words. But this is not really the case. E.g. for apple/Apple, the casing does not depend on whether the word to the left or right is upper or lower case; it depends on what you are referring to (the fruit or the company).
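As a rough illustration of this argument, the sketch below picks a casing from the identity of the previous word rather than from its casing. It is a simplified bigram count under stated assumptions, not the scoring used in this repository; `train_bigrams`, `pick_casing`, and the toy corpus are hypothetical. Neighboring words are only a proxy for what is being referred to.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count (previous word lowercased, current word as cased) pairs."""
    pair_counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for prev, cur in zip(tokens, tokens[1:]):
            # The previous word is lowercased on purpose: its casing is
            # discarded, only its identity is used as context.
            pair_counts[(prev.lower(), cur)] += 1
    return pair_counts


def pick_casing(prev_word, candidates, pair_counts):
    """Return the candidate seen most often after prev_word.

    Ties (including unseen pairs) fall back to the first candidate.
    """
    return max(candidates, key=lambda c: pair_counts[(prev_word.lower(), c)])


# Hypothetical toy corpus, for illustration only.
corpus = [
    "He ate an apple after lunch",
    "She peeled an apple slowly",
    "The new Apple laptop is fast",
    "Shares of Apple rose today",
]
pair_counts = train_bigrams(corpus)

print(pick_casing("an", ["apple", "Apple"], pair_counts))  # prints "apple"
print(pick_casing("of", ["apple", "Apple"], pair_counts))  # prints "Apple"
```

The context word looks the same in lowercased input whatever its original casing was, so the casing labels of neighbors add nothing to the decision here.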

nreimers commented 4 years ago

Also have a look at the develop branch: https://github.com/nreimers/truecaser/tree/develop

It adds an LSTM truecaser that would be able to distinguish between apple and Apple.

Brucewuzhang commented 4 years ago

Thank you. I will take a look at the develop branch.