on CRF truecasing

Brucewuzhang commented 4 years ago

Hi, thanks for making this repo. I want to ask whether a CRF model is better than this n-gram model, because I saw that Stanford NLP implemented a CRF truecaser.

nreimers commented 4 years ago

Hi @Brucewuzhang, I am not sure you really need a CRF.

In general, casing is not that complex; it is quite easy. For most lowercased words, there is only one correct casing, e.g. the correct casing for WhatsApp is always WhatsApp, and for Facebook it is always Facebook.
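To illustrate the point, here is a minimal sketch of a lookup that picks the most frequent casing seen for each lowercased word. It is not the code from this repo: the function names `build_casing_table` and `truecase` and the toy corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def build_casing_table(tokens):
    """Count how often each cased form occurs for every lowercased word."""
    counts = defaultdict(Counter)
    for tok in tokens:
        counts[tok.lower()][tok] += 1
    return counts


def truecase(tokens, counts):
    """Replace each token with its most frequent observed casing.

    Tokens never seen in the corpus are left unchanged.
    """
    out = []
    for tok in tokens:
        forms = counts.get(tok.lower())
        out.append(forms.most_common(1)[0][0] if forms else tok)
    return out


# Toy corpus, invented for this example only.
corpus = (
    "I use WhatsApp and Facebook . Facebook bought WhatsApp . "
    "I ate an apple . Apple sells phones . an apple a day"
).split()

table = build_casing_table(corpus)
result = truecase("i saw it on facebook and whatsapp".split(), table)
print(result)

# Words like apple/Apple end up with more than one observed casing.
print(dict(table["apple"]))
```

Words such as WhatsApp and Facebook have a single entry in the table, so the lookup is unambiguous. For apple/Apple the table holds two forms, and a pure frequency lookup will always pick the same one regardless of meaning.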

For some words, there might be two casings, like apple and Apple. But this difference would not be captured by a CRF.

A CRF would help if the casing of a word depended on the casing of the neighboring words. But this is not really the case. E.g. for apple/Apple, the casing does not depend on whether the word to the left or right is upper or lower case; it depends on what you are referring to (fruit or company).

nreimers commented 4 years ago

Also have a look at the develop branch: https://github.com/nreimers/truecaser/tree/develop

It adds an LSTM truecaser that would be able to distinguish between apple and Apple.

Brucewuzhang commented 4 years ago

Thank you. I will take a look at this development branch.

nreimers / truecaser

on CRF truecasing #8