Normalization metrics?

t-shoemaker commented 3 years ago

First off, this is a great library. I've been quite impressed with the results of its normalization functionality. One thing I'd be keen to see, however, is some kind of metric attached to each of the top-n candidate words produced by the normalization process (say, for instance, vector similarity or the NMT model's prediction score). Is Natas already capable of doing this (in which case I'm missing something), or are there plans to implement such functionality?

mikahama commented 3 years ago

I have been thinking about the NMT prediction score, but I did not find a way to get it out of OpenNMT. There must be a way, but OpenNMT's documentation mostly covers running it from the terminal and says little about using it as a library. If you are interested in figuring out how to get the prediction score out, that would be great.

Anyways, I think that the best way of determining the right normalization candidate would be to do it contextually. Currently, Natas only normalizes one word at a time. You could use a language model to score the candidate normalizations in the context of their sentence and pick the combination that forms the most plausible sentence.
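A minimal sketch of that contextual ranking idea, under loud assumptions: the add-one smoothed bigram model below is only a stand-in for a real language model, the three-sentence corpus and the candidate lists are invented for illustration, and none of the function names (`train_bigram_lm`, `sentence_logprob`, `pick_normalizations`) come from Natas. It scores every combination of per-word candidates and keeps the one the model finds most probable.

```python
import itertools
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over tokenized sentences (toy stand-in LM)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def sentence_logprob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a token sequence."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


def pick_normalizations(candidates_per_word, unigrams, bigrams):
    """Score every combination of candidates exhaustively and return the best.

    Exhaustive search is only practical for short sentences and small
    candidate lists; a beam search would be needed beyond that.
    """
    best = max(
        itertools.product(*candidates_per_word),
        key=lambda combo: sentence_logprob(list(combo), unigrams, bigrams),
    )
    return list(best)


# Invented toy corpus; a real setup would use a large corpus or a neural LM.
corpus = [
    "i have a secret to tell".split(),
    "she kept the secret".split(),
    "the meat was fresh".split(),
]
unigrams, bigrams = train_bigram_lm(corpus)

# Hypothetical candidate lists, one list per word of the input sentence.
candidates = [["i"], ["have"], ["a"], ["secrete", "secret"], ["to"], ["tell"]]
best = pick_normalizations(candidates, unigrams, bigrams)
print(best)
```

The same structure should carry over to a stronger model: only `sentence_logprob` would need to be swapped for whatever scoring function that model provides.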

t-shoemaker commented 3 years ago

Thanks for the response! After some hunting around, it looks like OpenNMT models do indeed output a prediction score, which we can capture (you're right about the library's documentation being sparse). I'll open a PR with my attempt at doing so. I'm not very familiar with this particular translation model, however, so if you catch any problems, let me know and I'm happy to tweak things or consult further.
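As a rough illustration of what a candidate-level metric could look like once the scores are captured, here is a sketch that pairs each candidate with its score and normalizes over the n-best list. It assumes the captured scores are log-likelihoods; the helper `rank_with_confidence`, the candidate words, and the score values are all made up for the example and are not the Natas API or the contents of the PR.

```python
import math


def rank_with_confidence(candidates, log_scores):
    """Pair candidates with their scores and softmax-normalize over the list.

    Returns (candidate, log_score, relative_weight) tuples, best first.
    The weights are relative to the n-best list only, not true probabilities.
    """
    if len(candidates) != len(log_scores):
        raise ValueError("candidates and log_scores must have the same length")
    if not candidates:
        return []
    top = max(log_scores)  # subtract the max for numerical stability
    weights = [math.exp(score - top) for score in log_scores]
    total = sum(weights)
    rows = zip(candidates, log_scores, (w / total for w in weights))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Invented candidates and scores for a single input word.
ranked = rank_with_confidence(
    ["secrete", "secret", "seacreat"],
    [-1.9, -0.2, -4.5],
)
for word, log_score, weight in ranked:
    print(f"{word}\t{log_score:.2f}\t{weight:.3f}")
```

Normalizing within the list makes scores easier to compare across input words, though a threshold on the raw score may be more useful for rejecting bad normalizations outright.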

I like your idea of using another model to test for normalization validity. It seems like there would be interesting work to be done in determining whether a large model trained on contemporary language (like base BERT) would work well in that scenario, or whether you would instead need to train something from scratch on a period corpus (all of EEBO-TCP, for example). I suspect it would depend on what you use for test sentences and whether you prioritize coherence with the corpus or matching contemporary orthography.

mikahama / natas

Normalization metrics? #6