mideind / GreynirEngine

A fast, efficient natural language processing engine for Icelandic.
https://greynir.is

Tokenization of written numbers #45

Closed (sultur closed this 2 years ago)

sultur commented 3 years ago

Currently, terminals of the form {nationality} {currency} (e.g. "breskra punda", "indónesískra rúpía") have their number and case identified incorrectly (see test_parse.py::test_amounts).

We may also want to add large numbers ("kvaðrilljarður" and upwards) to our BÍN vocabulary.

sveinbjornt commented 2 years ago

Is this good to merge, @vthorsteinsson ?

vthorsteinsson commented 2 years ago

There are still some commented-out currency names (bresk pund, indónesískar rúpíur) in the tests, so I'm not sure whether this is fully tested and vetted yet. @sultur

sultur commented 2 years ago

I wasn't quite happy with this implementation, as it overrides the default behaviour of the BÍN tokenizer (and caused a bunch of tests to fail). I am working on a less drastic alternative that adds an optional flag to turn off the combining of written numbers (and also fixes the kró bug). We can probably just close this pull request without merging it, and I'll try to create the new, smaller pull request as soon as possible.
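
For readers following the thread, the idea of an optional flag that switches off the combining of written numbers might look roughly like the sketch below. This is only an illustration: `tokenize`, `combine_numbers`, and `NUMBER_WORDS` are hypothetical names invented here, not part of GreynirEngine or its tokenizer, and the word list is a tiny stand-in for a real vocabulary.

```python
# Hypothetical illustration only; not the GreynirEngine or Tokenizer API.
# A tiny stand-in vocabulary of written-number words.
NUMBER_WORDS = {"eitt", "tvö", "þrjú", "hundrað", "hundruð", "þúsund"}


def tokenize(text, combine_numbers=True):
    """Split on whitespace; optionally merge runs of written-number words.

    With combine_numbers=False every word stays a separate token, which
    mirrors the opt-out behaviour discussed in the thread.
    """
    tokens = []
    run = []  # current run of consecutive number words
    for word in text.split():
        if combine_numbers and word.lower() in NUMBER_WORDS:
            run.append(word)
            continue
        if run:
            # A non-number word ends the run: emit it as one token.
            tokens.append(" ".join(run))
            run = []
        tokens.append(word)
    if run:
        tokens.append(" ".join(run))
    return tokens


combined = tokenize("þrjú hundruð breskra punda")
separate = tokenize("þrjú hundruð breskra punda", combine_numbers=False)
```

Here `combined` keeps "þrjú hundruð" together as one token, while `separate` leaves all four words as individual tokens. Keeping the combining behaviour as the default and making it opt-out means callers that rely on the existing behaviour would not be affected.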