scrapinghub / number-parser

Parse numbers written in natural language
BSD 3-Clause "New" or "Revised" License

Adding support for languages with discernible delimiters #40

Open arnavkapoor opened 4 years ago

arnavkapoor commented 4 years ago

Languages without delimiters - Japanese and Chinese (Simplified and Traditional), and possibly other East Asian languages, don't put any delimiter between number words, e.g. 九千九百九十九 (9999 in Japanese). Their structure is actually very similar to English, but the lack of a delimiter makes parsing tougher. German and Dutch also write numbers without a delimiter (up to a certain number).

One approach I have in mind for the delimiter problem is to read the string character by character and, as soon as the characters read so far match one of the known number words, insert a space. After this pre-processing step we can follow the same logic as for the other languages. This does increase the complexity to O(string_length ^ 2), which I believe shouldn't be a major issue. (We can use this function only for the languages without delimiters.)

Concrete example

five thousand nine hundred and thirteen - English (5913) 
fünftausendneunhundertdreizehn - German (5913)

nine hundred and thirteen - English (913)
negenhonderddertien - Dutch (913)

To handle this, we first check f, fü, fün and finally hit fünf = 5 (and similarly negen = 9), insert a space, and then start again from the next character.
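The pre-processing step could be sketched roughly as follows. This is only an illustration, not number-parser code: `insert_delimiters` and the tiny `GERMAN_TOKENS` / `DUTCH_TOKENS` vocabularies are hypothetical names made up for the example. Note one deliberate change from the description above: the sketch tries the longest known word first rather than stopping at the first match, because a shortest-first scan would split dreizehn into drei + zehn.

```python
# Hypothetical, deliberately tiny vocabularies; real ones would be much larger.
GERMAN_TOKENS = {"fünf", "neun", "drei", "zehn", "dreizehn", "tausend", "hundert"}
DUTCH_TOKENS = {"negen", "honderd", "dertien"}


def insert_delimiters(text, vocabulary):
    """Return `text` with a space inserted between the known number words.

    At each position the longest matching word from `vocabulary` is taken.
    Raises ValueError if no known word starts at the current position.
    """
    max_len = max(len(token) for token in vocabulary)
    tokens = []
    position = 0
    while position < len(text):
        # Try candidates from longest to shortest so that 'dreizehn'
        # wins over its prefix 'drei'.
        for length in range(min(max_len, len(text) - position), 0, -1):
            candidate = text[position:position + length]
            if candidate in vocabulary:
                tokens.append(candidate)
                position += length
                break
        else:
            raise ValueError(
                f"no known token at position {position}: {text[position:]!r}"
            )
    return " ".join(tokens)


print(insert_delimiters("fünftausendneunhundertdreizehn", GERMAN_TOKENS))
print(insert_delimiters("negenhonderddertien", DUTCH_TOKENS))
```

Because the inner loop is bounded by the longest word in the vocabulary, the scan stays close to linear in the string length for a fixed vocabulary. Greedy longest-match does no backtracking, so a language where a long word can hide a valid shorter split would need a more careful search.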

noviluni commented 4 years ago

Just to give another approach for German and Dutch: depending on the number of unique tokens, we could do the inverse process and search the string for each known token, replacing it with its numeric value.

As an example (I haven't thought about how to implement it, it's just an idea):

>>> s = 'fünftausendneunhundertdreizehn'   
>>> s.replace('fünf', '5*').replace('tausend', '1000+').replace('neun', '9*').replace('hundert', '100+').replace('dreizehn', '13')
'5*1000+9*100+13'

This could reduce the complexity for long numbers.
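To make the idea a bit more concrete, here is a rough sketch of how the replacements could be driven by a table and the resulting expression evaluated without `eval()`. Everything here is an assumption for illustration: `REPLACEMENTS`, `to_expression` and `evaluate` are hypothetical names, and the table only covers the words in the example.

```python
import ast
import operator

# Hypothetical table mirroring the replacements above. Order matters:
# 'dreizehn' must come before any shorter word it contains (e.g. 'drei').
REPLACEMENTS = [
    ("dreizehn", "13"),
    ("tausend", "1000+"),
    ("hundert", "100+"),
    ("fünf", "5*"),
    ("neun", "9*"),
]

_OPERATORS = {ast.Add: operator.add, ast.Mult: operator.mul}


def to_expression(text):
    """Replace every known number word with its arithmetic fragment."""
    for word, symbol in REPLACEMENTS:
        text = text.replace(word, symbol)
    # Drop a dangling operator, e.g. 'fünfhundert' -> '5*100+' -> '5*100'.
    return text.rstrip("+*")


def evaluate(expression):
    """Evaluate an expression made only of integers, '+' and '*'."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
            return _OPERATORS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")

    return walk(ast.parse(expression, mode="eval").body)


expression = to_expression("fünftausendneunhundertdreizehn")
print(expression, "=", evaluate(expression))
```

One caveat of this sketch: relying on plain operator precedence breaks for nested multipliers such as fünfhunderttausend (500000), which would come out as `5*100+1000`, so a real implementation would need grouping on top of the replacements.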

Tejasvinarora0110 commented 4 years ago

Why can't we just translate all other languages to English and then convert the result to numbers? I guess this would reduce the effort. Translation can be done using Googletrans.

noviluni commented 3 years ago

Hi @Tejasvinarora0110, sorry for the late answer.

There are multiple reasons to avoid using Google translator:

  1. This library is intended to work offline.
  2. We want to keep the list of dependencies as small as possible.
  3. Keeping each language independent of the others (such as English) allows developing language-specific solutions.
  4. Avoiding external services will improve performance.