Open arnavkapoor opened 4 years ago
Just to offer another approach for German and Dutch: depending on the number of unique tokens, we could do the inverse process and replace each known token in the string with its numeric value.
As an example (I haven't thought about how to implement it, it's just an idea):
>>> s = 'fünftausendneunhundertdreizehn'
>>> s.replace('fünf', '5*').replace('tausend', '1000+').replace('neun', '9*').replace('hundert', '100+').replace('dreizehn', '13')
'5*1000+9*100+13'
This could reduce the complexity for long numbers.
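A minimal sketch of how the idea above might be taken one step further, assuming the replacement only ever produces digits, `*` and `+`. The names `REPLACEMENTS`, `to_expression` and `evaluate_sum_of_products` are hypothetical, and the table covers only the tokens from the example. The expression is evaluated as a sum of products instead of being passed to `eval`.

```python
from math import prod

# Hypothetical token table, limited to the words used in the example above.
# Order matters: replacements are applied in the order listed.
REPLACEMENTS = [
    ("fünf", "5*"),
    ("tausend", "1000+"),
    ("neun", "9*"),
    ("hundert", "100+"),
    ("dreizehn", "13"),
]


def to_expression(word: str) -> str:
    """Replace each known token with its numeric fragment."""
    for token, fragment in REPLACEMENTS:
        word = word.replace(token, fragment)
    return word


def evaluate_sum_of_products(expression: str) -> int:
    """Evaluate an expression made only of integers, '*' and '+'."""
    return sum(
        prod(int(factor) for factor in term.split("*"))
        for term in expression.split("+")
    )


expression = to_expression("fünftausendneunhundertdreizehn")
print(expression)                            # 5*1000+9*100+13
print(evaluate_sum_of_products(expression))  # 5913
```

A real table would need care with overlapping tokens (for example `drei` inside `dreizehn`) and with units that are not followed by a multiplier, which this sketch does not handle.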
Why can't we just translate all other languages to English and then convert them to numbers? I guess this would reduce the effort. Translation can be done using Googletrans.
Hi @Tejasvinarora0110, sorry for the late answer.
There are multiple reasons to avoid using Google translator:
Languages without delimiters: Japanese, Chinese (Simplified and Traditional), and possibly other East Asian languages don't put any delimiter between number words, e.g. 九千九百九十九 (9999 in Japanese). These have a structure very similar to English, but the lack of a delimiter makes them tougher to parse. German and Dutch also write numbers without a delimiter (up to a certain number).
One approach I have in mind for the delimiter problem is to read the string character by character and, as soon as the characters read so far match one of the known number words, insert a space. After this pre-processing step we can follow the same logic as before. This does increase the complexity to O(string_length^2), which I believe shouldn't be a major issue. (We can use this function only for the languages without delimiters.)
Concrete example
To handle this we first
check f, fü, fün and finally hit fünf = 5
and similarly get negen = 9
and insert a space and then start again from the next character.
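The pre-processing step described above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `insert_delimiters` and the vocabularies are hypothetical, and the sketch takes the first (shortest) match at each position, exactly as the prefix-by-prefix check describes.

```python
from typing import Iterable


def insert_delimiters(text: str, vocabulary: Iterable[str]) -> str:
    """Split `text` into known number words separated by single spaces.

    At each position the prefix is grown one character at a time
    (f, fü, fün, fünf) until it matches a word in `vocabulary`.
    The worst case checks O(len(text) ** 2) prefixes.
    """
    known = set(vocabulary)
    words = []
    start = 0
    while start < len(text):
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            if candidate in known:
                words.append(candidate)
                start = end  # start again from the next character
                break
        else:
            # No prefix starting here is a known number word.
            raise ValueError(
                f"no known number word at position {start} in {text!r}"
            )
    return " ".join(words)


# Hypothetical vocabularies, limited to what these examples need.
GERMAN_WORDS = {"fünf", "tausend", "neun", "hundert"}
JAPANESE_WORDS = {"九", "千", "百", "十"}

print(insert_delimiters("fünftausendneunhundert", GERMAN_WORDS))
# fünf tausend neun hundert
print(insert_delimiters("九千九百九十九", JAPANESE_WORDS))
# 九 千 九 百 九 十 九
```

Because the first match wins, a vocabulary containing both `drei` and `dreizehn` would split `dreizehn` into `drei zehn`. Preferring the longest match, or backtracking on failure, might be needed for a full vocabulary.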