scrapinghub / number-parser

Parse numbers written in natural language
BSD 3-Clause "New" or "Revised" License

Preventing tokenization twice in parser.py when finding the valid language through token #71

Closed dhananjaypai08 closed 2 years ago

dhananjaypai08 commented 2 years ago

This PR optimizes the parse function in number_parser/parser.py.

When language is left at its default of None, parse picks the best language through _valid_tokens_by_language, which tokenizes input_string (via _tokenize) along the way. Once the best language is known, parse tokenizes input_string again, so the same string is tokenized twice even though its tokens were already computed during language detection. This PR reuses those tokens instead. A flag records whether input_string was already tokenized during language detection; when a language is specified, no detection runs, the flag stays unset, and the input is tokenized as before.
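A minimal sketch of the idea, not the actual number_parser code: the word lists, the `tokenize` and `best_language_and_tokens` helpers, and the `tokenize_calls` log are all hypothetical stand-ins made up for illustration. Here a `tokens is None` check plays the role of the flag, and language detection hands back the tokens it already computed.

```python
import re

# Hypothetical per-language number words (stand-in for real language data).
NUMBER_WORDS = {
    "en": {"one", "two", "three", "twenty"},
    "es": {"uno", "dos", "tres", "veinte"},
}

# Records every tokenizer call so the saving is observable.
tokenize_calls = []


def tokenize(input_string, language):
    """Hypothetical tokenizer: lowercase the input and split it into words."""
    tokenize_calls.append(language)
    return re.findall(r"\w+", input_string.lower())


def best_language_and_tokens(input_string):
    """Score each language and return the winner together with its tokens."""
    best = None
    for language, words in NUMBER_WORDS.items():
        tokens = tokenize(input_string, language)
        score = sum(token in words for token in tokens)
        if best is None or score > best[0]:
            best = (score, language, tokens)
    _, language, tokens = best
    return language, tokens


def parse_tokens(input_string, language=None):
    """Return (language, tokens) without tokenizing for the chosen language twice."""
    tokens = None  # stays None until the input has been tokenized
    if language is None:
        # Detection already tokenized the input; keep its result.
        language, tokens = best_language_and_tokens(input_string)
    if tokens is None:
        # Language was given explicitly, so nothing has been tokenized yet.
        tokens = tokenize(input_string, language)
    return language, tokens
```

With no language given, `parse_tokens` calls the tokenizer once per candidate language during detection and never again afterwards. With an explicit language it calls the tokenizer exactly once.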