Simple arithmetic for the words

Allactaga commented 9 years ago

We need to be able to transform token sequences like "seven hundred and sixty-five thousand, four hundred and thirty-two" into "765432". Such tokens may need different handling in different languages (for example, Roman numerals involve subtraction), so for now let's focus only on how English tokens are transformed into numbers. Let's call this approach "general" (later we would define in the languages.yaml file which approach each language should use). The initial idea is to iterate through the list of tokens, skipping tokens that are in skip or that match [\W_]+. Each remaining token should be present in the dictionary (the numbers section of the language).

If the number represented by the current token is less than the previous one, we use addition. If it is greater than several of the nearby preceding numbers, then those smaller numbers describe the bigger one and we use multiplication. Be sure to use multiplication only with those preceding numbers that are 1) less than the current one and 2) directly chained with it.
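A minimal sketch of the rule described above, assuming well-formed English input. The names `NUMBERS`, `SKIP`, `tokenize` and `words_to_number` are made up for illustration and are not part of dateparser; `NUMBERS` stands in for the numbers section of a language file. A stack keeps the partial values: a larger token absorbs (multiplies) the directly preceding smaller values, and everything left over is added at the end.

```python
import re

# Hypothetical stand-in for the "numbers" section of the English language data.
NUMBERS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
    "hundred": 100, "thousand": 1000, "million": 1000000,
}
SKIP = {"and"}  # hypothetical skip list
SEPARATOR = re.compile(r"[\W_]+")


def tokenize(text):
    """Split text into word tokens and separator tokens."""
    return re.findall(r"[^\W_]+|[\W_]+", text.lower())


def words_to_number(tokens):
    """Combine number words using addition and multiplication."""
    stack = []
    for token in tokens:
        # Skip filler words and anything made only of separators.
        if token in SKIP or SEPARATOR.fullmatch(token):
            continue
        if token not in NUMBERS:
            raise ValueError(f"unknown number word: {token!r}")
        value = NUMBERS[token]
        # Collect the directly preceding values that are smaller than
        # the current one; they describe (multiply) the current value.
        smaller = 0
        while stack and stack[-1] < value:
            smaller += stack.pop()
        stack.append(smaller * value if smaller else value)
    # Whatever remains on the stack is combined by addition.
    return sum(stack)


text = "seven hundred and sixty-five thousand, four hundred and thirty-two"
print(words_to_number(tokenize(text)))
```

The sketch does not validate the input, so a sequence that is not a real English numeral (for example "one two") still produces a value.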

This approach should of course be properly tested.

Gallaecio commented 4 years ago

There is great potential for code sharing between dateparser and price-parser here. I’ve recently proposed an English-only approach for price-parser (https://github.com/scrapinghub/price-parser/pull/11).

Time for number-parser? :grin:

Eveneko commented 4 years ago

Hi, I'm interested in this idea. However, I have an NLP course this semester. I'm not sure whether it is a bit late, but I really want to participate, and I believe I will be able to start working on this in the summer.

varunagarwal18 commented 4 years ago

There are many implementations of number parsers on Stack Overflow. There is also a Python library called word2number.

Gallaecio commented 4 years ago

There are many implementations for English. But the end goal is to support different locales. And in unambiguous cases, without the knowledge of the locale of the input text. That can be hard.

ShantanuDube commented 4 years ago

> There are many implementations for English. But the end goal is to support different locales. And in unambiguous cases, without the knowledge of the locale of the input text. That can be hard.

So what should be the main aim of the project: to solve this issue with respect to natural language processing in English, or in all the different locales?

noviluni commented 4 years ago

Hi @ShantanuDube @varunagarwal18 @Eveneko !

The idea is to support every supported language. However, if you check the code, most things are first translated to English and then processed to get the date. So this could be done as "X language" --> "English" --> "numbers". The first natural step would be "English" --> "numbers", but we also need to develop a "framework" to easily add support for the other languages.

On the other hand, there are some open PRs trying to address this issue, and we even have some natural numbers directly included in the main code ("one", "two"...). Feel free to investigate it and open issues or draft PRs with ideas. Don't be afraid to code! :smile:
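As a rough illustration of the "X language" --> "English" --> "numbers" idea, here is a small sketch. Everything in it is hypothetical: the `ES_TO_EN` table, the function names and the tiny English dictionary are invented for this example and do not reflect how dateparser stores its language data. The point is only that a new language needs a translation table, while the arithmetic stays in the English step.

```python
import re

# Tiny English dictionary, just enough for the example.
EN_NUMBERS = {
    "one": 1, "two": 2, "three": 3, "five": 5, "forty": 40,
    "hundred": 100, "thousand": 1000,
}
EN_SKIP = {"and"}

# Hypothetical Spanish-to-English table; one source word may map
# to several English words ("trescientos" -> "three hundred").
ES_TO_EN = {
    "dos": "two", "tres": "three", "cinco": "five",
    "cuarenta": "forty", "trescientos": "three hundred",
    "mil": "thousand", "y": "and",
}


def translate_to_english(text, table):
    """Replace each known word with its English equivalent."""
    words = re.findall(r"[^\W_]+", text.lower())
    return " ".join(table.get(word, word) for word in words)


def english_to_number(text):
    """English words to an integer; raises KeyError on unknown words."""
    stack = []
    for word in re.findall(r"[^\W_]+", text.lower()):
        if word in EN_SKIP:
            continue
        value = EN_NUMBERS[word]
        # Smaller values directly before a larger one multiply it.
        smaller = 0
        while stack and stack[-1] < value:
            smaller += stack.pop()
        stack.append(smaller * value if smaller else value)
    return sum(stack)


def parse_number(text, table=None):
    """Translate to English first when a table is given, then parse."""
    english = translate_to_english(text, table) if table else text
    return english_to_number(english)


print(parse_number("dos mil trescientos cuarenta y cinco", ES_TO_EN))
print(parse_number("two thousand three hundred and forty-five"))
```

A word-for-word table like this will not cover every language, so the "framework" part would still need a way to plug in language-specific handling.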

aditya-hari commented 4 years ago

Why can't we use an existing library, like say https://github.com/jduff/numerizer?

Gallaecio commented 4 years ago

Using an existing library is not out of the question, provided that it can be used to achieve the desired goal. Internationalization may be an issue, so that’s something to account for when looking for existing libraries.

They should also be Python libraries or have Python bindings; Ruby libraries are probably not a good fit :stuck_out_tongue:

aditya-hari commented 4 years ago

Okay, I could have sworn that I linked a Python library. A Ruby library is not ideal for a Python library, yes, I tend to agree. Sorry!

heraclex12 commented 4 years ago

I think we just need to use regex to resolve this. You can see this example: https://github.com/facebook/duckling/blob/master/Duckling/Numeral/EN/Rules.hs
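To give an idea of what a regex-based step could look like in Python, here is a small sketch that only locates runs of English number words in free text. It is not a port of the Duckling rules linked above; the word list and the names `NUMBER_SPAN` and `find_number_spans` are assumptions made for this example.

```python
import re

# Hypothetical word list; a real one would come from the language data.
NUMBER_WORDS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty",
    "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
    "hundred", "thousand", "million",
]

# Longest words first, so "seventeen" is tried before "seven".
_word = "(?:" + "|".join(
    sorted(map(re.escape, NUMBER_WORDS), key=len, reverse=True)
) + ")"

# A number word, followed by more number words joined by whitespace,
# hyphens, or the connector "and".
NUMBER_SPAN = re.compile(
    rf"\b{_word}(?:(?:\s+and\s+|[\s-]+){_word})*\b",
    re.IGNORECASE,
)


def find_number_spans(text):
    """Return every run of number words found in the text."""
    return [match.group(0) for match in NUMBER_SPAN.finditer(text)]


print(find_number_spans("I paid twenty-five dollars and three cents"))
```

The spans found this way would still need to be converted to digits by a separate step.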

Teut2711 commented 4 years ago

This problem can be solved with LSTMs. If we can parse the date into one format from bizarre text, then with the help of various parsing libraries we can parse dates in any format. But we will need data with one column containing all the dates (in English or some other language) and another containing the target date. The language variation will make the model tough to train, but I think it will work if we have sufficient data. A major problem might be languages like Chinese or Japanese, which are totally different from English in the way they are written. It doesn't seem that parsing can be the right solution when someone wants to write 3 jan 1978 and someone else 3 January '78, and all sorts of different shortcuts can exist in different languages.

asadurski commented 4 years ago

@heraclex12 - True, as this is mostly what is done with the dates, but... see answer https://github.com/scrapinghub/dateparser/issues/46#issuecomment-596978995.

asadurski commented 4 years ago

@XtremeGood - I don't think this is a viable solution. I mean, yes, I believe it would generally work, but:

- we would not get the performance we need, and we need it to be really fast,
- it wouldn't run on just any hardware (imagine running this in a Flask app on a tiny server),
- the size of the library, together with the libraries required to run it, would be enormous.

So it's a good approach, just for a separate library.

Teut2711 commented 4 years ago

I think regex is also slow, and so is Python, for that matter.

Teut2711 commented 4 years ago

What we can do is use 1D convolutional neural networks in place of RNNs. I have heard of this approach. Those are even used on mobile devices.

Teut2711 commented 4 years ago

Or use this: https://spacy.io/
