scrapinghub / number-parser

Parse numbers written in natural language
BSD 3-Clause "New" or "Revised" License

Add support for other numeral systems. #18

Open noviluni opened 4 years ago

noviluni commented 4 years ago

At this moment, the main number-parser goal is to return the number equivalences from different languages, but only when those words are representing the number using the "decimal numeral system" (https://en.wikipedia.org/wiki/Decimal).

However, there are some numeral systems that don't rely on the decimal numeral system and use other structures. That's the case of the Roman Numeral System (https://en.wikipedia.org/wiki/Numeral_system) or the Chinese/Japanese/Korean/Vietnamese Numeral System (https://en.wikipedia.org/wiki/Chinese_numerals and https://en.wikipedia.org/wiki/Suzhou_numerals).

We could probably add support for them in a future version, as they will probably need another kind of parser.

For more on this, you can also check this: https://en.wikipedia.org/wiki/Numeral_system

noviluni commented 3 years ago

Here ("Using neither" section): https://en.wikipedia.org/wiki/Long_and_short_scales#Using_neither there is a list of different numeral systems.

noviluni commented 3 years ago

BTW, this resource has a lot of useful information: https://www.languagesandnumbers.com/site-map/en/

AmPhIbIaN26 commented 3 years ago

@noviluni I asked Adrian about this issue, and he advised me that I could add features to number-parser. Would you recommend taking this issue for my GSoC 2021 proposal?

AmPhIbIaN26 commented 3 years ago

@noviluni I have decided to take this for my GSoC proposal. I understand how Roman numerals are written with the English alphabet and how one would implement parsing them. I wanted to ask about the Chinese/Japanese/Korean/Vietnamese numeral system. What I think is that it would parse the number which are input in symbols form the respective languages, or do I do it with the help of Unicode?

Gallaecio commented 3 years ago

> What I think is that it would parse the number which are input in symbols form the respective languages, or do I do it with the help of Unicode?

I did not understand this question. To me “symbols form” and “Unicode” are basically the same here, the input would be the symbols as a Python string.

AmPhIbIaN26 commented 3 years ago

What I meant to ask was whether I should be using "零" or "U+96F6", because I tried some test code in Python and it printed "零" as is. I assumed that anyone using this feature, taking input from a user or from another application that uses number-parser, would pass it in "零" form, so that it would be handled directly instead of being converted to Unicode first.

noviluni commented 3 years ago

Hi @AmPhIbIaN26

Thanks for showing interest in fixing this. Yes, I think this could be feasible for a GSoC proposal.

I'm not sure about your last question. In Python 3, strings are Unicode by default, so you don't need to handle it. 零 is Unicode in the same way as "U+96F6". I don't think you need to convert anything.
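
To illustrate the point, here is a short check in plain Python 3 (it does not use number-parser at all) showing that the literal character and its escaped code point are one and the same string:

```python
literal = "零"
escaped = "\u96f6"  # the same character, written as a code point escape

print(literal == escaped)   # True
print(hex(ord(literal)))    # 0x96f6
print(len(literal))         # 1
```

So a function that receives "零" already has the Unicode character; the "U+96F6" notation is only a way of naming it.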

AmPhIbIaN26 commented 3 years ago

Ohh ok thanks, I'll work on it

AmPhIbIaN26 commented 3 years ago

@noviluni I am doing a bit of research on this topic first, so do you only want conversion of numbers to int or a readable string. Or maybe could you give me examples of what kinds of input do you want it to take and what should it return.

AmPhIbIaN26 commented 3 years ago

i looked up on how to parse roman numerals, it can be added to the current parser, but for the case of these other numerals for Chinese Japanese and other languages should I create a new parser??

Gallaecio commented 3 years ago

> do you only want conversion of numbers to int or a readable string.

If you mean whether the goal is to support new numerical systems in parse_number or parse, I would say that ideally both. And also parse_ordinal and parse_fraction if relevant.

That said, maybe it makes sense to prioritize parse, the one that replaces the numbers in-place in the strings, as I imagine that is the function most likely to be used by the likes of Dateparser and price-parser.

> maybe could you give me examples of what kinds of input do you want it to take and what should it return.

It’s hard to give specific examples without some knowledge of those other numeral systems, but for Roman numbers I imagine something like:

>>> parse('Built in MDCCLXXVI')
'Built in 1776'
>>> parse_number('MDCCLXXVI')
1776

> i looked up on how to parse roman numerals, it can be added to the current parser, but for the case of these other numerals for Chinese Japanese and other languages should I create a new parser??

I’m not very familiar with the internals of number-parser, but I would aim for it to be possible for users to limit which numeral systems are considered by number-parser in a given call. For example, allow users to use number-parser functions limiting them to decimal numbers, Roman numbers, or any combination of numeral systems.

So I would say that, ideally, the parsers should be as independent as possible.

AmPhIbIaN26 commented 3 years ago

So I add roman to parser and then make a new parser for the other languages, right?

Gallaecio commented 3 years ago

> So I add roman to parser and then make a new parser for the other languages, right?

That’s the opposite of what I meant, but I’m also starting to think that I may be misunderstanding what you mean here by “parser”.

What would each of the options (extending the existing parser with Roman number support vs. adding a separate parser for Roman number support) look like for users, API-wise? Are you talking about creating separate user-level functions for other numeral systems, as opposed to having the existing functions like parse and parse_number support them?

AmPhIbIaN26 commented 3 years ago

Parsing Roman numerals is not a big or complex task and can be added directly to parser.py. What I meant by a different parser for other languages was creating a new file for it. Since I am new to the whole concept of parsers, and to making a Python library in general, I might be asking the wrong question.

I am not confused about how to integrate it; that part is done, and I will make a pull request for it soon.

What I am confused about is how I would integrate other languages. Do you want it to be

>>> parse('百四十五')
'145'

or do you want it to be like

>>> parse_japanese('百四十五')
'145'

So I can either add a way to detect the language and then parse accordingly, or let the user define which language they are parsing.
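
For reference, here is a minimal sketch of how a string such as 百四十五 maps to 145. It is only an illustration: `parse_cjk_number`, `DIGITS` and `MULTIPLIERS` are hypothetical names, not number-parser API, and it assumes values below 10,000 written with the digits 一 to 九 and the multipliers 十, 百 and 千. It does not handle 零, 万 or larger units, and it does not check that multipliers appear in descending order.

```python
# Hypothetical helper, not part of number-parser.
DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}
MULTIPLIERS = {"十": 10, "百": 100, "千": 1000}


def parse_cjk_number(text):
    """Return the integer for a numeral such as 百四十五, or None."""
    if not text:
        return None
    total = 0
    pending = 0  # digit waiting for its multiplier
    for char in text:
        if char in DIGITS:
            if pending:
                return None  # two digits in a row are not supported here
            pending = DIGITS[char]
        elif char in MULTIPLIERS:
            # A bare multiplier (百 rather than 一百) counts once.
            total += (pending or 1) * MULTIPLIERS[char]
            pending = 0
        else:
            return None  # unsupported character
    return total + pending


print(parse_cjk_number("百四十五"))    # 145
print(parse_cjk_number("二千二十一"))  # 2021
```

The structure is multiply-then-add rather than positional, which is why it may be easier to keep this logic separate from the decimal word parser.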

I asked about creating a new parser because of this comment by @noviluni:

> We could probably add support for them in a future version, as they will probably need another kind of parser.

Gallaecio commented 3 years ago

I personally have no strong feeling either way on how to distribute the new code in the code base. Having separate files for the code that supports each numeral system (what we call parsers) would make sense to me, but I would not worry too much about that.

As for integration of other languages, my suggestion would be to aim for reuse of the existing functions, which is best when you don’t know which numeral system the input uses. So, parse('百四十五'). There’s currently a language parameter, so I’m thinking that, when the parameter is other than None, we could disable irrelevant numeral system parsers to improve performance; for example, if language='en' is passed, we could limit the parsing to decimal and Roman numbers. In addition to that, we could add a parameter to allow enabling only specific parsers.
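
A rough sketch of what that could look like, purely as a design illustration: every name here (`parse_number_sketch`, `numeral_systems`, `PARSERS`, `SYSTEMS_BY_LANGUAGE` and the two stand-in parsers) is hypothetical and does not exist in number-parser.

```python
# Hypothetical design sketch; none of these names exist in number-parser.

def _parse_decimal_digits(token):
    """Stand-in parser: accept plain ASCII digits only."""
    return int(token) if token.isascii() and token.isdigit() else None


def _parse_roman_stub(token):
    """Stand-in parser: knows a handful of Roman numerals only."""
    return {"I": 1, "V": 5, "X": 10, "XVIII": 18}.get(token)


PARSERS = {"decimal": _parse_decimal_digits, "roman": _parse_roman_stub}

# Which numeral systems are worth trying for a given language code.
SYSTEMS_BY_LANGUAGE = {"en": ("decimal", "roman")}


def parse_number_sketch(token, language=None, numeral_systems=None):
    """Try each enabled numeral-system parser in turn."""
    if numeral_systems is None:
        # No explicit choice: fall back to the language, then to everything.
        numeral_systems = SYSTEMS_BY_LANGUAGE.get(language, tuple(PARSERS))
    for name in numeral_systems:
        value = PARSERS[name](token)
        if value is not None:
            return value
    return None


print(parse_number_sketch("XVIII"))                                # 18
print(parse_number_sketch("XVIII", numeral_systems=("decimal",)))  # None
```

Keeping each parser behind the same small interface would let the language parameter and an explicit parser list share one code path.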

AmPhIbIaN26 commented 3 years ago

I'll look into it, will make a pull request for roman numeral by tomorrow maybe and was thinking of taking other numerals in my proposal, how does that sound?

Gallaecio commented 3 years ago

> I'll look into it, will make a pull request for roman numeral by tomorrow maybe and was thinking of taking other numerals in my proposal, how does that sound?

Sounds great!

AmPhIbIaN26 commented 3 years ago

Thanks a lot, I am excited to work on this!!

AmPhIbIaN26 commented 3 years ago

I worked on a way for parse_number where you have to set language='rom'. I did this because, unlike other languages where you can add the ones, tens and hundreds to get the number, Roman numerals work differently. I had to put a dictionary in the parse_number function because I couldn't find a way to add these to lang_data. I will still try to make it work with language=None.

So the method I used to parse roman is:

>>> parse("XVIII", language='rom')
'18'

As for the parse function, I was thinking that, even though you have to specify language='rom', you would get:

>>> parse('Built in MDCCLXXVI', language='rom')
'Built in 1776'

Gallaecio commented 3 years ago

It sounds good as a temporary workaround, but we should not treat this as a language, since “rom” is not an ISO language code. There’s Latin (la), but I don’t think this parser should be limited to that language, or even do anything special for that language.

I take it you went the “rom” route for now due to implementation limitations. But to discuss how to best address those, and how to refactor the code to not require “rom”, it may be best for you to create a pull request with your work so far, so that we can discuss this over actual code.

AmPhIbIaN26 commented 3 years ago

OK, I'll look for a workaround.

AmPhIbIaN26 commented 3 years ago

Hey @Gallaecio hope you're doing well! I have implemented parsing roman numerals, and have made a pull request for it. So the parse_number() now works like this:

>>> parse_number('MMCDXX')
'2420'

For parse(), this works:

>>> parse('Built in MDCCLXXVI', language='rom')
'Built in 1776'

But this doesn't:

>>> parse('Built in MDCCLXXVI')
'Built in 1776'

I have to make changes to the _valid_tokens_by_language(input_string) function so that it recognizes Roman numerals. It also has to handle 'i' or 'I': since that is both a pronoun and 1 in Roman numerals, it currently changes it to 1. I will work on these once I submit my proposal.

For now I will be working on my proposal and will submit a draft by tomorrow. I would be obliged if you could go through it.

noviluni commented 3 years ago

Hi @AmPhIbIaN26 and @Gallaecio!

Thank you both for going through this interesting conversation.

I checked the PR and you did a really good job looking at the code and understanding how it works.

So now that you have some practical knowledge, let's see why the chosen approach (reusing the existing parser) doesn't work.

The "parser" you are using is for decimal numbers. That means that numbers are built in the following way:

  1. We have "unit numbers" (from 1 to 9)
  2. We have numbers from 10 that are usually built like "ten" + "unit number"
  3. We have numbers from 20 that are usually built like "number + ten" + "unit number"
  4. We have numbers from 100 that are usually built like "hundred" + "number + ten" + "unit number"

[This is easier to see in higher numbers like thousands, hundreds, etc., because for the small numbers every language has evolved differently and doesn't follow the rules. That's the reason why we have the "DIRECT_NUMBERS".]

The Roman Numeral System doesn't work like that. It has a limited set of symbols (I, V, X, L, C, D, M) and the numbers are written by adding and subtracting:

I --> 1
II --> 1 + 1 --> 2
III --> 1 + 1 + 1 --> 3
IV --> 5 - 1 --> 4 ("one less than five")
V --> 5
VI --> 5 + 1 --> 6
VII --> 5 + 2 --> 7
VIII --> 5 + 3 --> 8
IX --> 10 - 1 --> 9 ("one less than ten")
X --> 10
XIV --> 10 + 5 - 1 --> 14
XX --> 10 + 10 --> 20
CDXLIV --> 500 - 100 + 50 - 10 + 5 - 1 = 444

The rule is basically that you shouldn't repeat the same symbol more than three times.
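
As a minimal sketch of the add-and-subtract logic above: the names `roman_to_int`, `VALUES` and `VALID_ROMAN` are illustrative assumptions, not number-parser API, and the regular expression is one common way to accept only canonical numerals from I (1) to MMMCMXCIX (3999).

```python
import re

VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# Canonical numerals only: no symbol repeated more than three times,
# and only the subtractive pairs IV, IX, XL, XC, CD and CM.
VALID_ROMAN = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def roman_to_int(numeral):
    """Return the value of a Roman numeral, or None if it is malformed."""
    numeral = numeral.upper()
    if not numeral or not VALID_ROMAN.fullmatch(numeral):
        return None
    total = 0
    # Pair each symbol with the one after it (a space pads the last one).
    for current, following in zip(numeral, numeral[1:] + " "):
        value = VALUES[current]
        # A smaller symbol before a larger one is subtracted (IV, XC, CM).
        if VALUES.get(following, 0) > value:
            total -= value
        else:
            total += value
    return total


print(roman_to_int("CDXLIV"))     # 444
print(roman_to_int("MDCCLXXVI"))  # 1776
print(roman_to_int("IIII"))       # None
```

Validating first and summing afterwards keeps the arithmetic loop simple, since it can then assume the input is well formed.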

In the PR you submitted you have been reusing the existing structures for Decimal systems:

"UNIT_NUMBERS": {
        "i": 1,
        "ii": 2,
        "iii": 3,
        "iv": 4,
        "v": 5,
        "vi": 6,
        "vii": 7,
        "viii": 8,
        "ix": 9
}

And of course, they don't fit well.

So if you want to continue with this, I would suggest that you:

  1. Submit your GSoC proposal (don't forget to do it!)
  2. Open a new, different branch and start building a brand new function. Let's call it "parse_roman()". Don't worry about integrating it in the existing parse() or parse_number() functions. It should be able to perform the logic I explained above (and parse numbers from I (1) to MMMCMXCIX (3999)). We will iterate over it until integrating it to the main code :slightly_smiling_face:.

AmPhIbIaN26 commented 3 years ago

Thanks a lot @noviluni for this review of my work. I will look into it. I have started working on my proposal and it is almost complete. I was thinking that alongside this I could also work on this issue. I have done a fair bit of research on the Suzhou numeral system; it needs a substantial amount of work, but having more issues/ideas in my proposal will give more weight to my work.

As of now I have come up with this on my research on the Suzhou numeral system.

If you could take a look at it then it would be great.

noviluni commented 3 years ago

Hi @AmPhIbIaN26, you can definitely work on issue #41. However, the scope of that issue is really big, so I would focus on only one of the issues mentioned, like trying to fix the German issue (revert numbers) or the French issue ("quatre-vingt").

For Suzhou, I'm not an expert, but the research looks good to me. Probably a good starting point would be writing some tests with that data and trying to develop the parsing function by doing some TDD. I would focus only on one variation (maybe Chinese) and then continue with the other variants (Japanese, Korean, Vietnamese).

In case we need it, I know people from France, Germany, and China, so I could probably ask them to review and provide some feedback if we don't know how to continue or if we have doubts.

AmPhIbIaN26 commented 3 years ago

Thanks @noviluni for the support. I will take up the German issue (revert numbers) alongside the Roman and Suzhou numerals.

AmPhIbIaN26 commented 3 years ago

Hi @noviluni and @Gallaecio , I would be obliged if you could take a look at my draft proposal and suggest changes. Looking forward to hearing from you. Thank you.

Gallaecio commented 3 years ago

@AmPhIbIaN26 The technical parts of the proposal look good to me, and as @noviluni said we can already see that you’ve gotten familiar with the code base.

But the timeline looks wrong: Google Summer of Code 2021 will only last 10 weeks. Please update your proposal with a new timeline. I haven’t had a detailed look at your timeline because of this issue, but do remember that you do not need to fit all your goals into the timeline; it’s better to be pessimistic with the time estimations and set some stretch goals in case you work faster than estimated, than to reach the mid-project evaluation behind schedule.

And just in case, remember that the deadline is April 13, 2021 20:00 (Central European Summer Time), in less than 24h, so try to fix and submit the application as soon as possible.

AmPhIbIaN26 commented 3 years ago

Thanks for the suggestion. I have changed the timeline and also made changes to the deliverables. I have also added a precedence study. I have submitted my final proposal.

AmPhIbIaN26 commented 3 years ago

@noviluni I will now work on creating the new parse_roman() function. I will make a pull request once I'm done.

AmPhIbIaN26 commented 3 years ago

Hi @noviluni and @Gallaecio hope you both are doing well and are safe. I have implemented parse_roman() function in the parser. It can now parse:

>>> parse_roman('CDXX')
'420'

>>> parse_roman('Built in MMLXXVII.')
'Built in 2077.'

I have made a pull request for it. I have also added test cases to it.