scrapinghub / number-parser

Parse numbers written in natural language
BSD 3-Clause "New" or "Revised" License
104 stars 23 forks source link

I have implemented parse_roman() function #64

Open AmPhIbIaN26 opened 3 years ago

AmPhIbIaN26 commented 3 years ago

I have implemented the parse_roman() function.

codecov[bot] commented 3 years ago

Codecov Report

Merging #64 (086f5ec) into master (c834854) will decrease coverage by 0.65%. The diff coverage is 95.16%.

@@            Coverage Diff             @@
##           master      #64      +/-   ##
==========================================
- Coverage   98.78%   98.12%   -0.66%     
==========================================
  Files          86       86              
  Lines         328      374      +46     
  Branches       60       78      +18     
==========================================
+ Hits          324      367      +43     
  Misses          1        1              
- Partials        3        6       +3     
Impacted Files Coverage Δ
number_parser/parser.py 97.57% <95.08%> (-0.78%) :arrow_down:
number_parser/__init__.py 100.00% <100.00%> (ø)
Gallaecio commented 3 years ago

Now I think we need to make parse and parse_number use this function as well.

I’m thinking that we may also want to include a new parameter to those functions, numeral_systems, which allows to limit the numeral systems to support while parsing. We could make it so that by default all numeral systems are used, but you can use numeral_systems=['decimal'] to limit parsing to decimal numbers, or numeral_systems=['roman'] to limit it to roman numbers.

For cases where users want to exclude a system, rather than include it, it may make sense to expose a public variable that contains the list of all supported numeral systems, e.g. number_parser.NUMERAL_SYSTEMS, so that users can create an exclusion-based subset (e.g. [system for system in NUMERAL_SYSTEMS if system != 'roman']).

Once these changes are done, I think we no longer need parse_roman itself, and it could be renamed as _parse_roman to discourage people from using it.

I’m also thinking that it may be possible to have slightly different implementations of parse_roman for each of the user-facing number-parser functions, for performance-tuning. But I haven’t look into it in detail, and what’s most important is that things work as intended, we can worry about performance later.

AmPhIbIaN26 commented 3 years ago

I still didn't understand how the user would use this [system for system in NUMERAL_SYSTEMS if system != 'roman']

Let’s say we introduce a numeral_systems parameter to parse_number so that users may use:

parse_number('V', numeral_systems=['roman'])

If we also create a NUMERAL_SYSTEMS with all supported numeral systems, users that want to exclude a numeral system can do:

all_numeral_systems_but_roman = [system for system in NUMERAL_SYSTEMS if system != 'roman']
parse_number('V', numeral_systems=all_numeral_systems_but_roman)

If in the future we add support for additional number systems, that code would still work as intended. Whereas if users had to hardcode the list of systems, new systems added later would also be excluded.

AmPhIbIaN26 commented 3 years ago

Hi @Gallaecio hope you and your family are safe.

I added the _valid_input_by_numeral_system() (in this fork) still have to improve this so it is easier to add another numeral system. I will work on that now and also improving the performance of _parse_roman().

AmPhIbIaN26 commented 3 years ago

I have added numeral_systems as a parameter to parse(), the current system works like this

all_numeral_systems_but_roman = [system for system in NUMERAL_SYSTEMS if system != 'roman']
all_numeral_systems_but_decimal = [system for system in NUMERAL_SYSTEMS if system != 'decimal']

>>>parse('Built in MMLXXVII.')
'Built in 2077.'

>>>parse( 'Built in MMLXXVII.', ['decimal'])
'Built in MMLXXVII.'

>>>parse('I was given two IV injections.', all_numeral_systems_but_roman)
'I was given 2 IV injections.'

>>>parse('I was given two IV injections.', all_numeral_systems_but_decimal)
1 was given two 4 injections.'

>>>parse('I was given two IV injections.')
'1 was given 2 4 injections.'

>>>parse('I have three apples.', all_numeral_systems_but_roman)
'I have 3 apples.'

I have added all these examples as test cases. I wanted to ask that since parse_number() only takes in 1 input how will having numeral_systems as a parameter help?

Gallaecio commented 3 years ago

I have added all these examples as test cases. I wanted to ask that since parse_number() only takes in 1 input how will having numeral_systems as a parameter help?

It could allow users to fine-tune for performance if they know beforehand the numeral system of the input number, by making number-parser only use the desired number parser. Also, if they want parsing to simply fail for something like I, instead of returning 1.

AmPhIbIaN26 commented 3 years ago

Thanks for looking into these, ill make the changes

AmPhIbIaN26 commented 3 years ago

@Gallaecio hope you and your family are safe. I have made all the changes you suggested also added support for incorrect roman numbers.

AmPhIbIaN26 commented 3 years ago

Hi @Gallaecio hope you and your family are safe.

Any thoughts on this?