scrapinghub / number-parser

Parse numbers written in natural language
BSD 3-Clause "New" or "Revised" License
109 stars 23 forks

Add language-specific tests #15

Closed noviluni closed 4 years ago

noviluni commented 4 years ago

We should add tests for the "officially" supported languages, which at this moment are Hindi, Spanish, and Russian (English already has tests).

noviluni commented 4 years ago

Hi @arnavkapoor!

Here is a list of 20 numbers in Spanish that should be included in the language tests. I don't expect all of them to work in the first version, especially some of the single-word examples, and we will probably need to add those to the supplementary_data file, but don't worry about this.

The tests could be written with @pytest.mark.parametrize. Don't worry about the cases that fail right now; just mark them with xfail (https://docs.pytest.org/en/latest/skipping.html#skip-xfail-with-parametrize).

1_432_524,"un millón cuatrocientos treinta y dos mil quinientos veinticuatro"
302,"trescientos dos"
5_000_320_000_000,"cinco billones trescientos veinte millones"
3_023_001_432,"tres mil veintitrés millones mil cuatrocientos treinta y dos"
101,"ciento uno"
5_764_607_500_000_000_031,"cinco trillones setecientos sesenta y cuatro mil seiscientos siete billones quinientos mil millones treinta y uno"
31,"treinta y una"
26,"veintiséis"
424,"cuatrocientos veinticuatro"
1_000_000_000,"mil millones"
1_000_000_000,"millardo"
342,"trescientas cuarenta y dos"
3_000_000_024,"tres mil millones veinticuatro"
10**24,"cuatrillón"
256,"Doscientos cincuenta y seis"
666,"seiscientos sesenta y seis"
2_147_483_647,"Dos mil ciento cuarenta y siete millones cuatrocientos ochenta y tres mil seiscientos cuarenta y siete"
10**100,"Gúgol"
10**600,"centillón"
100_000,"Cien mil"
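
A minimal sketch of how a few of the cases above could be parametrized with xfail. Two things here are assumptions: that `parse_number` accepts a `language` keyword, and that "millardo" is one of the currently failing cases (the mark is only illustrative; which cases really fail has to be checked by running the tests).

```python
import pytest

# A few of the Spanish cases from the list above.
# Assumption: "millardo" is marked xfail purely as an illustration.
SPANISH_CASES = [
    pytest.param(302, "trescientos dos"),
    pytest.param(101, "ciento uno"),
    pytest.param(1_000_000_000, "mil millones"),
    pytest.param(
        1_000_000_000,
        "millardo",
        marks=pytest.mark.xfail(reason="single word, may need supplementary data"),
    ),
]


@pytest.mark.parametrize("expected,text", SPANISH_CASES)
def test_parse_number_spanish(expected, text):
    # Imported inside the test so the module still loads without number_parser;
    # the test is skipped when the library is not installed.
    number_parser = pytest.importorskip("number_parser")
    # Assumption: parse_number takes a `language` keyword argument.
    assert number_parser.parse_number(text, language="es") == expected
```

With xfail, a case that starts passing later shows up as XPASS, so the mark can be removed once the parser handles it.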

Let me know if you have any questions.

P.S.: I put underscores in the numeric literals (PEP 515) to improve readability. This could be an issue if we intend to support Python 3.5, as underscores are only allowed from Python 3.6 onward. However, support for Python 3.5 will end in September and I expect to drop it in dateparser, so we could avoid supporting Python 3.5 here. I'm not sure about this, but I think we can use underscores and, if we finally decide to support Python 3.5, change them then.

noviluni commented 4 years ago

I’ve been thinking about how we could test our library extensively against a language, so that we can state that the library officially supports it.

My first ideas were:

Doing the first is easy, but the second is harder. We can’t test all numbers, so we need to select a set of different numbers to check. After some time trying different, crazy things, I came up with this list:

1234
23451
345612
4567123
56781234
678912345
7890123456
89091234567
909812345678
987123456789
98761234567890
876512345678909
7654123456789098
65431234567890987
543212345678909876
4321123456789098765
32101234567890987654
210012345678909876543
1000123456789098765432
1234567890987654321
12345678909876543210
123456789098765432100
1234567890987654321000

It’s created by appending the next digit to the end and then moving the first digit to the end (and then doing the same but with the digit before).

It’s not perfect, but I think that it’s diverse enough to be able to say that if the parser works for these cases, it will probably work for the most common existing combinations.
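
The construction can be reproduced with a short script. This is a reconstruction that happens to match the list above, not necessarily the script that was actually used: it slides a four-digit window over the digit string 1234567890987654321000, appends everything that precedes the window, and then finishes with the growing prefixes. The names `SEQUENCE`, `WINDOW` and `build_test_numbers` are made up for illustration.

```python
SEQUENCE = "1234567890987654321000"
WINDOW = 4


def build_test_numbers(sequence=SEQUENCE, window=WINDOW):
    """Return the test numbers as a list of ints."""
    numbers = []
    last_start = len(sequence) - window
    # Rotated part: the window first, then the digits that precede it.
    for start in range(last_start + 1):
        rotated = sequence[start:start + window] + sequence[:start]
        # int() drops a leading zero, which is why 0987... appears as 987...
        numbers.append(int(rotated))
    # Tail part: plain prefixes that grow until the whole sequence is used.
    for end in range(last_start + 1, len(sequence) + 1):
        numbers.append(int(sequence[:end]))
    return numbers


if __name__ == "__main__":
    for number in build_test_numbers():
        print(number)
```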

What do you think, guys? @arnavkapoor @lopuhin @kishan3

I built some spiders to scrape different websites to get these numbers in words, and added the datasets here: https://github.com/noviluni/numbers-data

Sources:

Unfortunately, they don't support Hindi, but I can look for a website that supports Hindi and create a new spider if necessary.

If you like this idea, we can use these CSVs directly in the tests (don't worry about this, @arnavkapoor, I could show you how I would do it).
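
As a rough sketch of the idea, a small loader could turn a CSV into `(expected, text)` pairs for the tests. The column layout is an assumption (number first, then the spelled-out form, no header row); the real files in the datasets repository may be organized differently. `load_cases` is a hypothetical helper name.

```python
import csv
import io


def load_cases(csv_file):
    """Yield (expected_number, text) pairs from an open CSV file.

    Assumes two columns without a header: the number, then its words.
    """
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        number, text = row[0], row[1]
        yield int(number), text.strip()


# Stand-in for an open dataset file, using two rows from the Spanish list
# earlier in this thread.
sample = io.StringIO('302,"trescientos dos"\n101,"ciento uno"\n')
cases = list(load_cases(sample))
```

The resulting list could be passed straight to `@pytest.mark.parametrize`.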

If you have another idea for the input numbers, I can generate a dataset for many locales with the numbers you want.

Let me know what you think, or if you have any other idea or approach. 🙂

arnavkapoor commented 4 years ago

@noviluni this is a great idea for the library; it can act as a basic threshold for saying that we support a language (and the language-specific edge cases you mentioned in the third point can be added later). One caveat is that this can only be used to test the parse_number function, not the parse function. However, since both depend heavily on the same common _build_number, most of the cases should be covered.