nim-lang / Nim

Nim is a statically typed compiled systems programming language. It combines successful concepts from mature languages like Python, Ada and Modula. Its design focuses on efficiency, expressiveness, and elegance (in that order of priority).
https://nim-lang.org
Other
16.24k stars 1.46k forks source link

Add isAsciiDigit to Rune #23465

Open johnnovak opened 3 months ago

johnnovak commented 3 months ago

Summary

Some util functions that are defined for chars are missing for Runes.

isAsciiDigit is one of them.

Description

func isAsciiDigit*(r: Rune): bool =
  ord(r) >= ord('0') and ord(r) <= ord('9')

Alternatives

No response

Examples

No response

Backwards Compatibility

No response

Links

No response

ghost commented 3 months ago

I think that such a function might be a bit too misleading for unicode, a proper Unicode implementation will be harder to do, but is already done in e.g. https://github.com/nitely/nim-unicodedb:

assert utmDigit in Rune(0x00B2).unicodeTypes()

https://github.com/nitely/nim-unicodedb?tab=readme-ov-file#types

johnnovak commented 3 months ago

I think that such a function might be a bit too misleading for unicode, a proper Unicode implementation will be harder to do, but is already done in e.g. https://github.com/nitely/nim-unicodedb:

assert utmDigit in Rune(0x00B2).unicodeTypes()

https://github.com/nitely/nim-unicodedb?tab=readme-ov-file#types

Oh right. Wasn't aware of these extra Unicode numeric runes.

johnnovak commented 3 months ago

Ok, so I've looked into it, and I still think my implementation is fine for 99.999% use cases. Sure, there are all sorts of other interesting Unicode digit code points, but how often do people use these... extremely rarely, I'd say.

In real life, the 0 to 9 ASCII chars are enough.

https://www.fileformat.info/info/unicode/category/Nd/list.htm

Alogani commented 3 months ago

Hello,

Maybe this implementation ?

const allZeros = [48, 1632, 1776, 1984, 2406, 2534, 2662, 2790, 2918, 3046, 3174, 3302, 3430, 3558, 3664, 3792, 3872, 4160, 4240, 6112, 6160, 6470, 6608, 6784, 6800, 6992, 7088, 7232, 7248, 42528, 43216, 43264, 43472, 43504, 43600, 44016, 65296, 4170, 4307, 4358, 4367, 4371, 4381, 4399, 4421, 4429, 4453, 4460, 4467, 4494, 4501, 4549, 4565, 4570, 4597, 5798, 5804, 5813, 7548, 7549, 7550, 7550, 7551, 7700, 7727, 7759, 7829, 8127]
# List of zero converted from https://www.fileformat.info/info/unicode/category/Nd/list.htm

func toDigitImpl(r: Rune): int =
    let codePoint = ord(r)
    for z in allZeros:
        # not a binary search, because first runes are more common
        if codePoint >= z:
            return codePoint - z
    return -1

func isDigit*(r: Rune): bool =
    let digit = r.toDigitImpl()
    digit in {0..9}

func toDigit*(r: Rune): int =
    result = r.toDigitImpl()
    if result notin {0..9}:
        raise newException(RangeDefect, "rune is not a valid digit")

It should need some unit tests case

tersec commented 3 months ago

Ok, so I've looked into it, and I still think my implementation is fine for 99.999% use cases. Sure, there are all sorts of other interesting Unicode digit code points, but how often do people use these... extremely rarely, I'd say.

In real life, the 0 to 9 ASCII chars are enough.

https://www.fileformat.info/info/unicode/category/Nd/list.htm

Nim's standard library is already full of random jank which comes from this approach. Please do this correctly if it's done at all.

Not just some "in real life ..." approach dismissive of what you regard as irrelevant edge cases.

And, if that's too difficult, that's maybe an indication that what you're looking for is conceptually nontrivial, and not just to be casually tossed into the Nim stdlib.

johnnovak commented 3 months ago

Dunno man. I'd personally take a 99.99% solution than no solution, or having to install a library for the 0.01% use case that I'll never need. I guess this is an approach thing, we are all a bit different 😄

Having full support for all the obscure and historical languages and their codepoints seems like a specialised, niche use case to me, not the standard use case.

Thinking about standard use cases vs niche use cases might be a good guide for the stdlib.

I've looked at your Unicode library. It's nice and super handy for someone who needs that level of support. But personally I don't want to pull it in if I only need a simple isDigit on Rune that will work with all languages in actual use.

SolitudeSF commented 3 months ago

If its tailored for your use case why does it have to be in everyone's stdlib?

johnnovak commented 3 months ago

If its tailored for your use case why does it have to be in everyone's stdlib?

Because it's practical and 99.99% of people find it useful, and most people haven't even heard of those obscure languages with those alternative Unicode codepoints, neither will come near them in their lifetime?

Good enough reason? 😆

To be more productive, it's more fruitful to think in terms of levels of support. Something similar to what's described here: https://cldr.unicode.org/index/language-support-levels

I'd imagine supporting most major languages in actual use is a good practical level of support. Which my isDigit does. Then supporting historical and obscure languages, at the expense of added complexity, extra databases, libraries, and whatnot, is the full support, but I don't think many people ever need that level.

Alogani commented 3 months ago

At first, I didn't see the point of adding johnnovak's isDigit to Rune, as it is :

But I do agree with johnnovak that :

So I think a isDigit implementation should be added for rune in std/unicode, either the one from johnnovak with a proper documentation/proper name about its limitation (really important !), either the one I propose (but which may be too specialized for stdlib).

So I would :

tersec commented 3 months ago

Dunno man. I'd personally take a 99.99% solution than no solution, or having to install a library for the 0.01% use case that I'll never need.

This is a false dichotomy. Implementing isAsciiDigit for Rune neither has no solution, nor requires installing a library, as @Alogani alludes to. One (but not the only, or necessarily best) approach is:

import std/[sequtils, strutils, unicode]
func isAsciiDigit(x: Rune): bool = ($x).allIt(it.isDigit)
for digit in Digits:
  doAssert ($digit).runeAt(0).isAsciiDigit
for letter in Letters:
  doAssert not ($letter).runeAt(0).isAsciiDigit

allIt is questionable for this purpose; one can/should exploit UTF-8 encoding more closely. And, this should work for multi-byte UTF-8 encodings too, because of how UTF-8 works. The high bit will be set, et cetera.

This also addresses

No one wants to rewrite the wheel (or download dozens of nimble packages) because simple things are not present in the stdlib

What "dozens of nimble packages" do you refer to here?

This is the ASCII case, of course. But since you've already indicated that's all you care about, literally one line will do it, plus the tests of which I provided some non-exhaustive examples, without having to create an attractive nuisance in the stdlib. One might note that nothing else in std/unicode restricts itself to ASCII. The point of the module is to support Unicode. That means Unicode casing, Unicode spaces/whitespace, Unicode alpha/letters, et cetera.

Thinking about standard use cases vs niche use cases might be a good guide for the stdlib.

You're already planning on using this functionality in https://github.com/nim-lang/Nim/issues/23462 it looks like? Which is also supposed to go in the stdlib. So it wouldn't just be this function, but this natural sort with this ASCII-digits constraints. This illustrates the point, even: you're already, right now, planning on propagating this brokenness to other parts of the stdlib.

I've looked at your Unicode library. It's nice and super handy for someone who needs that level of support. But personally I don't want to pull it in if I only need a simple isDigit on Rune that will work with all languages in actual use.

I don't understand the logic here. Aside from more of the dismissal of other languages ("in actual use"), you're proposing, I think, to add

func isDigit*(r: Rune): bool =
  ord(r) >= ord('0') and ord(r) <= ord('9')

in this very issue, to, presumably (?) the std/unicode library. Which one wouldn't be able to access if you aren't willing to "pull it in".

The difference is only, do you use the $/toUTF8 or other approach (many creative approaches exist) to access Runes from std/unicode, or do you use this hypothetical isDigit. Both require that you would "pull it in".

johnnovak commented 3 months ago

Everybody uses standard Arabic ASCII numerals these days, come on. Anyway, you guys do what you think it's best, I don't particularly care anymore 😄

A sane way to do natural sorting is to restrict yourself to standard Arabic numerals in the natural sort algo. There's no committee that will tell you this is wrong or anything... It's perfectly fine.

Alogani commented 3 months ago

I do agree with some points of tersec. Especially even the corner cases are always important and "isDigit" should not be deceptive for those corner cases.

But:

But to say who is wrong or true, let's look at what the other languages do. Like python (yes so original), who uses utf8 strings as default : https://github.com/python/cpython/blob/ab0e60101637b7cf47f3cc95813996791e7118c4/Objects/unicodeobject.c#L2365

Spoiler : all zeros are checked in python :

"\u0660".isdigit() == True
"\u06F1".isdigit() == True

So i repropose my implementation : https://github.com/nim-lang/Nim/issues/23465#issuecomment-2027970284 Or the implementation of someone else. But I still think a kind of isDigit for runes shall be added to std/lib

johnnovak commented 3 months ago

Yeah, and it's a philosophy / approach thing too. I always thought of Nim as a "sweet spot" language. Look at the Java standard libraries; they're far more "complete" than any Nim std lib. But they are huge, with a huge and complicated API surface.

Nim for me always has been 80% of the benefits with 20% of the complexity. Having said that, these ultra-detailed and specialised libs have their place too as separate packages for those who really need them.

johnnovak commented 3 months ago

The difference is only, do you use the $/toUTF8 or other approach (many creative approaches exist) to access Runes from std/unicode, or do you use this hypothetical isDigit. Both require that you would "pull it in".

Wait, is there a plan to move the various Nim std libs out into their own projects that you must then install via Nimble? I'm not too thrilled of that prospect if it's true. I like a stdlib to be slow moving like in most languages (NodeJS does not count 😄). Having to install libs just to be able to use UTF8 is a bit backwards...

demotomohiro commented 3 months ago

Are you sure supporting only ASCII digit characters is fine for 99.999% use cases? In Japan, there are cases to use fullwidth number characters like '1', '2', '3', '4', '5', ... Also, in japanese vertical writing, Japanese numerals is used. It seems Python's isnumeric function supports japanese numerals.

>>> "123".isnumeric()
True
>>> "1".isnumeric()
True
>>> "一".isnumeric()
True
>>> "二".isnumeric()
True
>>> "å„„".isnumeric()
True
>>> "å…†".isnumeric()
True

Or std/unicode doesn't need to support any languages other than english? If so, how about to name it more explicitly like func isAsciiDigit*(r: Rune): bool?

johnnovak commented 3 months ago

If so, how about to name it more explicitly like func isAsciiDigit*(r: Rune): bool?

Yeah, good points, and probably isAsciiDigit is better, after all, as @Alogani has previously suggested.

Updated the ticket.

Araq commented 3 months ago

The signature should be IMO:


func extractDigit*(r: Rune): Option[range[0..9]]

And it should support Unicode Digits.

johnnovak commented 3 months ago

Don't get me wrong, I like this idea, but to support literally all Unicode digits, range[0..9] won't do. There are quite a few historical non-base10 number systems, etc.

https://en.wikipedia.org/wiki/Numerals_in_Unicode

Hence I said, supporting all this niche stuff surely has its place, but most people writing regular computer programs aren't that interested in supporting ancient Greek and Roman number systems that Unicode can represent...

Then there's an insane number of weird "digits" here: https://character.construction/numbers

We need all this in the stdlib, really? 😅

I'm fine with Option[range[0..9]] personally and restricting it to base10 digits.

Alogani commented 3 months ago

extractDigit is indeed a more useful function than isDigit !

Looking at python (yes again) "int" function:

Further its error message is very explicit ValueError: invalid literal for int() with base 10: 'å„„' So I do agree with Araq

In fact I'm most perturbed with the use of Option, because it will be its only use in std/unicode library. But it's not a big deal (still better than try/except in that situation)

johnnovak commented 3 months ago

I like Option and would prefer the stdlib to use it more... but too late for that.

Alogani commented 3 months ago

It is another subject, but I prefer to avoid Option for performance sensitive task (and I'm not used to it, but it's API seems very simple !). Most tasks are not performance sensitive, but I tend to think that dealing with string and seq are.

For example on this small test : (It's not a proposition of code, I know those implementations are not valid for the subject !)

import mylib/testing
import std/options

func extractDigitWithErrVal*(c: char): int =
    var res = ord(c) - ord('0')
    if res < 0 or res > 9:
        return -1
    return res

func extractDigitOption*(c: char): Option[int] =
    var res = ord(c) - ord('0')
    if res < 0 or res > 9:
        return none(int)
    return some(res)

func extractDigitTry*(c: char): int =
    result = ord(c) - ord('0')
    if result < 0 or result > 9:
        raise newException(ValueError, "not a digit")

for (scenari, value) in [("Valid input", '5'), ("Invalid input", 't')]:
    echo scenari
    runBench("extractDigitWithErrVal"):
        var i = extractDigitWithErrVal(value)
        if i == -1:
            discard "Handle it"
        discard "Continue"

    runBench("extractDigitOption"):
        var i = extractDigitOption(value)
        if i.isNone:
            discard "Handle it"
        discard "Continue"

    runBench("extractDigitTry"):
        var i = try:
            extractDigitTry(value)
        except:
            discard "Handle it"
            -1 # To allow my script to go on
        discard "Continue"
>>> nim r -d:release -d:danger -d:bench test3.nim
Valid input
  >= BENCH     extractDigitWithErrVal
  >-- PASS:                                    929.80 op/μs | 1.0755 ns/op
  >= BENCH     extractDigitOption
  >-- PASS:                                    133.66 op/μs | 7.4815 ns/op
  >= BENCH     extractDigitTry
  >-- PASS:                                    306.89 op/μs | 3.2585 ns/op
Invalid input
  >= BENCH     extractDigitWithErrVal
  >-- PASS:                                    932.90 op/μs | 1.0719 ns/op
  >= BENCH     extractDigitOption
  >-- PASS:                                    140.25 op/μs | 7.1300 ns/op
  >= BENCH     extractDigitTry
  >-- PASS:                                    5.6067 op/μs | 178.36 ns/op
Araq commented 3 months ago

@Alogani Well it's a micro benchmark so it probably doesn't mean anything. However Option[int] might be worse than e.g Option[uint8] as the latter fits in a single CPU register.

Alogani commented 3 months ago

Yes, micro benchmark doesn't tell the whole truth. The final implementation will spend more time doing the calculations that dealing with options. Yet, order of magnitudes could be concerning if someone deals with huge amount of utf8 strings (which can be trivial).

But you are totally right, passing to uint8 drastically improve it (well seen, I thought on 64 byte machine, it would be negligible, and I'm quite astonised by the performance penalty of try/except even for valid inputs). So it might be acceptable (only way to know it is when it will be implemented)

Valid input
  >= BENCH     extractDigitWithErrVal
  >-- PASS:                                    953.82 op/μs | 1.0484 ns/op
  >= BENCH     extractDigitOption
  >-- PASS:                                    711.87 op/μs | 1.4048 ns/op
  >= BENCH     extractDigitTry
  >-- PASS:                                    317.10 op/μs | 3.1536 ns/op
Invalid input
  >= BENCH     extractDigitWithErrVal
  >-- PASS:                                    956.49 op/μs | 1.0455 ns/op
  >= BENCH     extractDigitOption
  >-- PASS:                                    949.67 op/μs | 1.0530 ns/op
  >= BENCH     extractDigitTry
  >-- PASS:                                    5.8279 op/μs | 171.59 ns/op

By the way, it's better to have it, even not the most performant, than have nothing.

litlighilit commented 2 months ago

it does restrict to base 10

it doesn't work with weird japanese numerals coded on multiple codepoints (altough isnumeric returns yes)

but it work on other ascii digits

Further its error message is very explicit ValueError: invalid literal for int() with base 10: '億'


I've found Python's `unicodedata`[^1] can handle this:

```Python
from unicodedata import numeric
assert numeric('å„„') == 100000000.0
# returns a float