Open johnnovak opened 3 months ago
I think that such a function might be a bit too misleading for Unicode. A proper Unicode implementation will be harder to do, but it has already been done, e.g. in https://github.com/nitely/nim-unicodedb:

```nim
assert utmDigit in Rune(0x00B2).unicodeTypes()
```

https://github.com/nitely/nim-unicodedb?tab=readme-ov-file#types
Oh right. Wasn't aware of these extra Unicode numeric runes.
Ok, so I've looked into it, and I still think my implementation is fine for 99.999% of use cases. Sure, there are all sorts of other interesting Unicode digit code points, but how often do people use these... extremely rarely, I'd say.
In real life, the 0 to 9 ASCII chars are enough.
https://www.fileformat.info/info/unicode/category/Nd/list.htm
Hello,
Maybe this implementation?

```nim
import std/unicode

# List of zeros converted from https://www.fileformat.info/info/unicode/category/Nd/list.htm
const allZeros = [48, 1632, 1776, 1984, 2406, 2534, 2662, 2790, 2918, 3046, 3174, 3302, 3430, 3558, 3664, 3792, 3872, 4160, 4240, 6112, 6160, 6470, 6608, 6784, 6800, 6992, 7088, 7232, 7248, 42528, 43216, 43264, 43472, 43504, 43600, 44016, 65296, 4170, 4307, 4358, 4367, 4371, 4381, 4399, 4421, 4429, 4453, 4460, 4467, 4494, 4501, 4549, 4565, 4570, 4597, 5798, 5804, 5813, 7548, 7549, 7550, 7550, 7551, 7700, 7727, 7759, 7829, 8127]

func toDigitImpl(r: Rune): int =
  let codePoint = ord(r)
  # Linear scan, not a binary search, because the first runes are the most common.
  # Only return when the codepoint falls inside a zero..zero+9 block.
  for z in allZeros:
    if codePoint >= z and codePoint <= z + 9:
      return codePoint - z
  return -1

func isDigit*(r: Rune): bool =
  r.toDigitImpl() in 0..9

func toDigit*(r: Rune): int =
  result = r.toDigitImpl()
  if result notin 0..9:
    raise newException(RangeDefect, "rune is not a valid digit")
```
It would need some unit test cases.
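A few cases along these lines might do. This is a minimal sketch, not the thread's code verbatim: it inlines a hypothetical trimmed-down zero table (only three zero codepoints) so the asserts are self-contained, and the `*Sketch` names are illustrative.

```nim
import std/unicode

# Hypothetical, trimmed-down version of the zero-table approach above,
# plus the kind of unit tests it would need.
const someZeros = [48, 1632, 1776] # ASCII, Arabic-Indic, Extended Arabic-Indic

func toDigitSketch(r: Rune): int =
  result = -1
  let cp = int(r)
  for z in someZeros:
    if cp >= z and cp <= z + 9:
      return cp - z

func isDigitSketch(r: Rune): bool = r.toDigitSketch() in 0..9

doAssert isDigitSketch(Rune(ord('7')))     # ASCII digit
doAssert isDigitSketch(Rune(0x0665))       # ARABIC-INDIC DIGIT FIVE (U+0665)
doAssert not isDigitSketch(Rune(ord('a')))
doAssert toDigitSketch(Rune(0x06F1)) == 1  # EXTENDED ARABIC-INDIC DIGIT ONE (U+06F1)
```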
> Ok, so I've looked into it, and I still think my implementation is fine for 99.999% use cases. Sure, there are all sorts of other interesting Unicode digit code points, but how often do people use these... extremely rarely, I'd say.
> In real life, the 0 to 9 ASCII chars are enough.
> https://www.fileformat.info/info/unicode/category/Nd/list.htm
Nim's standard library is already full of random jank which comes from this approach. Please do this correctly if it's done at all.
Not just some "in real life ..." approach dismissive of what you regard as irrelevant edge cases.
And, if that's too difficult, that's maybe an indication that what you're looking for is conceptually nontrivial, and not just to be casually tossed into the Nim stdlib.
Dunno man. I'd personally rather take a 99.99% solution than no solution, or than having to install a library for the 0.01% use case that I'll never need. I guess this is an approach thing, we are all a bit different 😄
Having full support for all the obscure and historical languages and their codepoints seems like a specialised, niche use case to me, not the standard use case.
Thinking about standard use cases vs niche use cases might be a good guide for the stdlib.
I've looked at your Unicode library. It's nice and super handy for someone who needs that level of support. But personally I don't want to pull it in if I only need a simple `isDigit` on `Rune` that will work with all languages in actual use.
If it's tailored for your use case, why does it have to be in everyone's stdlib?
> If its tailored for your use case why does it have to be in everyone's stdlib?
Because it's practical and 99.99% of people find it useful, and most people haven't even heard of those obscure languages with those alternative Unicode codepoints, nor will they come near them in their lifetime?
Good enough reason? 😆
To be more productive, it's more fruitful to think in terms of levels of support. Something similar to what's described here: https://cldr.unicode.org/index/language-support-levels
I'd imagine supporting most major languages in actual use is a good practical level of support, which my `isDigit` does. Then supporting historical and obscure languages, at the expense of added complexity, extra databases, libraries, and whatnot, is full support, but I don't think many people ever need that level.
So I think an `isDigit` implementation should be added for `Rune` in std/unicode: either the one from johnnovak, with proper documentation/a proper name about its limitations (really important!), or the one I propose (which may be too specialized for the stdlib).
> Dunno man. I'd personally take a 99.99% solution than no solution, or having to install a library for the 0.01% use case that I'll never need.
This is a false dichotomy. Implementing `isAsciiDigit` for `Rune` is neither no solution, nor does it require installing a library, as @Alogani alludes to. One (but not the only, or necessarily best) approach is:

```nim
import std/[sequtils, strutils, unicode]

func isAsciiDigit(x: Rune): bool = ($x).allIt(it.isDigit)

for digit in Digits:
  doAssert ($digit).runeAt(0).isAsciiDigit
for letter in Letters:
  doAssert not ($letter).runeAt(0).isAsciiDigit
```
`allIt` is questionable for this purpose; one can/should exploit the UTF-8 encoding more closely. And this should work for multi-byte UTF-8 encodings too, because of how UTF-8 works: the high bit will be set, et cetera.
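To illustrate that byte-level point (a sketch, not tersec's code; the function name is made up): in UTF-8, every byte of a multi-byte sequence has its high bit set, so on a single-rune string a plain `char` range check on the first byte already rules out every non-ASCII codepoint.

```nim
# Sketch: an ASCII digit is exactly a one-byte rune in '0'..'9'; any
# multi-byte UTF-8 sequence has bytes >= 0x80, which can never match.
func isAsciiDigitStr(s: string): bool =
  s.len == 1 and s[0] in {'0'..'9'}

doAssert isAsciiDigitStr("7")
doAssert not isAsciiDigitStr("٥")  # ARABIC-INDIC DIGIT FIVE, two UTF-8 bytes
doAssert not isAsciiDigitStr("10") # more than one rune
```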
This also addresses

> No one wants to rewrite the wheel (or download dozens of nimble packages) because simple things are not present in the stdlib
What "dozens of nimble packages" do you refer to here?
This is the ASCII case, of course. But since you've already indicated that's all you care about, literally one line will do it, plus the tests of which I provided some non-exhaustive examples, without having to create an attractive nuisance in the stdlib. One might note that nothing else in std/unicode restricts itself to ASCII. The point of the module is to support Unicode. That means Unicode casing, Unicode spaces/whitespace, Unicode alpha/letters, et cetera.
> Thinking about standard use cases vs niche use cases might be a good guide for the stdlib.
You're already planning on using this functionality in https://github.com/nim-lang/Nim/issues/23462, it looks like? Which is also supposed to go in the stdlib. So it wouldn't just be this function, but this natural sort with its ASCII-digits constraint. This illustrates the point, even: you're already, right now, planning on propagating this brokenness to other parts of the stdlib.
> I've looked at your Unicode library. It's nice and super handy for someone who needs that level of support. But personally I don't want to pull it in if I only need a simple `isDigit` on Rune that will work with all languages in actual use.
I don't understand the logic here. Aside from more of the dismissal of other languages ("in actual use"), you're proposing, I think, to add

```nim
func isDigit*(r: Rune): bool =
  ord(r) >= ord('0') and ord(r) <= ord('9')
```

in this very issue, to, presumably (?), the `std/unicode` library, which one wouldn't be able to access if one isn't willing to "pull it in".
The only difference is whether you use the `$`/`toUTF8` or other approach (many creative approaches exist) to access `Rune`s from `std/unicode`, or this hypothetical `isDigit`. Both require that you "pull it in".
Everybody uses standard Arabic ASCII numerals these days, come on. Anyway, you guys do what you think is best, I don't particularly care anymore 😄
A sane way to do natural sorting is to restrict yourself to standard Arabic numerals in the natural sort algo. There's no committee that will tell you this is wrong or anything... It's perfectly fine.
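The ASCII-only natural sort described above could be sketched like this (hypothetical code, not from the linked issue; `naturalCmp` is an illustrative name): runs of ASCII digits are compared numerically, everything else byte by byte.

```nim
import std/algorithm

# Hypothetical sketch: natural sort that only treats ASCII '0'..'9' runs
# as numbers, per the restriction described above.
proc naturalCmp(a, b: string): int =
  var i, j = 0
  while i < a.len and j < b.len:
    if a[i] in {'0'..'9'} and b[j] in {'0'..'9'}:
      # Compare whole digit runs numerically
      var x, y = 0
      while i < a.len and a[i] in {'0'..'9'}:
        x = x * 10 + ord(a[i]) - ord('0')
        inc i
      while j < b.len and b[j] in {'0'..'9'}:
        y = y * 10 + ord(b[j]) - ord('0')
        inc j
      if x != y: return cmp(x, y)
    else:
      if a[i] != b[j]: return cmp(a[i], b[j])
      inc i
      inc j
  # Shorter string sorts first when one is a prefix of the other
  result = cmp(a.len - i, b.len - j)

var files = @["file10.txt", "file2.txt", "file1.txt"]
files.sort(naturalCmp)
doAssert files == @["file1.txt", "file2.txt", "file10.txt"]
```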
I do agree with some of tersec's points, especially that corner cases are always important and `isDigit` should not be deceptive for those corner cases.
But:
But to say who is wrong or right, let's look at what the other languages do. Like Python (yes, so original), which uses UTF-8 strings by default: https://github.com/python/cpython/blob/ab0e60101637b7cf47f3cc95813996791e7118c4/Objects/unicodeobject.c#L2365

Spoiler: all zeros are checked in Python:

```Python
"\u0660".isdigit() == True
"\u06F1".isdigit() == True
```
So I re-propose my implementation: https://github.com/nim-lang/Nim/issues/23465#issuecomment-2027970284 Or the implementation of someone else. But I still think some kind of isDigit for runes should be added to the stdlib.
Yeah, and it's a philosophy / approach thing too. I always thought of Nim as a "sweet spot" language. Look at the Java standard libraries; they're far more "complete" than any Nim std lib. But they are huge, with a huge and complicated API surface.
Nim for me always has been 80% of the benefits with 20% of the complexity. Having said that, these ultra-detailed and specialised libs have their place too as separate packages for those who really need them.
> The only difference is whether you use the `$`/`toUTF8` or other approach (many creative approaches exist) to access `Rune`s from `std/unicode`, or this hypothetical `isDigit`. Both require that you "pull it in".
Wait, is there a plan to move the various Nim std libs out into their own projects that you must then install via Nimble? I'm not too thrilled about that prospect if it's true. I like a stdlib to be slow moving, like in most languages (NodeJS does not count 😄). Having to install libs just to be able to use UTF-8 is a bit backwards...
Are you sure supporting only ASCII digit characters is fine for 99.999% of use cases?
In Japan, there are cases where fullwidth number characters like '1', '2', '3', '4', '5', ... are used.
Also, in Japanese vertical writing, Japanese numerals are used.
It seems Python's `isnumeric` function supports Japanese numerals.

```Python
>>> "123".isnumeric()
True
>>> "1".isnumeric()
True
>>> "一".isnumeric()
True
>>> "二".isnumeric()
True
>>> "億".isnumeric()
True
>>> "兆".isnumeric()
True
```
Or does std/unicode not need to support any languages other than English?
If so, how about naming it more explicitly, like `func isAsciiDigit*(r: Rune): bool`?
> If so, how about naming it more explicitly, like `func isAsciiDigit*(r: Rune): bool`?
Yeah, good points, and probably `isAsciiDigit` is better after all, as @Alogani has previously suggested.
Updated the ticket.
The signature should IMO be:

```nim
func extractDigit*(r: Rune): Option[range[0..9]]
```

And it should support Unicode digits.
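A rough sketch of what that signature could look like (hypothetical: the `zeros` table below lists only a few illustrative zero codepoints from the Nd category, not full Unicode support):

```nim
import std/[options, unicode]

# Hypothetical sketch of the proposed signature; `zeros` is only a handful
# of decimal-digit-block zero codepoints, not the full Nd category.
const zeros = [0x0030, 0x0660, 0x06F0, 0x0966] # ASCII, Arabic-Indic, Extended Arabic-Indic, Devanagari

func extractDigit(r: Rune): Option[range[0..9]] =
  result = none(range[0..9])
  let cp = int(r)
  for z in zeros:
    if cp >= z and cp <= z + 9:
      return some(range[0..9](cp - z))

doAssert extractDigit(Rune(ord('5'))).get == 5
doAssert extractDigit(Rune(0x0663)).get == 3  # ARABIC-INDIC DIGIT THREE
doAssert extractDigit(Rune(ord('x'))).isNone
```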
Don't get me wrong, I like this idea, but to support literally all Unicode digits, `range[0..9]` won't do. There are quite a few historical non-base10 number systems, etc.
https://en.wikipedia.org/wiki/Numerals_in_Unicode
Hence I said, supporting all this niche stuff surely has its place, but most people writing regular computer programs aren't that interested in supporting ancient Greek and Roman number systems that Unicode can represent...
Then there's an insane number of weird "digits" here: https://character.construction/numbers
We need all this in the stdlib, really? 😅
I'm personally fine with `Option[range[0..9]]` and restricting it to base-10 digits.
`extractDigit` is indeed a more useful function than `isDigit`!
Looking at Python's `int` function (yes, again):
Furthermore, its error message is very explicit: `ValueError: invalid literal for int() with base 10: '億'`
So I do agree with Araq.
In fact I'm mostly perturbed by the use of Option, because it would be its only use in the std/unicode library. But it's not a big deal (still better than try/except in that situation).
I like Option and would prefer the stdlib to use it more... but too late for that.
It is another subject, but I prefer to avoid Option for performance-sensitive tasks (and I'm not used to it, but its API seems very simple!). Most tasks are not performance sensitive, but I tend to think that dealing with strings and seqs is.
For example, this small test (it's not a code proposal; I know those implementations are not valid for the subject!):
```nim
import mylib/testing
import std/options

func extractDigitWithErrVal*(c: char): int =
  var res = ord(c) - ord('0')
  if res < 0 or res > 9:
    return -1
  return res

func extractDigitOption*(c: char): Option[int] =
  var res = ord(c) - ord('0')
  if res < 0 or res > 9:
    return none(int)
  return some(res)

func extractDigitTry*(c: char): int =
  result = ord(c) - ord('0')
  if result < 0 or result > 9:
    raise newException(ValueError, "not a digit")

for (scenari, value) in [("Valid input", '5'), ("Invalid input", 't')]:
  echo scenari
  runBench("extractDigitWithErrVal"):
    var i = extractDigitWithErrVal(value)
    if i == -1:
      discard "Handle it"
    discard "Continue"
  runBench("extractDigitOption"):
    var i = extractDigitOption(value)
    if i.isNone:
      discard "Handle it"
    discard "Continue"
  runBench("extractDigitTry"):
    var i =
      try:
        extractDigitTry(value)
      except:
        discard "Handle it"
        -1 # To allow my script to go on
    discard "Continue"
```
```
>>> nim r -d:release -d:danger -d:bench test3.nim
Valid input
>= BENCH extractDigitWithErrVal
>-- PASS: 929.80 op/μs | 1.0755 ns/op
>= BENCH extractDigitOption
>-- PASS: 133.66 op/μs | 7.4815 ns/op
>= BENCH extractDigitTry
>-- PASS: 306.89 op/μs | 3.2585 ns/op
Invalid input
>= BENCH extractDigitWithErrVal
>-- PASS: 932.90 op/μs | 1.0719 ns/op
>= BENCH extractDigitOption
>-- PASS: 140.25 op/μs | 7.1300 ns/op
>= BENCH extractDigitTry
>-- PASS: 5.6067 op/μs | 178.36 ns/op
```
@Alogani Well, it's a micro benchmark, so it probably doesn't mean anything. However, `Option[int]` might be worse than e.g. `Option[uint8]`, as the latter fits in a single CPU register.
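The size difference behind this point can be checked directly (a tiny illustration, not thread code; exact sizes depend on the target and Nim version):

```nim
import std/options

# Option wraps the payload together with a "has value" flag, so the payload
# size (plus alignment padding) dominates: Option[uint8] is far smaller
# than Option[int].
doAssert sizeof(Option[uint8]) < sizeof(Option[int])
echo sizeof(Option[uint8]), " vs ", sizeof(Option[int])
```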
Yes, a micro benchmark doesn't tell the whole truth. The final implementation will spend more time doing the calculations than dealing with options. Yet, orders of magnitude could be concerning if someone deals with huge amounts of UTF-8 strings (which can be trivial).
But you are totally right, switching to uint8 drastically improves it (well spotted; I thought that on a 64-bit machine it would be negligible, and I'm quite astonished by the performance penalty of try/except even for valid inputs). So it might be acceptable (the only way to know is once it's implemented).
```
Valid input
>= BENCH extractDigitWithErrVal
>-- PASS: 953.82 op/μs | 1.0484 ns/op
>= BENCH extractDigitOption
>-- PASS: 711.87 op/μs | 1.4048 ns/op
>= BENCH extractDigitTry
>-- PASS: 317.10 op/μs | 3.1536 ns/op
Invalid input
>= BENCH extractDigitWithErrVal
>-- PASS: 956.49 op/μs | 1.0455 ns/op
>= BENCH extractDigitOption
>-- PASS: 949.67 op/μs | 1.0530 ns/op
>= BENCH extractDigitTry
>-- PASS: 5.8279 op/μs | 171.59 ns/op
```
By the way, it's better to have it, even if not the most performant, than to have nothing.

- it does restrict to base 10
- it doesn't work with Japanese numerals encoded as multiple codepoints (although `isnumeric` returns true for them)
- but it does work on other decimal digits

Furthermore, its error message is very explicit: `ValueError: invalid literal for int() with base 10: '億'`
I've found Python's `unicodedata` can handle this:

```Python
from unicodedata import numeric
assert numeric('億') == 100000000.0
# returns a float
```
Summary
Some util functions that are defined for `char`s are missing for `Rune`s. `isAsciiDigit` is one of them.
Description
Alternatives
No response
Examples
No response
Backwards Compatibility
No response
Links
No response