abalkin opened 14 years ago
I am opening a new report to continue work on the issues raised in bpo-10557 that are either feature requests or documentation bugs.
The rest is my reply to the relevant portions of Marc's comment at msg122785.
On Mon, Nov 29, 2010 at 4:41 AM, Marc-Andre Lemburg <report@bugs.python.org> wrote:

> Alexander Belopolsky wrote:
>> After a bit of svn archeology, it does appear that Arabic-Indic
>> digits' support was deliberate at least in the sense that the
>> feature was tested for when the code was first committed. See r15000.
>
> As I mentioned on python-dev
> (http://mail.python.org/pipermail/python-dev/2010-November/106077.html)
> this support was added intentionally.
>
>> The test migrated from file to file over the last 10 years, but it
>> is still present in test_float.py:
>>
>>     self.assertEqual(float(b" \u0663.\u0661\u0664 ".decode('raw-unicode-escape')), 3.14)
>>
>> (It should probably be now rewritten using a string literal.)
>> ..
>> For the future, I note that starting with Unicode 6.0.0,
>> the Unicode Consortium promises that
>>
>> """
>> Characters with the property value Numeric_Type=de (Decimal) only
>> occur in contiguous ranges of 10 characters, with ascending numeric
>> values from 0 to 9 (Numeric_Value=0..9).
>> """
>>
>> This makes it very easy to check a numeric string does not contain
>> a mix of digits from different scripts.
>
> I'm not sure why you'd want to check for such ranges.
In order to disallow a mix of say Arabic-Indic and Bengali digits. Such combinations cannot be defended as possibly valid numbers in any script.
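For illustration (a demonstration of the behaviour this issue argues against): as of current CPython, exactly such a mix is accepted, since each Nd digit is converted independently. The example below mixes Arabic-Indic digits with a Bengali one:

    >>> float('\u0663.\u0661\u09ea')  # ARABIC-INDIC THREE, ONE; BENGALI FOUR
    3.14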
>> I still believe that proper API should require explicit choice of
>> language or locale before allowing digits other than 0-9 just as
>> int() would not accept hexadecimal digits without explicit choice of
>> base >= 16. But this would be a subject of a feature request.
>
> Since when do we require a locale or language to be specified when
> using Unicode?
This is a valid question. I may be in minority, but I find it convenient to use int(), float() etc. for data validation. If my program gets a CSV file with Arabic-Indic digits, I want to fire the guy who prepared it before it causes real issues. :-) I may be too strict, but I don't think anyone would want to see columns with a mix of Bengali and Devanagari numerals.
On the other hand there is certain convenience in promiscuous parsers, but this is not an expectation that I have from int() and friends. int('0xFF') requires me to specify base even though 0xFF is a perfectly valid notation.
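To make the analogy concrete, this is standard CPython behaviour (shown for illustration):

    >>> int('0xFF')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: invalid literal for int() with base 10: '0xFF'
    >>> int('0xFF', 16)
    255
    >>> int('0xFF', 0)  # base 0: infer the base from the literal's prefix
    255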
There are pros and cons in any approach. Let's figure out what is better before we fix the documentation.
> The codecs, Unicode methods and other Unicode support features
> happily work with all kinds of languages, mixed or not, without any
> such specification.
In my view int() and friends are only marginally related to Unicode and Unicode methods design is not directly relevant to their behavior. If we were designing str.todigits(), by all means, I would argue that it must be consistent with str.isdigit(). For numeric data, however, I think we should follow the logic that rejected int('0xFF').
This is my opinion. We can consider allowing int('0xFF') as well. Let's discuss.
See also bpo-9574 for a somewhat related discussion.
> I may be in minority, but I find it convenient to use int(), float()
> etc. for data validation.

A number of libraries agree: argparse, HTML form handling libs, etc.
> I may be too strict, but I don't think anyone would want to see
> columns with a mix of Bengali and Devanagari numerals. [...] On the
> other hand there is certain convenience in promiscuous parsers, but
> this is not an expectation that I have from int() and friends. [...]
> There are pros and cons in any approach.

Indeed, tough question. On one hand, I tend to agree that mixing Hindi/Arab numerals with Bengali does not make sense; on the other hand, rejecting it means that the int code does know about Unicode, which you argued against.

[MAL]
> The codecs, Unicode methods and other Unicode support features
> happily work with all kinds of languages, mixed or not, without any
> such specification.

[Alexander]
> In my view int() and friends are only marginally related to Unicode
> and Unicode methods design is not directly relevant to their behavior.

I think I agree. It's perfectly fine that Unicode support features don't care about the type of the characters but just encode and decode; however, int has a validation step. It rejects numerals that don't make sense with the given base, for example, so rejecting nonsensical sequences of Unicode numerals makes sense IMO.

What do the other languages that are able to convert Unicode numerals to integer objects do?
On Sat, May 7, 2011 at 11:25 AM, Éric Araujo <report@bugs.python.org> wrote:

> ..
> On one hand, I tend to agree that mixing Hindi/Arab numerals with
> Bengali does not make sense; on the other hand, rejecting it means
> that the int code does know about Unicode, which you argued against.
In order to flag use of mixed scripts in numerals, the code does not require access to any additional Unicode data. Since Unicode 6.0.0, programmers can rely on the following stability promise:

"""
Characters with the property value Numeric_Type=de (Decimal) only
occur in contiguous ranges of 10 characters, with ascending numeric
values from 0 to 9 (Numeric_Value=0..9).
"""
-- http://www.unicode.org/policies/stability_policy.html
Therefore, the validation code can simply check that for all digits in the number, ord(d) - unicodedata.numeric(d) is the same.
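A minimal sketch of that check, assuming a hypothetical helper name same_script() and using unicodedata.decimal() (rather than unicodedata.numeric()) so the offsets stay integral:

    import unicodedata

    def same_script(s):
        """True if every decimal digit in s comes from a single script."""
        # For a decimal digit d, ord(d) - decimal(d) is the code point of
        # that script's ZERO digit; one distinct offset means one script.
        zeros = {ord(c) - unicodedata.decimal(c)
                 for c in s if unicodedata.category(c) == 'Nd'}
        return len(zeros) <= 1

    >>> same_script('\u0663\u0661\u0664')   # all Arabic-Indic
    True
    >>> same_script('\u0663\u09ea')         # Arabic-Indic mixed with Bengali
    False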
> I find it convenient to use int(), float() etc. for data validation.
Me too. This is why I'd still be happiest with int and float not accepting non-ASCII digits at all. (And also why the recent suggestions to allow extra underscores in int and float input make me uneasy.)
I've changed my mind :-)
Restricting the decimal encoder to only accept code points from one of the possible decimal digit ranges is a good idea. Let's do that.
It looks like we are approaching consensus on some points:

1. Mixed script numerals should be disallowed.
2. '\N{MINUS SIGN}' should be accepted as an alternative to '\N{HYPHEN-MINUS}'
Open question: should we accept fullwidth + and -, sub/superscript variants etc.? I believe rather than debating variant codepoints one by one, we should consider applying NFKC (compatibility) normalization to unicode strings to be interpreted as numbers. This would allow parsing strings like this:
>>> from unicodedata import normalize
>>> float(normalize('NFKC', '\N{FULLWIDTH HYPHEN-MINUS}\N{DIGIT ONE FULL STOP}\N{FULLWIDTH DIGIT TWO}'))
-1.2
On 12.06.2013 07:32, Alexander Belopolsky wrote:
>
> Alexander Belopolsky added the comment:
>
> It looks like we are approaching consensus on some points:
>
> 1. Mixed script numerals should be disallowed.
> 2. '\N{MINUS SIGN}' should be accepted as an alternative to '\N{HYPHEN-MINUS}'
>
> Open question: should we accept fullwidth + and -, sub/superscript variants etc.? I believe rather than debating variant codepoints one by one, we should consider applying NFKC (compatibility) normalization to unicode strings to be interpreted as numbers. This would allow parsing strings like this:
>
> >>> float(normalize('NFKC', '\N{FULLWIDTH HYPHEN-MINUS}\N{DIGIT ONE FULL STOP}\N{FULLWIDTH DIGIT TWO}'))
> -1.2
While it would solve these cases, I think that would cause a significant performance hit.
Perhaps we could do this in two phases:
I think PEP-393 gives us a quick path to fast parsing: if the max char is < 128, just roll straight into normal processing; otherwise do the normalisation and "all decimal digits are from the same script" steps.
There are almost certainly better ways to do the script translation, but the example below tries to just do the "convert to ASCII" step to avoid duplicating the +/- and decimal point processing logic:
    import unicodedata

    # max_char, toNFKC and parse_ascii_number stand in for the C-level
    # helpers; toNFKC would be unicodedata.normalize('NFKC', ...)
    if max_char(arg) >= 128:
        arg = toNFKC(arg)
        originals = set()
        converted = []
        for c in arg:
            try:
                d = str(unicodedata.decimal(c))
            except ValueError:
                d = c  # not a decimal digit: sign, decimal point, etc.
            else:
                originals.add(ord(c))  # track code points, not characters
            converted.append(d)
        # decimal digits of one script occupy a contiguous range of 10
        # code points, so a spread of 10 or more means mixed scripts
        if originals and max(originals) - min(originals) >= 10:
            raise ValueError("%s mixes digits from multiple scripts" % arg)
        arg = "".join(converted)
    result = parse_ascii_number(arg)
P.S. I don't think the base argument is especially applicable ('0x' is rejected because 'x' is not a base 10 digit, and we allow a base of 0 to request "use int literal base markers").
PEP-393 implementation has already added the fast path to decimal encoding:
http://hg.python.org/cpython/diff/8beaa9a37387/Objects/unicodeobject.c#l1.3735
What we can do, however, is improve performance of converting non-ASCII numerals by looking up only the first digit's value and converting the rest using simple arithmetic:

    value = code - (first_code - first_value)
    if not 0 <= value < 10:
        raise or fall back to UCD lookup
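A rough Python sketch of that strategy (digits_to_ascii is a hypothetical name; the real change would live in the C decimal encoder):

    import unicodedata

    def digits_to_ascii(s):
        zero = None   # code point of the first script's ZERO digit
        out = []
        for c in s:
            if zero is not None:
                value = ord(c) - zero
                if 0 <= value <= 9:           # cheap arithmetic, no UCD lookup
                    out.append(str(value))
                    continue
            try:
                value = unicodedata.decimal(c)   # UCD lookup (slow path)
            except ValueError:
                out.append(c)                 # sign, decimal point, etc.
            else:
                if zero is not None:
                    # a decimal digit outside the first script's range
                    raise ValueError("mixed-script digits in %r" % s)
                zero = ord(c) - value         # one lookup for the first digit
                out.append(str(value))
        return "".join(out)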
On 14.06.2013 03:43, Alexander Belopolsky wrote:
> Alexander Belopolsky added the comment:
>
> PEP-393 implementation has already added the fast path to decimal
> encoding:
>
> http://hg.python.org/cpython/diff/8beaa9a37387/Objects/unicodeobject.c#l1.3735
>
> What we can do, however, is improve performance of converting
> non-ascii numerals by looking up only the first digit's value and
> converting the rest using simple arithmetic:
>
>     value = code - (first_code - first_value)
>     if not 0 <= value < 10:
>         raise or fall back to UCD lookup
I'm not sure whether just relying on PEP-393 is good enough.
Of course, you can special case the conversion based on the kind, but that's only one form of optimization.
Slicing operations don't recheck the max code point used in the substring. As a result, a slice may very well be of the UCS2 kind, even though the text itself is ASCII.
Apart from the fast-path based on the string kind, I think the decimal encoder would also have to scan the string for non-ASCII code points. If it finds non-ASCII code points, it would have to call the normalizer and restart the scan based on the normalized string.
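A sketch of that two-step encoder in Python terms (encode_decimal is a hypothetical name, and str.isascii() only exists since Python 3.7; the point is the structure, not the exact calls):

    import unicodedata

    def encode_decimal(s):
        if s.isascii():                        # fast path: nothing to normalize
            return s
        s = unicodedata.normalize('NFKC', s)   # slow path: normalize, rescan
        out = []
        for c in s:
            try:
                out.append(str(unicodedata.decimal(c)))
            except ValueError:
                out.append(c)
        return "".join(out)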
I took another look at the library reference and it looks like when it comes to non-ASCII digit support, the reference contradicts itself. On one hand,

"""
int(x, base=10)

If x is not a number or if base is given, then x must be a string,
bytes, or bytearray instance representing an integer literal in radix
base. Optionally, the literal can be preceded by + or - (with no space
in between) and surrounded by whitespace.
"""
-- http://docs.python.org/3/library/functions.html#int

.. suggests that only "an integer literal" will be accepted by int(), but on the other hand, a note in the "Numeric Types" section says: "The numeric literals accepted include the digits 0 to 9 or any Unicode equivalent (code points with the Nd property)." -- http://docs.python.org/3/library/stdtypes.html#typesnumeric
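The Nd behaviour described in that note is easy to demonstrate:

    >>> int('\u0661\u0662\u0663')   # ARABIC-INDIC ONE, TWO, THREE
    123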
It also appears that the "surrounded by whitespace" part is not entirely correct:
>>> '\N{RS}'.isspace()
True
>>> int('123\N{RS}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '123\x1e'
This is probably a bug in the current implementation and I will open a separate issue for that.
I opened bpo-18236 to address the issue of surrounding whitespace.
I have started a rough prototype for what I plan to eventually reimplement in C and propose as a patch here.
https://bitbucket.org/alexander_belopolsky/misc/src/c175171cc76e/utoi.py?at=master
Comments welcome.
Martin v. Löwis wrote at bpo-18236 (msg191687):
> int conversion ultimately uses Py_ISSPACE, which conceptually could
> deviate from the Unicode properties (as it is byte-based). This is
> not really an issue, since they indeed match.
Py_ISSPACE matches the Unicode White_Space property in the ASCII range (the first 128 code points), but differs for byte (code point) values from 128 through 255. This leads to the following discrepancy:
>>> int('123\xa0')
123
but
>>> int(b'123\xa0')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 3: invalid start byte
>>> int('123\xa0'.encode())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '123\xa0'
For the last discrepancy see bpo-16741. I have a patch which should fix this.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = 'https://github.com/abalkin'
closed_at = None
created_at =
labels = ['interpreter-core', 'type-feature', 'docs']
title = 'Review and document string format accepted in numeric data type constructors'
updated_at =
user = 'https://github.com/abalkin'
```
bugs.python.org fields:
```python
activity =
actor = 'skrah'
assignee = 'belopolsky'
closed = False
closed_date = None
closer = None
components = ['Documentation', 'Interpreter Core']
creation =
creator = 'belopolsky'
dependencies = []
files = []
hgrepos = []
issue_num = 10581
keywords = []
message_count = 16.0
messages = ['122834', '122835', '135469', '135536', '135977', '190949', '191011', '191014', '191081', '191101', '191105', '191302', '191304', '191314', '191709', '191720']
nosy_count = 11.0
nosy_names = ['lemburg', 'loewis', 'mark.dickinson', 'ncoghlan', 'belopolsky', 'vstinner', 'eric.smith', 'ezio.melotti', 'eric.araujo', 'cvrebert', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue10581'
versions = ['Python 3.3', 'Python 3.4']
```