pytest-dev / iniconfig

MIT License
58 stars 31 forks source link

strip unicode whitespace #4

Open gaborbernat opened 6 years ago

gaborbernat commented 6 years ago

At the moment the parser uses the built-in string strip function (https://github.com/RonnyPfannschmidt/iniconfig/blob/master/iniconfig.py#L134). Nowadays is more and more common that people inadvertently introduce Unicode whitespaces in their configuration. The string strip function does not remove this. When we detect such we should at least warn (especially on the left-hand side), or fail. Failing to strip such characters is never the desired effect, as it effectively generates a new key that visually seems the same leaving the user in confusion of why his config does not work.

Note the py package uses this package vendored, and via that pytest/tox too (https://github.com/pytest-dev/py/blob/master/py/_vendored_packages/iniconfig.py).

Issues generated by this omission:

Detecting unicode spaces https://stackoverflow.com/questions/8921365/in-python-how-to-list-all-characters-matched-by-posix-extended-regex-space/37903375#37903375

RonnyPfannschmidt commented 6 years ago

as is iniconfig is not unicode aware, so this one is a really tricky one to fix its def not on my own roadmap

gaborbernat commented 6 years ago

@RonnyPfannschmidt what about doing strip as (in case the content read from the file is of type unicode):

>>> from unicodedata import name
>>> import sys
>>> import re
>>> spaces = u''.join(re.findall(r'\s', u''.join(unichr(c) for c in xrange(sys.maxunicode+1)), re.UNICODE))
>>> 'envlist '.strip(spaces)
u'envlist'
>>> name(u'\xA0')
'NO-BREAK SPACE
>>> u'envlist\xA0'.strip(spaces)
u'envlist'
>>> 
gaborbernat commented 6 years ago

An alternative solution would be to try to avoid such configs by either warning when such lines are detected.

RonnyPfannschmidt commented 6 years ago

i am fine with any consistent solution - i just don't dont have the time/motivation to do the complete dance of ensuring its consistent

right now iniconfig is not unicode aware and works in terms of native strings and ascii whitespace

note that your example generates the spaces in a pretty expensive manner

RonnyPfannschmidt commented 1 year ago

@gaborbernat with the new release unicode support is in but unicode whitespace is not yet considered