Closed dimpase closed 4 years ago
An easy way out is just to check whether the locale change worked, and if not, use the `C` locale, not `C.UTF-8`. Perhaps print a warning.
Replying to @dimpase:
> an easy way out is just to check whether the locale change worked, and if not, use `C` locale, not `C.UTF-8`.

+1
Author: Dima Pasechnik
Either this ticket is a duplicate of #30008 or it should make #30008 obsolete.
This ticket was the result of a bug report on Arch, not CentOS. Hopefully it works for #30008 too.
Tests did not complete, because the 9.2.beta3 tests fail everywhere.
https://github.com/sagemath/sage/actions/runs/157607524
Is this a github issue or have we broken sage?
Step 13/18 : RUN ./bootstrap
---> Running in 89427ef5c1c4
rm -rf config configure build/make/Makefile-auto.in
rm -f src/doc/en/installation/*.txt
rm -rf src/doc/en/reference/spkg/*.rst
rm -f src/doc/en/reference/repl/*.txt
src/doc/bootstrap:48: installing src/doc/en/installation/arch.txt and src/doc/en/installation/arch-optional.txt
src/doc/bootstrap:48: installing src/doc/en/installation/debian.txt and src/doc/en/installation/debian-optional.txt
src/doc/bootstrap:48: installing src/doc/en/installation/fedora.txt and src/doc/en/installation/fedora-optional.txt
src/doc/bootstrap:48: installing src/doc/en/installation/cygwin.txt and src/doc/en/installation/cygwin-optional.txt
src/doc/bootstrap:48: installing src/doc/en/installation/homebrew.txt and src/doc/en/installation/homebrew-optional.txt
src/doc/bootstrap:55: installing src/doc/en/reference/spkg/*.rst
src/doc/bootstrap:83: installing src/doc/en/reference/repl/options.txt
src/doc/bootstrap: line 84: src/doc/en/reference/repl/options.txt: No such file or directory
The command '/bin/sh -c ./bootstrap' returned a non-zero code: 1
I just found #30064. Edit: I was cc'd on that, but I didn't realize how serious this is.
Ok. I'll run a new test then.
This breaks building sphinx on windows.
https://github.com/kliem/sage/runs/838933940
Same error as #30008.
As far as I understand, the problem is that we need some sort of UTF to make the sphinx build work.
It appears that on `cygwin` the default is better than `C` and `C.UTF-8` does not work.
So maybe `C` is not the best alternative for `C.UTF-8`.
Btw, strangely centos 7 appears to work with the current beta. I don't know what happened. (And I don't know yet, if this behavior is stable).
And it breaks centos 8.
Replying to @kliem:
> This breaks building sphinx on windows.
> https://github.com/kliem/sage/runs/838933940
> Same error as #30008.
> As far as I understand the problem is that we need some sort of UTF to make the sphinx build work. It appears that on `cygwin` the default is better than `C` and `C.UTF-8` does not work. So maybe `C` is not the best alternative for `C.UTF-8`.

I'm not sure what you mean here. `C.UTF-8` is supported on Cygwin and is in fact the default locale in absence of any other settings: https://www.cygwin.com/cygwin-ug-net/setup-locale.html
> The default locale in the absence of the aforementioned locale environment variables is "C.UTF-8".

the error `UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 45: ordinal not in range(128)`:
2020-07-05T16:29:58.1676169Z [sphinx-3.0.4.p0] installing. Log file: /cygdrive/d/a/sage/sage/logs/pkgs/sphinx-3.0.4.p0.log
2020-07-05T16:30:01.7883565Z [sphinx-3.0.4.p0] error installing, exit status 1. End of log file:
2020-07-05T16:30:01.8141086Z [sphinx-3.0.4.p0] Found local metadata for sphinx-3.0.4.p0
2020-07-05T16:30:01.8155620Z [sphinx-3.0.4.p0] Attempting to download package Sphinx-3.0.4.tar.gz from mirrors
2020-07-05T16:30:01.8169727Z [sphinx-3.0.4.p0] http://mirrors.mit.edu/sage/spkg/upstream/sphinx/Sphinx-3.0.4.tar.gz
2020-07-05T16:30:01.8174798Z [sphinx-3.0.4.p0] [......................................................................]
2020-07-05T16:30:01.8191111Z [sphinx-3.0.4.p0] sphinx-3.0.4.p0
2020-07-05T16:30:01.8193080Z [sphinx-3.0.4.p0] ====================================================
2020-07-05T16:30:01.8197343Z [sphinx-3.0.4.p0] Setting up build directory for sphinx-3.0.4.p0
2020-07-05T16:30:01.8217424Z [sphinx-3.0.4.p0] Traceback (most recent call last):
2020-07-05T16:30:01.8221790Z [sphinx-3.0.4.p0] File "/cygdrive/d/a/sage/sage/build/bin/sage-uncompress-spkg", line 23, in <module>
2020-07-05T16:30:01.8222267Z [sphinx-3.0.4.p0] run()
2020-07-05T16:30:01.8222576Z [sphinx-3.0.4.p0] File "/cygdrive/d/a/sage/sage/build/bin/../sage_bootstrap/uncompress/cmdline.py", line 72, in run
2020-07-05T16:30:01.8222857Z [sphinx-3.0.4.p0] unpack_archive(archive, dirname)
2020-07-05T16:30:01.8223251Z [sphinx-3.0.4.p0] File "/cygdrive/d/a/sage/sage/build/bin/../sage_bootstrap/uncompress/action.py", line 68, in unpack_archive
2020-07-05T16:30:01.8223583Z [sphinx-3.0.4.p0] archive.extractall(members=archive.names)
2020-07-05T16:30:01.8223861Z [sphinx-3.0.4.p0] File "/cygdrive/d/a/sage/sage/build/bin/../sage_bootstrap/uncompress/tar_file.py", line 96, in extractall
2020-07-05T16:30:01.8224117Z [sphinx-3.0.4.p0] **kwargs)
2020-07-05T16:30:01.8224323Z [sphinx-3.0.4.p0] File "/usr/lib/python3.6/tarfile.py", line 2010, in extractall
2020-07-05T16:30:01.8224793Z [sphinx-3.0.4.p0] numeric_owner=numeric_owner)
2020-07-05T16:30:01.8225151Z [sphinx-3.0.4.p0] File "/usr/lib/python3.6/tarfile.py", line 2052, in extract
2020-07-05T16:30:01.8225442Z [sphinx-3.0.4.p0] numeric_owner=numeric_owner)
2020-07-05T16:30:01.8225898Z [sphinx-3.0.4.p0] File "/cygdrive/d/a/sage/sage/build/bin/../sage_bootstrap/uncompress/tar_file.py", line 122, in _extract_member
2020-07-05T16:30:01.8226166Z [sphinx-3.0.4.p0] **kwargs)
2020-07-05T16:30:01.8226601Z [sphinx-3.0.4.p0] File "/usr/lib/python3.6/tarfile.py", line 2122, in _extract_member
2020-07-05T16:30:01.8226883Z [sphinx-3.0.4.p0] self.makefile(tarinfo, targetpath)
2020-07-05T16:30:01.8227488Z [sphinx-3.0.4.p0] File "/usr/lib/python3.6/tarfile.py", line 2163, in makefile
2020-07-05T16:30:01.8227940Z [sphinx-3.0.4.p0] with bltn_open(targetpath, "wb") as target:
2020-07-05T16:30:01.8228249Z [sphinx-3.0.4.p0] UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 45: ordinal not in range(128)
2020-07-05T16:30:01.8228542Z [sphinx-3.0.4.p0] ************************************************************************
2020-07-05T16:30:01.8228988Z [sphinx-3.0.4.p0] Error: failed to extract /cygdrive/d/a/sage/sage/upstream/Sphinx-3.0.4.tar.gz
2020-07-05T16:30:01.8229264Z [sphinx-3.0.4.p0] ************************************************************************
2020-07-05T16:30:01.8229537Z [sphinx-3.0.4.p0] Full log file: /cygdrive/d/a/sage/sage/logs/pkgs/sphinx-3.0.4.p0.log
Could it be that `locale` on Cygwin is not installed by default?
However, https://www.cygwin.com/cygwin-ug-net/setup-locale.html says:
> Note: For a list of locales supported by your Windows machine, use the new `locale -a` command, which is part of the Cygwin package. For a description see `locale(1)`.

Replying to @kliem:
I started test runs:
I just started rerunning those tests on top of the current beta. Maybe that stuff just goes away by itself.
Still causes this error.
If the centos issue is caused by the sphinx upgrade (according to #30008), why is it blocking this? This is meant to fix another (very annoying) issue on Arch.
It appears that #30008 fixed itself. However, this here broke the cygwin sphinx build, last I checked.
I have no clue what is going on, but with this ticket we go from passing to failing.
It seems a default setting on Cygwin is `LANG=en_US.UTF-8`. Perhaps we can try to only set `LC_ALL` if `LANG` is not already set or something like this.
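A minimal sketch of that idea, written in Python purely for illustration (the real setting lives in the shell script `build/bin/sage-spkg`). The helper name `choose_lc_all` and its `fallback` parameter are hypothetical, and treating an empty `LANG` as unset is an assumption:

```python
def choose_lc_all(env, fallback="C.UTF-8"):
    """Return the value to export as LC_ALL, or None to leave the locale alone.

    Hypothetical helper: `env` is any mapping of environment variables,
    for example os.environ.
    """
    # Assumption: an empty LANG is treated the same as an unset one.
    if env.get("LANG"):
        return None      # the user already has a locale; do not override it
    return fallback      # nothing configured, so fall back to our default


if __name__ == "__main__":
    import os
    print(choose_lc_all(os.environ))
```

With the Cygwin default mentioned above (`LANG=en_US.UTF-8`), this would leave the environment untouched.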
Also it should be investigated whether it was really necessary to add this line in #29033 to achieve Python 3.6 support. In particular note that `sage-uncompress-spkg` uses `sage-system-python` (which can even be python2) -- which really has nothing to do with Python 3.6 support (which is about `PYTHON_FOR_VENV`).
What problems arise if we drop the locale mangling entirely? Trac #15791 doesn't mention a problem.
Description changed:
---
+++
@@ -1,2 +1,5 @@
In #29033 in `build/bin/sage-spkg` LC_ALL was changed to C.UTF-8
However, not all systems have it.
+
+There are also some other locale problems that show up in doctests
+(for example https://groups.google.com/d/msg/sage-release/spalYgXKr-4/ZVsbgHIlAgAJ)
Description changed:
---
+++
@@ -3,3 +3,7 @@
There are also some other locale problems that show up in doctests
(for example https://groups.google.com/d/msg/sage-release/spalYgXKr-4/ZVsbgHIlAgAJ)
+
+
+See also:
+- #22659
I am not sure if this is related, but while compiling Cypari on macOS, every file gives a warning of this type:
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LC_ALL = "C.UTF-8",
LC_TERMINAL = "iTerm2",
LANG = "de_DE.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("de_DE.UTF-8").
The fallback that is used seems to work for me, though.
I just did a fresh build with `LC_ALL=C` and see no outstanding problems (python-3.7.8).
Maybe we should just revert that line? Why rock the boat?
Erik is the only other person who might know why it was added.
This will need testing on `centos-8` with Python 3.6.
Replying to @orlitzky:
> I just did a fresh build with `LC_ALL=C` and see no outstanding problems (python-3.7.8).
>
> Maybe we should just revert that line? Why rock the boat?

No. This was added for reasons. Specifically to ensure compatibility between how Python 3.6 and Python 3.7 set the default encoding. Without this, there were bugs on Python 3.6 with Python not using a unicode character encoding by default. See https://www.python.org/dev/peps/pep-0538/
The simplest way to deal with this problem for currently released versions of CPython is to explicitly set a more sensible locale when launching the application. For example:
LC_CTYPE=C.UTF-8 python3 ...
The C.UTF-8 locale is a full locale definition that uses UTF-8 for the LC_CTYPE category, and the same settings as the C locale for all other categories (including LC_COLLATE). It is offered by a number of Linux distributions (including Debian, Ubuntu, Fedora, Alpine and Android) as an alternative to the ASCII-based C locale. Some other platforms (such as HP-UX) offer an equivalent locale definition under the name C.utf8.
Mac OS X and other *BSD systems have taken a different approach: instead of offering a C.UTF-8 locale, they offer a partial UTF-8 locale that only defines the LC_CTYPE category. On such systems, the preferred environmental locale adjustment is to set LC_CTYPE=UTF-8 rather than to set LC_ALL or LANG.
Perhaps this should also try the `LC_CTYPE=UTF-8` mentioned here. Otherwise Dima's approach makes sense, though it can't guarantee that everything will just work on those systems. I can't recall exactly what broke without it but I do recall there was something. With Python 3.7 (the default when not using the system Python) this shouldn't be a problem since Python will basically force a UTF-8 locale for itself.
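The candidate names from the PEP 538 excerpt can be probed at run time. The following is only a sketch of the "check whether the locale change worked" idea using Python's standard `locale` module. The function name `first_usable_ctype_locale` is hypothetical, and it is not what the branch on this ticket does (the branch greps the output of `locale -a` in shell):

```python
import locale

# Names taken from the PEP 538 excerpt quoted above.
CANDIDATES = ("C.UTF-8", "C.utf8", "UTF-8")


def first_usable_ctype_locale(candidates=CANDIDATES):
    """Return the first name that setlocale() accepts for LC_CTYPE, else None.

    The process locale is restored before returning.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # no second argument: query only
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # this name is not available on this system
            return name
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    print(first_usable_ctype_locale())
```

A result of `None` corresponds to the case where none of the three names is installed, which is the situation reported for Arch in this thread.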
Replying to @embray:
> Perhaps this should also try the `LC_CTYPE=UTF-8` mentioned here. Otherwise Dima's approach makes sense, though it can't guarantee that everything will just work on those systems. I can't recall exactly what broke without it but I do recall there was something. With Python 3.7 (the default when not using the system Python) this shouldn't be a problem since Python will basically force a UTF-8 locale for itself.

How about only doing this `export LC_...=` on Python 3.x with x<7?
This should in particular make Arch people (apparently Arch has no C.UTF-8 or a similar locale, everything UTF-8 there is language-specific) happy, as their Python is new enough.
Ok, thanks for the information. I think the major take-away from PEP538 is,
With this change, any *nix platform that does not offer at least one of the C.UTF-8, C.utf8 or UTF-8 locales as part of its standard configuration would only be considered a fully supported platform for CPython 3.7+ deployments when a suitable locale other than the default C locale is configured explicitly (e.g. en_AU.UTF-8, zh_CN.gb18030).
I'm pretty sure we have files that actually need the UTF-8 encoding by now, so that rules out the possibility of "doing nothing" (leaving `LC_ALL=C` or `LC_CTYPE=C`) on python-3.6. And if we want to make python-3.6 work the way that python-3.7 does, then we're in the same situation as upstream is with respect to C.UTF-8: we have to consider python-3.6 with no C.UTF-8 (or equivalent) unsupported.
So I see two real options left:
`C.UTF-8` or `C.utf8` or `UTF-8` when python-3.6 is being used, and declare the system unsupported if we can't. If python-3.7+ is being used, we can set the locale to `C`, and it will coerce the locale to something utf8ish on its own. In either case, a lack of utf8 locale would be unsupported.
`en_US.UTF-8` would work too.
Long-term, as one of the largest python projects in existence, I think we probably have to suck it up and go with (1), even though it pains me to require a locale that glibc doesn't even ship and isn't POSIX. Whatever decisions python makes, we're stuck with.
Replying to @dimpase:
> How about only doing this `export LC_...=` on Python3.x with x<7 ? This should in particular make Arch people (apparently Arch has no C.UTF-8 or a similar locale, everything UTF-8 there is language-specific) happy, as their Python is new enough.

I think a combination of this and the current branch is the best we can do. On python-3.6, we should try to set `LC_ALL` to `C.UTF-8`, `C.utf8`, or `UTF-8`. If we can't, then we should leave it alone and pray that the user's locale is compatible with all of our SPKGs. That situation would be unsupported by sage.
On python-3.7+, we can set `LC_ALL=C`, and python itself will try to pick an appropriate UTF-8 version of the locale. What does python on arch do in this situation? It's possible that python itself will fail to find a suitable UTF-8 locale, but there's not a lot we can do if upstream python insists on a nonstandard locale. Arch will just have to reconsider their decision unless they want to be not-fully supported by upstream python-3.7+.
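The policy proposed in this comment can be written down as a small decision function. This is a sketch of the proposal only, not the code on the branch. The name `plan_lc_all` is hypothetical, and the set of available locales is passed in rather than detected:

```python
import sys

# Equivalent spellings named in the comment above, in order of preference.
UTF8_C_LOCALES = ("C.UTF-8", "C.utf8", "UTF-8")


def plan_lc_all(python_version, available_locales):
    """Return the LC_ALL value to export, or None to leave the user's locale alone.

    python_version: a (major, minor, ...) sequence such as sys.version_info.
    available_locales: a collection of locale names known to be installed.
    """
    if tuple(python_version[:2]) >= (3, 7):
        # Python 3.7+ tries to coerce the C locale to a UTF-8 one itself (PEP 538).
        return "C"
    for name in UTF8_C_LOCALES:
        if name in available_locales:
            return name
    # Python 3.6 with no suitable locale: leave the environment untouched
    # (the "unsupported" case described above).
    return None


if __name__ == "__main__":
    print(plan_lc_all(sys.version_info, {"C", "POSIX", "C.UTF-8"}))
```

Keeping the decision separate from the detection makes the Python 3.6 and 3.7+ cases easy to check independently.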
Personally I don't care what glibc or POSIX say on this. I think 1) is a fine option.
Apparently we carry the C.UTF-8 patch in Gentoo for systemd, who definitely don't care about portability:
Is anyone working on this?
Sage-the-python-library can't realistically support anything that CPython does not; it's 2020, who in their right mind doesn't support utf-8? Better diagnostics for non-compliant systems would be great but imho not a blocker.
Replying to @vbraun:
> Sage-the-python-library can't realistically support anything that CPython does not; it's 2020, who in their right mind doesn't support utf-8? Better diagnostics for non-compliant systems would be great but imho not a blocker.

These systems do support UTF-8, but not the (as of yet) non-standard `C.UTF-8` locale.
The current branch has the right idea, but since python-3.6 and python-3.7 act differently, it can be made a bit more precise. With python-3.7+, we can set `LC_ALL=C` and let python do the guessing. (Maybe it doesn't succeed, but officially Not Our Problem at that point.) With python-3.6, we can check for the `C.UTF-8` locale and set it when found, with a fallback to `LC_ALL=C`. The current branch does this unconditionally, but it should only do it for python-3.6, and we should check the other equivalent names `C.utf8` and `UTF-8` too.
I think it's worthwhile to not output a million scary error messages in the sage-9.2 release on these systems that have done nothing wrong. At the very least, we owe it to the Arch maintainers who do a lot for sage and would have to field the resulting bug reports (or patch this themselves). I'm sure there are BSDs where this is problematic too.
Arch locales maintainers just need to get C.UTF-8 locale, they are being silly (their argument - "it's an evil coming from Debian" - and I'm told they don't reopen the corresponding issue, as it's "decided". Meanwhile everybody else has C.UTF-8 locale, it's just them who don't)
It's a blocker because it is a regression regarding platform support.
Can we please get a fix done?
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
e5f6663 | only use locale C.UTF-8 if available, else C |
rebased over the latest beta
Description changed:
---
+++
@@ -4,6 +4,17 @@
There are also some other locale problems that show up in doctests
(for example https://groups.google.com/d/msg/sage-release/spalYgXKr-4/ZVsbgHIlAgAJ)
+And a failure building the documentation on `ubuntu-bionic-standard` (using `/usr/bin/python3.6`, https://github.com/mkoeppe/sage/runs/1106251169):
+
+```
+ [dochtml] UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2661: ordinal not in range(128)
+ [dochtml] Full log file: logs/dochtml.log
+Makefile:1876: recipe for target 'doc-html' failed
+```
+
+
+
+
See also:
- #22659
Using a freshly installed Ubuntu 18.04 (bionic) and with some French settings set somewhere so that `$ git pull` returns `Déjà à jour`, a French equivalent for `Already up to date`, running make on `9.2.beta12` yields the `[dochtml] UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2661: ordinal not in range(128)` error, see this post, right from the start of the build of the documentation.
With beta12 + current branch, I still get the same error after `make doc-clean && make`.
That being said, I now see where it comes from:
sage: with open('src/doc/en/reference/references/index.rst', 'r') as f: s = f.read()
sage: s[2600:2700]
' characteristic,* The Open Book Series, vol. 2, no. 1, pp. 37–53, Jan. 2019.\n\n.. [ABZ2007] \\R. Aharo'
sage: s[2661:2700]
'–53, Jan. 2019.\n\n.. [ABZ2007] \\R. Aharo'
The character `–` has many occurrences in that file and is possibly not the only occurrence of a non-ASCII character (for example names of authors...). So, I am not sure replacing them by `--` is the correct fix. And I don't see how this is related at all with the `C.UTF-8` configuration.
So your machine has no C.UTF-8 locale installed, right?
How can I figure this out? If it can help, I have this:
$ locale
LANG=fr_CA.UTF-8
LANGUAGE=fr_CA:fr_FR:en_GB:en
LC_CTYPE="fr_CA.UTF-8"
LC_NUMERIC=fr_FR.UTF-8
LC_TIME=fr_FR.UTF-8
LC_COLLATE="fr_CA.UTF-8"
LC_MONETARY=fr_FR.UTF-8
LC_MESSAGES="fr_CA.UTF-8"
LC_PAPER=fr_FR.UTF-8
LC_NAME=fr_FR.UTF-8
LC_ADDRESS=fr_FR.UTF-8
LC_TELEPHONE=fr_FR.UTF-8
LC_MEASUREMENT=fr_FR.UTF-8
LC_IDENTIFICATION=fr_FR.UTF-8
LC_ALL=
slabbe@miami ~ $ man locale
slabbe@miami ~ $ locale -a
C
C.UTF-8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IL
en_IL.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
fr_BE.utf8
fr_CA.utf8
fr_CH.utf8
french
fr_FR
fr_FR.iso88591
fr_FR.utf8
fr_LU.utf8
POSIX
in
+if test x`locale -a | grep C\.UTF-8` != x; then
+ export LC_ALL=C.UTF-8;
+else
+ export LC_ALL=C;
+fi
bit of this branch, could you change both `LC_ALL` to `LC_CTYPE` and try if it helps?
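One detail of the `grep` test above is that the pattern is not anchored to a whole line, so it is a substring search over the `locale -a` output rather than an exact name comparison. A hedged sketch of an exact check follows, in Python for illustration only. The helper names `normalize` and `has_locale` are hypothetical, and treating the `UTF-8` and `utf8` codeset spellings as equivalent is an assumption modelled on the `locale -a` listing above:

```python
def normalize(name):
    """Lower-case the codeset part and drop '-', so 'C.UTF-8' and 'C.utf8' compare equal."""
    base, dot, codeset = name.partition(".")
    return base + dot + codeset.lower().replace("-", "")


def has_locale(wanted, locale_a_output):
    """Return True if `wanted` appears as a whole line of `locale -a` output."""
    target = normalize(wanted)
    return any(
        normalize(line.strip()) == target
        for line in locale_a_output.splitlines()
        if line.strip()
    )


if __name__ == "__main__":
    # A few lines from the `locale -a` listing posted above.
    sample = "C\nC.UTF-8\nen_AG\nen_AG.utf8\nfr_FR\nfr_FR.iso88591\nfr_FR.utf8\nPOSIX\n"
    print(has_locale("C.UTF-8", sample))
    print(has_locale("en_US.UTF-8", sample))
```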
Replying to @dimpase:
> bit of this branch, could you change both `LC_ALL` to `LC_CTYPE` and try if it helps?

Same error after make doc-clean and make.
To me, it seems like an error of the following kind. That is, we are opening the `src/doc/en/reference/references/index.rst` file as a `bytes` type, and at some place (where?), we decode the bytes to ascii and then we get `UnicodeDecodeError` because it is not ascii at all. Here is a way to reproduce the same error message:
sage: with open('src/doc/en/reference/references/index.rst', 'rb') as f: b = f.read()
sage: b.decode('ascii')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-13-498050d5a3fb> in <module>
----> 1 b.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2661: ordinal not in range(128)
sage: s = b.decode('utf-8')
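The same mechanism can be shown without the Sage tree. The sketch below is self-contained. The sample string is made up to resemble the line quoted from `index.rst`, and the helper `try_decode` is hypothetical. It does not identify where in the documentation build the ASCII decoding actually happens, which is still an open question in this thread:

```python
import io
import os
import tempfile

# U+2013 EN DASH, the character found at position 2661 of index.rst.
line = "pp. 37" + "\u2013" + "53, Jan. 2019."
data = line.encode("utf-8")  # the dash becomes the three bytes e2 80 93


def try_decode(raw, encoding):
    """Decode `raw`, or describe the first offending byte if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return "{}: byte 0x{:02x} at position {}".format(
            type(exc).__name__, raw[exc.start], exc.start)


print(try_decode(data, "ascii"))
print(try_decode(data, "utf-8") == line)

# Passing an explicit encoding makes reading independent of the locale.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.rst")
    with io.open(path, "wb") as f:
        f.write(data)
    with io.open(path, "r", encoding="utf-8") as f:
        roundtrip = f.read()
```

The first byte of the UTF-8 encoded dash is `0xe2`, which matches the byte named in the error message from the documentation build.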
I am adding Frédéric in cc since he fixed a lot of those `UnicodeDecodeError` in recent times with the passage to Python 3.
Also, why does this work in Python 3.8 and not in Python 3.6?
In #29033 in `build/bin/sage-spkg` `LC_ALL` was changed to `C.UTF-8`. However, not all systems have it.

See also:
- #22659

Follow-up:
- #30008 Fix sage-system-python
- #30586 macOS: Doctest failures in some locales
- #30576 Python 3.6: Fix locale/encoding issues, then re-enable Python 3.6

CC: @antonio-rojas @mkoeppe @slel @orlitzky @kiwifb @embray @fchapoton
Component: build
Author: Dima Pasechnik, Matthias Koeppe
Branch: be47518
Reviewer: Matthias Koeppe, Dima Pasechnik
Issue created by migration from https://trac.sagemath.org/ticket/30053