error: regexp: unrecognized character after (? or (?- at position 6 of expression

cbm755 commented 2 years ago

I got this while doctesting Symbolic on Fedora 36. It did not happen when I built on the upcoming Fedora 37.

This is Fedora's Octave 6.4.0 (versus Fedora 37, which has 7.1.0). The machine was "ppc64le" and I'm not sure what that is... but I recall we had problems in the past with regexp differences between x86 and arm, so maybe this is similar...

This was with the current release, doctest v0.7.0. It would be nice to test with the current main branch, but I don't have shell access to the machine :(

Symbolic pkg v3.0.0: Python communication link active, SymPy v1.10.1.
Doctest v0.7.0: this is Free Software without warranty, see source.
error: regexp: unrecognized character after (? or (?- at position 6 of expression
error: called from
    doctest_collect>parse_texinfo at line 489 column 8
    doctest_collect>extract_docstring at line 336 column 26
    doctest_collect>collect_targets_class at line 297 column 38
    doctest_collect at line 146 column 11
    doctest_collect at line 126 column 13
    doctest at line 349 column 11
error: Bad exit status from /var/tmp/rpm-tmp.Fi7FuX (%check)

Upstream: https://koji.fedoraproject.org/koji/taskinfo?taskID=89344353 (not sure how long those logs are kept).

cbm755 commented 2 years ago

re: ppc64le, Wikipedia says:

ppc64le is a pure little-endian mode that has been introduced with the POWER8 as the prime target for technologies provided by the OpenPOWER Foundation, aiming at enabling porting of the x86 Linux-based software with minimal effort.

cbm755 commented 2 years ago

This seems to be caused by LANG=C.

On my own non-PPC64le machine (x86, Fedora 35, Octave 6.4.0), I can reproduce a similar thing using:

$ export LANG=C
$ octave
>> pkg load doctest
>> doctest doctest

Doctest v0.7.0+: this is Free Software without warranty, see source.

error: regexprep: nothing to repeat at position 9 of expression
error: called from
    doctest_collect>parse_texinfo at line 487 column 9
    doctest_collect>extract_docstring at line 343 column 26
    doctest_collect>collect_targets_function at line 257 column 36
    doctest_collect at line 149 column 10
    doctest at line 350 column 11

I wonder if we knew about this before? We have UTF-8 chars in our regexps, e.g., line 487 is:

      L = regexprep (L, '^(\s*)(?:⇒|=>|⊣|-\||error→|error->)', '$1', 'once', 'lineanchors');

That does sound like A Bad Thing when the locale is C...
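
A minimal sketch of how a "nothing to repeat at position 9" error could arise from that pattern. It is written in Python only because its `re` module reports the same class of error; it is not Octave's regexp engine. The assumption, which nothing in this thread confirms, is that with an ASCII charset each non-ASCII byte of the pattern ends up as a literal `?` before the regex engine sees it. The helper names `mangle_to_ascii` and `compile_error` are made up for illustration.

```python
import re

# The pattern from parse_texinfo, as written in the (UTF-8) source file.
PATTERN = r'^(\s*)(?:⇒|=>|⊣|-\||error→|error->)'


def mangle_to_ascii(text):
    """Hypothetical stand-in for a lossy conversion: every byte of the
    UTF-8 encoding that is not plain ASCII becomes a literal '?'."""
    return ''.join(chr(b) if b < 0x80 else '?' for b in text.encode('utf-8'))


def compile_error(pattern):
    """Return the re.error raised when compiling pattern, or None."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return exc
    return None


mangled = mangle_to_ascii(PATTERN)
print(mangled)
print(compile_error(PATTERN))   # None: the intact pattern compiles
print(compile_error(mangled))   # the mangled pattern is rejected
```

Under that assumption the group opens with `(?:` followed directly by `?`, a quantifier with nothing in front of it. Counting from zero, that `?` is character 9 of the pattern.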

cbm755 commented 2 years ago

I can reproduce this on Fedora 35 and 36. So far, I cannot reproduce it on Octave containers (based on Ubuntu) using:

docker run -it --rm gnuoctave/octave:7.1.0 bash
export LANG=C
octave
pkg install -forge doctest
doctest doctest

Nor can I reproduce it on Ubuntu 20.04 on a real computer.

But I can reproduce it on an ubuntu:22.04 container:

docker run -it --rm ubuntu:22.04
apt-get update
apt-get install --no-install-recommends  octave octave-doctest
octave
pkg load doctest
doctest doctest

(Interestingly, I do not need export LANG=C here...)

cbm755 commented 2 years ago

All the systems above are using Octave <= 6.4.0. I have not reproduced this using 7.1.0 anywhere.

Workaround: put this before using the doctest package:

__mfile_encoding__ ('utf-8')

(it seems to work for me whether I do this before or after pkg load doctest as long as I do it before doctest doctest).

Edit, FTR:

docker run -it --rm ubuntu:22.04
apt-get update
apt-get install --no-install-recommends  octave octave-doctest
octave
__mfile_encoding__ ('utf-8')   # workaround
pkg load doctest
doctest doctest

cbm755 commented 2 years ago

For 7.1.0, it's possible that DTRT here is to add a .oct-config file with contents encoding=utf-8. However, I think those are only supported on Octave >= 7, and I cannot reproduce this error there, so it's hard to tell if that fixes anything or not. But it does seem like The Right Thing!

cbm755 commented 2 years ago

On Octave 6.4.0 on Fedora 35/36, exporting LANG=C changes Octave's __locale_charset__ from UTF-8 to ANSI_X3.4-1968.

Doing the same on the gnuoctave/octave:7.1.0 container DOES NOT change it:

podman run -it --rm gnuoctave/octave:7.1.0 bash
export LANG=C
octave
 __locale_charset__ 
ans = UTF-8

and the same with 6.3.0/6.4.0. So that explains (sort of) why I cannot reproduce on Ubuntu 20.04 (which the 6.x.0 container images are based on).

mmuetzel commented 2 years ago

Maybe LC_ALL is also set?

cbm755 commented 2 years ago

Bingo! Now I can reproduce it on Ubuntu 20.04 (and the gnu-octave/octave containers based on it)

docker run -it --rm gnuoctave/octave:6.4.0 bash
export LC_ALL=C
octave
>> __locale_charset__ 
  ans = ANSI_X3.4-1968
>> pkg install -forge doctest
>> pkg load doctest 
>> doctest doctest
Doctest v0.7.0: this is Free Software without warranty, see source.

error: regexprep: nothing to repeat at position 9 of expression
error: called from
    doctest_collect>parse_texinfo at line 480 column 9
    doctest_collect>extract_docstring at line 336 column 26
    doctest_collect>collect_targets_function at line 250 column 36
    doctest_collect at line 142 column 10
    doctest at line 349 column 11
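
The LC_ALL hint fits the usual POSIX rule for locale variables: `LC_ALL` overrides `LC_CTYPE`, which overrides `LANG`. A rough Python sketch of that lookup, for illustration only: the function name `effective_ctype` is invented here, and the example environments are hypothetical, not what any of the machines or images above actually set.

```python
def effective_ctype(env):
    """Return the locale name that governs character handling (LC_CTYPE),
    following the POSIX precedence LC_ALL > LC_CTYPE > LANG.
    Unset or empty variables are skipped; the fallback is "C"."""
    for name in ('LC_ALL', 'LC_CTYPE', 'LANG'):
        value = env.get(name)
        if value:
            return value
    return 'C'


# Hypothetical environments, for illustration only.
only_lang = {'LANG': 'C'}
lang_shadowed = {'LANG': 'C', 'LC_ALL': 'en_US.UTF-8'}
lc_all_forced = {'LANG': 'en_US.UTF-8', 'LC_ALL': 'C'}

print(effective_ctype(only_lang))       # C
print(effective_ctype(lang_shadowed))   # en_US.UTF-8
print(effective_ctype(lc_all_forced))   # C
```

If something in the environment already sets `LC_ALL` or `LC_CTYPE` to a UTF-8 locale, `export LANG=C` alone would not change the charset, whereas `export LC_ALL=C` always wins. Whether the container images preset such a variable was not checked in this thread.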

mmuetzel commented 2 years ago

You might still need to set __mfile_encoding__ ('utf-8') if you want to make sure that the files that you want to test (as opposed to the sources of doctest) are read as UTF-8. (Even after the changes in #252 are applied.)

That doesn't mean that #252 shouldn't be applied. IIUC, without that change, it would show that error even if the tested files only contained ASCII characters.

Edit: Similarly, you should set __mfile_encoding__ ('CP1252') if you know that the files you'd like to test use that encoding. That would require #252 to work correctly though.
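
To illustrate why the declared encoding has to match the files being tested, here is a small Python sketch. Python is used purely for illustration; this is not how Octave or doctest read files, and the sample line is made up. It decodes the same bytes under two encodings.

```python
# A made-up docstring line containing one of the arrow markers that
# appear in doctest's regexp.
line = 'error→ undefined'

utf8_bytes = line.encode('utf-8')

# Decoded with the matching encoding, the text round-trips.
as_utf8 = utf8_bytes.decode('utf-8')

# Decoded as CP1252, the three bytes of the arrow turn into three
# unrelated characters, so a pattern looking for 'error→' cannot match.
as_cp1252 = utf8_bytes.decode('cp1252')

print(as_utf8 == line)              # True
print('→' in as_cp1252)             # False
print(len(as_cp1252) - len(line))   # 2
```

The reverse mismatch (a CP1252 file read as UTF-8) is lossy in the same way, which is why the setting has to follow the tested files rather than the doctest sources.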

cbm755 commented 2 years ago

I'm leaning toward that being the user's problem... Unless we want to define that Doctest only reads utf-8 encoded files (I don't think we do).

I think we will want some unit tests of CP1252 encoded input working correctly.

We also need to document this or at least give some hints in help doctest.

I haven't found it documented in upstream Octave; maybe .oct-config should be mentioned in help __mfile_encoding__ (although putting it in the help of a hidden function seems not quite right...)

mmuetzel commented 2 years ago

> I haven't found it documented in upstream Octave; maybe .oct-config should be mentioned in help __mfile_encoding__ (although putting it in the help of a hidden function seems not quite right...)

It is mentioned in the documentation of dir_encoding.

gnu-octave / octave-doctest

error: regexp: unrecognized character after (? or (?- at position 6 of expression #251