`CL syntax error at '='` on Py 3 when assigning value to array element

jehturner commented 2 years ago

Some of our tests are failing on Python 3 with errors like this:

CL syntax error at `=' (file "/rtfproc/rtftests/gemini_iraf/gmosspec2/testgsreduction/test6/testgsreduction_test6.cl", line 111)

The problem can be reduced to the following script:

from pyraf import clscan

cltext = '''
char cenwavvalue[4]
cenwavvalue[1] = "580"
'''

scanner = clscan.CLScanner()
tokens = scanner.tokenize(cltext)

print(tokens)

With PyRAF 2.1.15 on Python 2.7.16 (where the error does not occur), this produces the following output, showing how the CL script is supposed to get tokenized:

[char, cenwavvalue, [, 4, ], NEWLINE, cenwavvalue, [, 1, ], =, '580', NEWLINE]

But on Python 3.7.9 (with either the same PyRAF 2.1.15 or a recent checkout of 2.2.1dev3), it produces the following, where the array subscript on the second line is treated as a single string, '[1]', rather than being split into opening & closing brackets and an integer:

[char, cenwavvalue, [, 4, ], NEWLINE, cenwavvalue, '[1]', =, '580', NEWLINE]

PyRAF matches different types of tokens using a regex for each type, defined in clscan.py. These get concatenated together into a long regex for a given "scanner" -- in this case _CommandScanner -- which is used to determine which of the patterns matches the next character(s). After quite a bit of digging, it seems that the problem is a change in the order of the individual regexes for _CommandScanner. Specifically, the t_string regex now appears before t_lbracket instead of afterwards, causing [1] to get matched as a string instead of matching the opening bracket.

The order in which the component regexes get concatenated (and hence their precedence) is determined by _namelist in generic.py. This iterates over the methods in the scanner class __dict__.keys(), which behaves differently in Python 2 & 3. In Python 2, it's just an ordinary dict, so I think its order is undefined but (fortunately) appears to be repeatable... In Python >=3.6, it's a mappingproxy object that preserves the order in which the methods are defined in the code. So t_string has moved before t_lbracket just because that's how they are written in the (parent) class _CommandScanner_1.

https://github.com/iraf-community/pyraf/blob/7c63d6a2a569afdeefee1622ba4c24ef97daaea4/pyraf/clscan.py#L269-L291

The error can therefore be solved simply by reversing the order of those 2 methods in in _CommandScanner_1, putting t_lbracket first. I confirm that this makes the error disappear and my test now runs to completion on Python 3 (with some small numerical differences that will be unrelated).

I would appreciate any input from @cdsontag here, because although I'm confident that this fixes the problem I was troubleshooting, I'm not sure what the consequences of reordering these regexes might be more generally. I think there can't be many other cases where the parsing is messed up in this way, since most of our tests are passing and I'm not spotting other errors that look similar but not identical. I suppose we could study the regex ordering that Python 2 actually produces and try to reproduce that, which should be safe, but ~I'm not sure whether it's possible to reproduce exactly, given that inheritance will force grouping of the methods within the __dict__ in Python 3~. Thanks.

jehturner commented 2 years ago

I'll need to take a bit of a break from troubleshooting these tests because I now have 3 user support questions and an operational installation request pending, amongst other things...

olebole commented 2 years ago

Whow! I am impressed by this debugging strength! Thank you very much for digging into!

What I am not sure about is whether the order within __dict__() is guaranteed to match the order of appearance in the source code (some discussion here). Without this, I would be afraid that a future Python version will make this re-appear. IMO here it would be definitely worth to have a test for this.

jehturner commented 2 years ago

What I am not sure about is whether the order within dict() is guaranteed to match the order of appearance in the source code (some discussion here).

What I read is that __dict__ does explicitly preserve the order now, whereas your link seems to be discussing something slightly different, but I'll check again in a bit.

olebole commented 2 years ago

One question, for the regexp itself: I would guess that the regular expressions should be mutually exclusive, at least when I read the comment in t_string()? https://github.com/iraf-community/pyraf/blob/7c63d6a2a569afdeefee1622ba4c24ef97daaea4/pyraf/clscan.py#L272-L275

But aren't there just the rectangular brackets [ ] missing in the regexp? If they would be forbidden as well, then the string wouldn't match here. { } and ( ) are forbidden, why not [ ]? This would be a bit more robust than depening on the order.

jehturner commented 2 years ago

Here it's documented that the order in which class attributes appear in the code is preserved by __dict__ from Python 3.6 onwards: https://www.python.org/dev/peps/pep-0520/ https://docs.python.org/3.6/whatsnew/3.6.html#whatsnew36-pep520

This PEP is a little confusing, because the behaviour apparently changed further between the PEP being accepted and the actual 3.6 release (or something like that), due to some contemporaneous changes to the dict class itself -- see the "Note" before the Abstract. Apparently, a plain Python dict is now guaranteed to retain its insertion order, as of Python 3.7 (while in CPython 3.6 it began doing so as an implementation detail)! I hadn't noticed that.

jehturner commented 2 years ago

Yes, I was wondering whether the regexes are intended to be mutually exclusive and this one was just an oversight -- especially since I'm not seeing other cases that get messed up by the change in ordering and considering that Python 2 only places them in a different order by chance. Also, I imagine that relying on ordering to avoid collisions would get quite complex (if not impossible) to determine. So you're probably right.

I actually hadn't had a close look at the individual regexes yet and just wanted to write down what I'd found in the time available last night, before discussing the intended behaviour further, so maybe I asserted a solution too quickly. If @cdsontag agrees that the regex ordering (ideally) shouldn't matter then perhaps the question is whether there's a good reason why the square brackets are missing from t_string, making it a special case? Further testing should also give us a better idea.

jehturner commented 2 years ago

I had a quick chat with Chris elsewhere and he said "As far as tokens being mutually exclusive, I don’t recall no, but if the code is set up so that they are loaded into a dict then (since it was written in Py2) I think we can assume it was originally written to assume order did not matter".

olebole commented 2 years ago

I tried to check the CL programmers manual for what is allowed in an unquoted string but didn't find a definitive answer:

Strings are zero or more characters surrounded by single or double quotes, ’ or “. The quotes are not needed in two cases. One is in response to a query. In that case, everything up to the end of the typed line is taken to be the string. If the quotes are used, however, they will be discarded. The other case is when specifying the value of a parameter on the command line when running a task. If the corresponding parameter is of type string, filename or is list-structured and the string need not be used in an expression, then the quotes are optional.

So, I played a bit with ecl:

print [abc] works,
print ab] works, but
print [ab gives a continuation line until one closes the bracket.

So I guess [ isn't meant to universally work, and we could pragmatically exclude it in the regexp. Just changing the order would however be less invasive. I'll probably just accept whatever you propose here :grin:

jehturner commented 2 years ago

Yes, I'd figured out that it's an odd case, with eg. _BasicScanner_3.t_doublequote matching the quoted string value in my 2-line CL script above. It's very useful that you found that description of the cases in which an unquoted string is expected to occur. I think we should ideally try to restrict t_string as you proposed and check that it doesn't cause undue breakage, but I'll try to look at those 2 cases a bit more first. I'm now recalling some usage of single brackets in parameter values to refer to a like-named package parameter, but I think those are parentheses rather than square brackets(?).

jehturner commented 2 years ago

Yes, what I'm thinking of is blah=)_.blah in a task's parameters, meaning "get the value from the host package's blah parameter instead" (since the paragraph you cited says that string parameter values can be unquoted). But that's not a square bracket...

jehturner commented 2 years ago

I think the more obvious issue here may be passing things like image sections or extension names as parameter values? In particular, if you pass something like filename[SCI][*,100] as an argument, it seems like that would now require quotation marks if we restrict brackets in t_string, which is quite a major UI change. ~Moreover, I tried changing t_string to confirm this and pyraf wouldn't start...~

jehturner commented 2 years ago

Yes, so here's the current behaviour:

--> imstat dev$pix[*,100]
#               IMAGE      NPIX      MEAN    STDDEV       MIN       MAX
       dev$pix[*,100]       512     96.68     42.15       38.      398.

and here's what you get when t_string disallows [ and ]:

--> imstat dev$pix[*,100]
  File "<string>", line None
SyntaxError: CL syntax error at `[' (file "string_proc", line 1)

That's not going to work :slightly_frowning_face:.

I think we've found the good reason why the square brackets are missing from t_string and it's hard to see a better solution than changing the order of the regexes (?), but at least it seems clear that t_string + t_lbracket are a special-case collision.

olebole commented 2 years ago

Ah, you are ofcourse right. So, probably just exchanging the methods is the way to go.

jehturner commented 2 years ago

:+1: I do notice that the t_string in _ComputeStartScanner_1 is much more restricted, but that's obviously a different case (not sure exactly where it's used). Anyway, I think this small re-ordering should work and we do have ~900 tests we can run it on to make sure it doesn't do anything too crazy :slightly_smiling_face:.

jehturner commented 2 years ago

A compromise would be to change the regex for t_string to something like this, which is more convoluted but only disallows [ as the first character:

r'[^ \t\n()\[\\;{}&][^ \t\n()\\;{}&]*(\\(.|\n)[^ \t\n()\\;{}&]*)*'

This would still fail for cases like imcombine.statsec=[*,100], with no quotation marks, but so does reversing t_string and t_lbracket (and Python 2), because the latter regex eats the first character of the string, so this limitation is not a regression. But given that Python >=3.6 guarantees the __dict__ ordering, I think the difference between the 2 solutions is pretty academic; they are essentially identical.

olebole commented 2 years ago

I agree, and probably, if in doubt we should take the solution that is less invasive. IMO re-ordering keeps the old behavior, which is OK for a bug fix.

jehturner commented 2 years ago

I'll try & add some sort of test when I'm back from my help requests etc.

jehturner commented 2 years ago

Sorry, one last comment for the record, regarding the original intention...

After digesting this a bit better, it seems clear that the regex precedence is supposed to be determined by the scanner class hierarchy, where all the operators within a given parent class such as _CommandScanner_1 have equal precedence and their order should not matter, but (eg.) the _BasicScanner_3 patterns have precedence over those in _CommandScanner_1. So the bug here is that t_string and t_lbracket both appear in the same scanner class without having mutually exclusive patterns (this is consistent with what Chris said). Thus the canonical solution may be to split t_string out of _CommandScanner_1 into a new class (if not changing the regex). However, it's significantly simpler in Python 3 just to take advantage of the method ordering, so best do that and move onto the next problem, rather than overcomplicating things. I just thought if anyone refers back to this, it may be helpful to know the conclusion about how it's intended to work.

olebole commented 2 years ago

@jehturner would you create a Pull Request yourself or shall I do it based on this discussion?

jehturner commented 2 years ago

I can do it this week; I've just been tied up with other things. I started thinking about how best to run a CL script with pytest. It seems that something like this could work, based on some code in the pyraf start-up script?

cltext = '''
char cenwavvalue[4]
cenwavvalue[1] = "580"
'''

iraf.task(cmd_line=cltext, IsCmdString=1)
iraf.cmd_line()

I can also check again what our Regression Testing Framework does, though it's really quite convoluted... I suppose there is also Popen(['pyraf', '-c', 'cl < test.cl']), since pyraf does return an exit status of 1 when the error occurs; that's messier but is the closest approximation to how a user would run it. Any thoughts? I don't think you already have any CL-syntax tests, have you? Or did I overlook them when having a glance at what's there?

jehturner commented 2 years ago

Actually, I now see one or two cases in test_cli.py that do something similar to the above, so that's probably the answer...

https://github.com/iraf-community/pyraf/blob/7c63d6a2a569afdeefee1622ba4c24ef97daaea4/pyraf/tests/test_cli.py#L41-L49

olebole commented 2 years ago

Yes, a simple test would be

def test_clparser_order(tmpdir):
    # Check that square brackets are parsed correctly
    # https://github.com/iraf-community/pyraf/issues/115
    stdout = io.StringIO()
    iraf.task(xyz='char cenwavvalue[4]\n'
              'cenwavvalue[1] = "580"',
              IsCmdString=True)
    iraf.xyz(StdoutAppend=stdout)
    assert stdout.getvalue() == ""

As for test_division_task() one could do the test in both ECL and non-ECL mode; however we already know it is independent of it.

jehturner commented 2 years ago

Maybe worth adding a print statement, just to make sure it did the expected thing (eg. cenwavvalue should be 580 INDEF INDEF INDEF)?

(I haven't actually checked that the other elements are guaranteed to be initialized in that way, but that's what happens in practice. Or maybe better just test cenwavvalue[1], since the above string depends on formatting as well as how uninitialized values are dealt with.)

olebole commented 2 years ago

You are right here; I would however just test the value that was previously set. So, I'll wait until you have time, no problem. Thank you very much for your efforts!

iraf-community / pyraf

`CL syntax error at '='` on Py 3 when assigning value to array element #115