Closed jehturner closed 2 years ago
I'll need to take a bit of a break from troubleshooting these tests because I now have 3 user support questions and an operational installation request pending, amongst other things...
Whow! I am impressed by this debugging strength! Thank you very much for digging into!
What I am not sure about is whether the order within __dict__()
is guaranteed to match the order of appearance in the source code (some discussion here). Without this, I would be afraid that a future Python version will make this re-appear. IMO here it would be definitely worth to have a test for this.
What I am not sure about is whether the order within dict() is guaranteed to match the order of appearance in the source code (some discussion here).
What I read is that __dict__
does explicitly preserve the order now, whereas your link seems to be discussing something slightly different, but I'll check again in a bit.
One question, for the regexp itself: I would guess that the regular expressions should be mutually exclusive, at least when I read the comment in t_string()
? https://github.com/iraf-community/pyraf/blob/7c63d6a2a569afdeefee1622ba4c24ef97daaea4/pyraf/clscan.py#L272-L275
But aren't there just the rectangular brackets [
]
missing in the regexp? If they would be forbidden as well, then the string wouldn't match here. {
}
and (
)
are forbidden, why not [
]
? This would be a bit more robust than depening on the order.
Here it's documented that the order in which class attributes appear in the code is preserved by __dict__
from Python 3.6 onwards:
https://www.python.org/dev/peps/pep-0520/
https://docs.python.org/3.6/whatsnew/3.6.html#whatsnew36-pep520
This PEP is a little confusing, because the behaviour apparently changed further between the PEP being accepted and the actual 3.6 release (or something like that), due to some contemporaneous changes to the dict
class itself -- see the "Note" before the Abstract. Apparently, a plain Python dict
is now guaranteed to retain its insertion order, as of Python 3.7 (while in CPython 3.6 it began doing so as an implementation detail)! I hadn't noticed that.
Yes, I was wondering whether the regexes are intended to be mutually exclusive and this one was just an oversight -- especially since I'm not seeing other cases that get messed up by the change in ordering and considering that Python 2 only places them in a different order by chance. Also, I imagine that relying on ordering to avoid collisions would get quite complex (if not impossible) to determine. So you're probably right.
I actually hadn't had a close look at the individual regexes yet and just wanted to write down what I'd found in the time available last night, before discussing the intended behaviour further, so maybe I asserted a solution too quickly. If @cdsontag agrees that the regex ordering (ideally) shouldn't matter then perhaps the question is whether there's a good reason why the square brackets are missing from t_string
, making it a special case? Further testing should also give us a better idea.
I had a quick chat with Chris elsewhere and he said "As far as tokens being mutually exclusive, I don’t recall no, but if the code is set up so that they are loaded into a dict then (since it was written in Py2) I think we can assume it was originally written to assume order did not matter".
I tried to check the CL programmers manual for what is allowed in an unquoted string but didn't find a definitive answer:
Strings are zero or more characters surrounded by single or double quotes, ’ or “. The quotes are not needed in two cases. One is in response to a query. In that case, everything up to the end of the typed line is taken to be the string. If the quotes are used, however, they will be discarded. The other case is when specifying the value of a parameter on the command line when running a task. If the corresponding parameter is of type string, filename or is list-structured and the string need not be used in an expression, then the quotes are optional.
So, I played a bit with ecl:
print [abc]
works,print ab]
works, butprint [ab
gives a continuation line until one closes the bracket.So I guess [
isn't meant to universally work, and we could pragmatically exclude it in the regexp.
Just changing the order would however be less invasive.
I'll probably just accept whatever you propose here :grin:
Yes, I'd figured out that it's an odd case, with eg. _BasicScanner_3.t_doublequote
matching the quoted string value in my 2-line CL script above. It's very useful that you found that description of the cases in which an unquoted string is expected to occur. I think we should ideally try to restrict t_string
as you proposed and check that it doesn't cause undue breakage, but I'll try to look at those 2 cases a bit more first. I'm now recalling some usage of single brackets in parameter values to refer to a like-named package parameter, but I think those are parentheses rather than square brackets(?).
Yes, what I'm thinking of is blah=)_.blah
in a task's parameters, meaning "get the value from the host package's blah
parameter instead" (since the paragraph you cited says that string parameter values can be unquoted). But that's not a square bracket...
I think the more obvious issue here may be passing things like image sections or extension names as parameter values? In particular, if you pass something like filename[SCI][*,100]
as an argument, it seems like that would now require quotation marks if we restrict brackets in t_string
, which is quite a major UI change. ~Moreover, I tried changing t_string
to confirm this and pyraf
wouldn't start...~
Yes, so here's the current behaviour:
--> imstat dev$pix[*,100]
# IMAGE NPIX MEAN STDDEV MIN MAX
dev$pix[*,100] 512 96.68 42.15 38. 398.
and here's what you get when t_string
disallows [
and ]
:
--> imstat dev$pix[*,100]
File "<string>", line None
SyntaxError: CL syntax error at `[' (file "string_proc", line 1)
That's not going to work :slightly_frowning_face:.
I think we've found the good reason why the square brackets are missing from t_string
and it's hard to see a better solution than changing the order of the regexes (?), but at least it seems clear that t_string
+ t_lbracket
are a special-case collision.
Ah, you are ofcourse right. So, probably just exchanging the methods is the way to go.
:+1: I do notice that the t_string
in _ComputeStartScanner_1
is much more restricted, but that's obviously a different case (not sure exactly where it's used). Anyway, I think this small re-ordering should work and we do have ~900 tests we can run it on to make sure it doesn't do anything too crazy :slightly_smiling_face:.
A compromise would be to change the regex for t_string
to something like this, which is more convoluted but only disallows [
as the first character:
r'[^ \t\n()\[\\;{}&][^ \t\n()\\;{}&]*(\\(.|\n)[^ \t\n()\\;{}&]*)*'
This would still fail for cases like imcombine.statsec=[*,100]
, with no quotation marks, but so does reversing t_string
and t_lbracket
(and Python 2), because the latter regex eats the first character of the string, so this limitation is not a regression. But given that Python >=3.6 guarantees the __dict__
ordering, I think the difference between the 2 solutions is pretty academic; they are essentially identical.
I agree, and probably, if in doubt we should take the solution that is less invasive. IMO re-ordering keeps the old behavior, which is OK for a bug fix.
I'll try & add some sort of test when I'm back from my help requests etc.
Sorry, one last comment for the record, regarding the original intention...
After digesting this a bit better, it seems clear that the regex precedence is supposed to be determined by the scanner class hierarchy, where all the operators within a given parent class such as _CommandScanner_1
have equal precedence and their order should not matter, but (eg.) the _BasicScanner_3
patterns have precedence over those in _CommandScanner_1
. So the bug here is that t_string
and t_lbracket
both appear in the same scanner class without having mutually exclusive patterns (this is consistent with what Chris said). Thus the canonical solution may be to split t_string
out of _CommandScanner_1
into a new class (if not changing the regex). However, it's significantly simpler in Python 3 just to take advantage of the method ordering, so best do that and move onto the next problem, rather than overcomplicating things. I just thought if anyone refers back to this, it may be helpful to know the conclusion about how it's intended to work.
@jehturner would you create a Pull Request yourself or shall I do it based on this discussion?
I can do it this week; I've just been tied up with other things. I started thinking about how best to run a CL script with pytest. It seems that something like this could work, based on some code in the pyraf
start-up script?
cltext = '''
char cenwavvalue[4]
cenwavvalue[1] = "580"
'''
iraf.task(cmd_line=cltext, IsCmdString=1)
iraf.cmd_line()
I can also check again what our Regression Testing Framework does, though it's really quite convoluted... I suppose there is also Popen(['pyraf', '-c', 'cl < test.cl'])
, since pyraf
does return an exit status of 1 when the error occurs; that's messier but is the closest approximation to how a user would run it. Any thoughts? I don't think you already have any CL-syntax tests, have you? Or did I overlook them when having a glance at what's there?
Actually, I now see one or two cases in test_cli.py
that do something similar to the above, so that's probably the answer...
Yes, a simple test would be
def test_clparser_order(tmpdir):
# Check that square brackets are parsed correctly
# https://github.com/iraf-community/pyraf/issues/115
stdout = io.StringIO()
iraf.task(xyz='char cenwavvalue[4]\n'
'cenwavvalue[1] = "580"',
IsCmdString=True)
iraf.xyz(StdoutAppend=stdout)
assert stdout.getvalue() == ""
As for test_division_task() one could do the test in both ECL and non-ECL mode; however we already know it is independent of it.
Maybe worth adding a print
statement, just to make sure it did the expected thing (eg. cenwavvalue
should be 580 INDEF INDEF INDEF
)?
(I haven't actually checked that the other elements are guaranteed to be initialized in that way, but that's what happens in practice. Or maybe better just test cenwavvalue[1]
, since the above string depends on formatting as well as how uninitialized values are dealt with.)
You are right here; I would however just test the value that was previously set. So, I'll wait until you have time, no problem. Thank you very much for your efforts!
Some of our tests are failing on Python 3 with errors like this:
The problem can be reduced to the following script:
With PyRAF 2.1.15 on Python 2.7.16 (where the error does not occur), this produces the following output, showing how the CL script is supposed to get tokenized:
But on Python 3.7.9 (with either the same PyRAF 2.1.15 or a recent checkout of 2.2.1dev3), it produces the following, where the array subscript on the second line is treated as a single string, '[1]', rather than being split into opening & closing brackets and an integer:
PyRAF matches different types of tokens using a regex for each type, defined in
clscan.py
. These get concatenated together into a long regex for a given "scanner" -- in this case_CommandScanner
-- which is used to determine which of the patterns matches the next character(s). After quite a bit of digging, it seems that the problem is a change in the order of the individual regexes for_CommandScanner
. Specifically, thet_string
regex now appears beforet_lbracket
instead of afterwards, causing[1]
to get matched as a string instead of matching the opening bracket.The order in which the component regexes get concatenated (and hence their precedence) is determined by
_namelist
ingeneric.py
. This iterates over the methods in the scanner class__dict__.keys()
, which behaves differently in Python 2 & 3. In Python 2, it's just an ordinarydict
, so I think its order is undefined but (fortunately) appears to be repeatable... In Python >=3.6, it's amappingproxy
object that preserves the order in which the methods are defined in the code. Sot_string
has moved beforet_lbracket
just because that's how they are written in the (parent) class_CommandScanner_1
.https://github.com/iraf-community/pyraf/blob/7c63d6a2a569afdeefee1622ba4c24ef97daaea4/pyraf/clscan.py#L269-L291
The error can therefore be solved simply by reversing the order of those 2 methods in in
_CommandScanner_1
, puttingt_lbracket
first. I confirm that this makes the error disappear and my test now runs to completion on Python 3 (with some small numerical differences that will be unrelated).I would appreciate any input from @cdsontag here, because although I'm confident that this fixes the problem I was troubleshooting, I'm not sure what the consequences of reordering these regexes might be more generally. I think there can't be many other cases where the parsing is messed up in this way, since most of our tests are passing and I'm not spotting other errors that look similar but not identical. I suppose we could study the regex ordering that Python 2 actually produces and try to reproduce that, which should be safe, but ~I'm not sure whether it's possible to reproduce exactly, given that inheritance will force grouping of the methods within the
__dict__
in Python 3~. Thanks.