acs_destripe_plus with mask loops too many times for no reason

pllim commented 7 years ago

Please investigate ASAP since ACS Team is fond of this task.

From the log at https://ssb.stsci.edu/pandokia/pandokia.cgi?query=detail&key_id=106065924 , you can't tell much. But when you go in the test machine, you will see that jc8o04gh2.tra is a whooping 5.5 MB and the log shows CALACS went into a loop many, many times.

Why is this happening? What has changed since it last worked?

c/c @mcara since he wrote the masking stuff

jamienoss commented 7 years ago

I have never touched this code base, only updated the docs. I believe @mcara would be better assigned to this issue. Especially since it sounds like a regression.

mcara commented 7 years ago

I think it is important to note that:

Tests started failing on May 10 as seen the following screenshot:
The error

os.remove(blctmp_name) OSError(2, 'No such file or directory')

has nothing to do with bitmasks and rather it is due to code trying to delete a file that does not exist. Judging by name blctmp_name it appears that the code is attempting to delete a temporary file produced by hstcal (acs_destripe does run hstcal isn't it? @pllim you worked on that part of the code).

I checked the repository around May 9, 2017 and did not find any changes to neither acstools nor stsci.tools (where bitmasking stuff leaves). The only change to a package that seems relevant to this issue appears to be https://github.com/spacetelescope/hstcal/commit/732d69ce8ae39224bc0c4a58a9fe75586fca97c0 and a couple of other changes earlier changes related to Gen2 CTE.

@jamienoss I think you should re-assign this issue to yourself 😈

mcara commented 7 years ago

For the very least we need to understand how did changes in hstcal affected acstools and then we can figure out how to "fix" acstools if the changes in hstcal are expected and were planned.

mcara commented 7 years ago

I've looked at jc8o04gh2.tra and it is not really "looping": it is simply accumulating logs since circa 2014.

Begin    13-Dec-2014 04:05:41 EST
Begin    14-Dec-2014 00:20:00 EST
.................................

mcara commented 7 years ago

It seems to me that the essential code in acstools is here: https://github.com/spacetelescope/acstools/blob/master/acstools/acs_destripe_plus.py#L513 (actually from L512 to L518). Essentially it is like this:

    if cte_correct:
        acs2d.acs2d(blctmp_name)
        os.remove(blctmp_name) # <-- code fails here

It seems to me that acs2d.acs2d(blctmp_name) is not making the blctmp_name file and then remove() is trying to delete something that does not exist.

What has changed in hstcal or acs2d that it is not producing the files anymore...

mcara commented 7 years ago

@pllim acs_destripe_plus() has a clobber parameter which is False by default. The regression test is not setting clobber=True (I do not know why - ask the author of the wfc_destripe_plus regression test) and therefore acs_destripe_plus() is appending to the *.tra instead of overwriting it.

jamienoss commented 7 years ago

@mcara awesome, thanks for all the detective work. I did wonder if the trailer file was just being appended to since the stone ages rather than having something loop when it shouldn't.

So if it's failing because the *blc_temp.fits is missing this is obviously because it wasn't created and it was probably not created because it is reading in the old (gen1) PCTETAB file causing it to use the gen1 CTE correction which triggers the deprecation error. I imagine all that needs to be updated is the test to use --ctegen 1 (which doesn't exist c.f. #31) or the PCTETAB file and reference.

Why isn't the pandokia log or acstools printing the output from hstcal? I'll look into this next week.

mcara commented 7 years ago

Why isn't the pandokia log or acstools printing the output from hstcal?

Because it is printed to trailer files (*.tra)?

pllim commented 7 years ago

You need special pandokia incantations to capture screen output. I didn't bother to do that in acstools as I did for hstcal. Usually, when something is fixed on hstcal side, acstools is automatically fixed as well. But not this time.

I'm shocked that calacs silently appends to trailer files. The expected behavior is exiting with "output file already exists" error. Or is that only for image outputs? Something to investigate.

pllim commented 7 years ago

BTW, thanks for the investigative work, @mcara !

mcara commented 7 years ago

You are welcome!

I'm shocked that calacs silently appends to trailer files. The expected behavior is exiting with "output file already exists" error. Or is that only for image outputs? Something to investigate.

My feeling is that various other (WFPC2, etc.) calibration tools had treated *.tra files like astrodrizzle treats *.log: they would append new log to the existing file. This definitely needs either further investigation or expert advice of veterans such as @philhodge

mcara commented 7 years ago

@pllim Also, instead of setting clobber=True you might want to consider deleting existing *.tra files at the beginning of the test.

philhodge commented 7 years ago

On 07/24/2017 03:21 PM, Mihai Cara wrote:

My /feeling/ is that various /other/ (WFPC2, etc.) calibration tools had treated |.tra| files like |astrodrizzle| treats |.log|: they would append new log to the existing file. This definitely needs either further investigation of expert advice of veterans @philhodge https://github.com/philhodge

calcos appends to the trailer file. DMS creates the trailer file, after all; you wouldn't want a calxxx to delete what DMS has written. calstis does not explicitly write to a trailer file; it writes to stdout, and DMS appends stdout to their trailer file.

pllim commented 7 years ago

Yes. Deleting trailer file at the beginning of the test is the solution I would go with. Thanks for the advice. I'll look into that when I return to work if no one get to it first.

jamienoss commented 7 years ago

Since this test does not print the output to the pandokia log, the only log of what happened is the trailer file and therefore, I would not delete this pre or post run. Doing so would render the tests (such as this one) useless.

philhodge commented 7 years ago

How about renaming the trailer file with a name that includes a date/time stamp, at the end of processing?

mcara commented 7 years ago

Since this test does not print the output to the pandokia log, the only log of what happened is the trailer file and therefore, I would not delete this pre or post run. Doing so would render the tests (such as this one) useless.

Not true: if calibrated data files are identical to the reference files, usually we consider that test passes. Normally we do not look at log files such as astrodrizzle.log (however, if you now consider that log is important - feel free to modify the tests accordingly). I can't imagine what are you going to do with these log files? read them every day? Same goes for @philhodge 's idea of storing trailer file every day! Better let it grow to 10MB...

mcara commented 7 years ago

Of course trailer should not be deleted post-run but removing old log before creating a new one seems OK.

pllim commented 7 years ago

I agree with Mihai. It's not difficult to capture screen output onto pandokia log either. We can do both.

jamienoss commented 7 years ago

How ever we need to do it doesn't change that it needs to be done and what that is is the persistence of test output. If this becomes volatile as you want then you will have to do the very thing that you suggest sounds 'silly' - looking at the output every day because that's all you will be able to do as that will be its lifetime. If the data persists then this can be looked back at from the future, i.e. not on a daily basis.

The useful situation of a regression test is not to tell you that a test passed but that it failed. The idea of a regression test isn't just to tell you that something regressed but to tell you what and I don't just mean which exe and input file. To do this you would need all the info available to you, i.e. the logs - whatever it is that you would initially see from subsequent dev tests, otherwise, how do you know that you have reproduced the same situation? In the worse case scenario, it may not even be so easily reproduced and the actual output from the initial regressed test may be the only info we have to work with.

jamienoss commented 7 years ago

@pllim Your recent comment

It's not difficult to capture screen output onto pandokia log either.

seems to contradict an earlier one of yours (here)

You need special pandokia incantations to capture screen output. I didn't bother to do that

pllim commented 7 years ago

I don't think keeping old trailer log is necessary but you do. So I'm trying to compromise here. Anyway, I can't do anything until next week.

mcara commented 7 years ago

The useful situation of a regression test is not to tell you that a test passed but that it failed. The idea of a regression test isn't just to tell you that something regressed but to tell you what and I don't just mean which exe and input file.

Are you just saying this in general or specifically referring to the acs_destripe_plus test? I fail to see what this has to do with acs_destripe_plus. Specifically,

This test stared failing on May 10th, 2017 (see https://github.com/spacetelescope/acstools/issues/32#issuecomment-317091316) as it should. It wasn't passing when it should have failed.
"The idea of a regression test isn't just to tell you that something regressed but to tell you what and I don't just mean which exe and input file." What do you exactly mean here a test should tell us? For example, in connection with acs_destripe_plus..., the test failed and the log pointed to the line that was trying to remove a file (which did not exist). With a bit of detective work we found the reason. Were you expecting the test to point to the commit that broke the code/test?
In connection to the previous point, how does keeping all previous trailer files help with the point you are making (about usefulness of regression tests)? I fail to see the need to keep all previous trailer files, specifically:
- acs_destripe_plus is not designed to test calacs (other tests should test calacs): calacs is simply a step in acs_destripe_plus.
- What does yesterday's trailer file provides in addition to what today's file provides? Take a look at jc8o04gh2.tra and scroll back from the end of the file. Did anything (besides dates) change since the test started failing? Does it tell you anything useful about current failure?

I am not saying we should not keep the last trailer and/or other log files. But I fail to see usefulness of keeping all past log files. (To the best of my knowledge) We do not do this for any of the tests and I do not think there is a convincing reason to start doing it now.

jamienoss commented 7 years ago

@mcara Sorry, I mean a bit of both - in general, but more specifically in this case. I'm talking about us needing some sort of log of what actually happened in the test, e.g. the output from hstcal. This would be better if it were in the pandokia logs such as it is for some of the other tests (hstcal again). Since this is not currently the case for these acstools tests (wfc_destripe_plus.py & wfc_acscte.py specifically) we need to keep the trailer files as this contains that very information. For the above specific tests looking at the trailer files tells you exactly what happened for them to error - it gives you the error.

PCTECORR PERFORM
PCTEFILE /grp/hst/cdbs/jref/xa81724cj_cte.fits
Trying to open /grp/hst/cdbs/jref/xa81724cj_cte.fits...
Read in Primary header from /grp/hst/cdbs/jref/xa81724cj_cte.fits...
(pctecorr) Generation 1 PCTETAB file auto-detected.
ERROR:    (pctecorr) Gen1 algorithm detected in PCTETAB (CTE_NAME). Default is gen2, use '--ctegen 1' to override.
ERROR:    Error processing jc8o04gh2_blv_tmp.fits.
ERROR:    Error processing jc8o04gh2_blv_tmp.fits.
ERROR:    Calibration file(s) missing.

I'm not saying that we keep trailer files for all of time just until we have a better replacement. The idea is that if the tests fail but it takes anyone a few days to get around to look at them, more info other than the tests erroring (or failing) needs to be available. The trailer files provide this for these tests but won't if they get deleted. Especially since the email reminder only goes out at 1000 and the next run is 1030 that day - giving you a whopping 30mins to take a look at the trailer files before they are deleted.

Of course trailer should not be deleted post-run but removing old log before creating a new one seems OK.

Removing the old log before the next pre-run is deleting the log post run - only that you are delaying doing so until the next run. It's pretty much the same thing.

What does yesterday's trailer file provides in addition to what today's file provides? Take a look at jc8o04gh2.tra and scroll back from the end of the file. Did anything (besides dates) change since the test started failing? Does it tell you anything useful about current failure?

I think you have misunderstood the critical part of a regression test. The files don't differ from day to day when the tests run, however, they do when something goes wrong and this is the principle of the tests. Yes the tests don't change day to day after they have started to fail but they might - this is the whole point. If we knew in advance how regressions are to be introduced they would be fixed there and then and such tests would not even be needed.

(To the best of my knowledge) We do not do this for any of the tests and I do not think there is a convincing reason to start doing it now.

The convincing reason should be that, currently, these trailer files are the richest source of information relating to what happened during these test.

mcara commented 7 years ago

Look, I think it is OK to treat log files just like any other image file and keep a "reference log file" from a passing test and compare the log of each daily test to that reference file as we already do with fits images [assuming we can skip comparison of date/time stamps]. But storing multiple log files (like what you've been suggesting), is pointless.

I think you have misunderstood the critical part of a regression test. The files don't differ from day to day when the tests run, however, they do when something goes wrong and this is the principle of the tests. Yes the tests don't change day to day after they have started to fail but they might - this is the whole point.

No, I did not. However, I do not see how appending daily trailer files to a file would benefit anyone. Keeping a "good" log for reference is OK but keeping all log files (by appending them to an existing file, or renaming them by date, etc.) - is too much without any gain in information. In fact, the log file of a failing task all by itself should be revealing enough. For example, the trailer file of the failing acs_destripe_plus shows:

CALACSBEG*** ACSCTE -- Version 9.1.0 (21-Feb-2017) ***
Begin    10-May-2017 10:41:37 EDT
Input    jc8o04gh2_blv_tmp.fits
Output   jc8o04gh2_blc_tmp.fits
...............
PCTECORR PERFORM
PCTEFILE /grp/hst/cdbs/jref/xa81724cj_cte.fits
Trying to open /grp/hst/cdbs/jref/xa81724cj_cte.fits...
Read in Primary header from /grp/hst/cdbs/jref/xa81724cj_cte.fits...
(pctecorr) Generation 1 PCTETAB file auto-detected.
ERROR:    (pctecorr) Gen1 algorithm detected in PCTETAB (CTE_NAME). Default is gen2, use '--ctegen 1' to override.
ERROR:    Error processing jc8o04gh2_blv_tmp.fits.
ERROR:    Error processing jc8o04gh2_blv_tmp.fits.
ERROR:    Calibration file(s) missing.

This seems informative enough in itself.

pllim commented 7 years ago

I have fixed the test. It will now use Gen 2 CTE, as it should be. It will also only retain the log of the last run, as Mihai suggested. It is expected to fail because the "truth" is now outdated. I will open a separate issue when I get the result tomorrow. Thank you for all your inputs!

jamienoss commented 7 years ago

@pllim I had already done this by swapping the reference files over - check the SVN log. Only leaving the new file to be created and then ok'd.

spacetelescope / acstools

acs_destripe_plus with mask loops too many times for no reason #32