sot / starcheck

BSD 3-Clause "New" or "Revised" License
3 stars 0 forks source link

Replace Inline::Python with a light Python function server #402

Closed taldcroft closed 1 year ago

taldcroft commented 1 year ago

Description

Driven by problems getting Inline::Python to work with ska3-prime, this PR replaces all the inline Python calls with a call to a new lightweight server process that runs Python function calls. This is basically the idea that @javierggt for an API server, but it users an even lighter framework (than e.g. flask) for performance. The server gets called at least thousands of times currently, though there are probably opportunities to reduce that with some code changes.

Current testing status is that it runs to completion on the DEC1922 loads and produces output that looks about right.

Process management for the Python function server (starcheck/server.py) took a little time to work out but it seems to be working, more or less. There might still be room for improvement.

The new server includes reasonable exception handling.

To do:

Interface impacts

Testing

Unit tests

Functional tests

Output comparison

Ran starcheck for DEC1922A loads in this branch and in master:

$ ./sandbox_starcheck -dir $SKA/data/mpcrit1/mplogs/2022/DEC1922/oflsa -run_start_time=2022:352:00:00:00 -out out_{pyserver,master}

Then:

ska3) ➜  starcheck git:(no-inline-python) diff out_pyserver.txt out_master.txt
1,2c1,2
<  ------------  Starcheck 13.17.1.dev21+gd6087bd.d20230101    -----------------
<  Run on Sun Jan  1 06:45:26 EST 2023 by aldcroft from daze
---
>  ------------  Starcheck 13.17.1.dev9+g6057b15    -----------------
>  Run on Sat Dec 31 10:35:06 EST 2022 by aldcroft from daze

On ska3-flight HEAD linux

Functional and regression tests

Ran these in https://icxc.cfa.harvard.edu/aspect/test_review_outputs/starcheck-pr402/

# Long week and IR Zone holds
starcheck -dir /data/mpcrit1/mplogs/2022/DEC2622/oflsa -out dec2622a_flight
/home/jeanconn/git/starcheck_noinline/sandbox_starcheck -dir /data/mpcrit1/mplogs/2022/DEC2622/oflsa -out dec2622a_test
/proj/sot/ska/bin/diff2html dec2622a_flight.txt dec2622a_test.txt > dec2622a_diff.html

# Monitor star
starcheck -dir /data/mpcrit1/mplogs/2022/NOV2822/oflsa -out nov2822a_flight
/home/jeanconn/git/starcheck_noinline/sandbox_starcheck -dir /data/mpcrit1/mplogs/2022/NOV2822/oflsa -out nov2822a_test
/proj/sot/ska/bin/diff2html nov2822a_flight.txt nov2822a_test.txt > nov2822a_diff.html

# Maneuver only loads
starcheck -dir /data/mpcrit1/mplogs/2022/OCT2422/oflsb -out oct2422b_flight
/home/jeanconn/git/starcheck_noinline/sandbox_starcheck -dir /data/mpcrit1/mplogs/2022/OCT2422/oflsb -out oct2422b_test
/proj/sot/ska/bin/diff2html oct2422b_flight.txt oct2422b_test.txt > oct2422b_diff.html

# Replan
starcheck -dir /data/mpcrit1/mplogs/2021/MAY0521/oflsa -out may0521a_flight
/home/jeanconn/git/starcheck_noinline/sandbox_starcheck -dir /data/mpcrit1/mplogs/2021/MAY0521/oflsa -out may0521a_test
/proj/sot/ska/bin/diff2html may0521a_flight.txt may0521a_test.txt > may0521a_diff.html

To functionally test the change in the bad pixel check, I introduced a bad pixel near a guide star and compared outputs vs master.

https://icxc.cfa.harvard.edu/aspect/test_review_outputs/starcheck-pr402/badpixcheck/jan0923_master_badpix.html#obsid25550

https://icxc.cfa.harvard.edu/aspect/test_review_outputs/starcheck-pr402/badpixcheck/jan0923_noinline_badpix.html#obsid25550

Check is still functional.

Run of full regression test set revealed latent issue with #389 fixed in 5be93b4. I compared this PR code to master + that same fix and the regression outputs show only two tiny, understood, and acceptable changes related to the PR update to the bad pixel test.

********************* 2005/JUL1105/oflsc ********************
********************* 2005/AUG2705/oflsb ********************
********************* 2005/NOV0705/oflsb ********************
********************* 2005/MAR0705/oflsb ********************
********************* 2006/MAR0606/oflsc ********************
********************* 2006/NOV1306/oflsa ********************
********************* 2006/AUG0706/oflsb ********************
********************* 2006/DEC2506/oflsc ********************
********************* 2006/MAR0606/oflsc ********************
********************* 2006/NOV2006/oflsb ********************
********************* 2007/MAR0507/oflsa ********************
********************* 2007/AUG0607/oflsa ********************
********************* 2007/AUG1407/oflsa ********************
********************* 2007/DEC1007/oflsb ********************
********************* 2007/MAR0507/oflsa ********************
********************* 2007/SEP0307/oflsa ********************
********************* 2008/JUL0708/oflsb ********************
********************* 2008/MAY0508/oflsa ********************
********************* 2008/AUG1808/oflsa ********************
********************* 2008/SEP0108/oflsb ********************
********************* 2008/SEP2908/oflsb ********************
********************* 2009/APR2009/oflsc ********************
********************* 2009/FEB1609/oflsa ********************
********************* 2009/FEB2309/oflsc ********************
********************* 2009/JUL0609/oflsb ********************
********************* 2009/JUN2209/oflsb ********************
********************* 2009/NOV3009/oflsb ********************
--- /home/jeanconn/git/starcheck_noinline/test_regress/release/2009/NOV3009/oflsb/starcheck.txt 2023-01-05 17:46:27.310431000 -0500
+++ /home/jeanconn/git/starcheck_noinline/test_regress/fido.cfa.harvard.edu_48d754734f76c6d488de6d18f96188ba4380c1e4/2009/NOV3009/oflsb/starcheck.txt   2023-01-05 23:10:38.313098000 -0500
@@ -81,7 +81,7 @@
 OBSID = 11935 at 2009:339:14:45:41.729   7.5 ACQ | 5.0 GUI | Critical:11 Caution:16
 OBSID = 11936 at 2009:339:15:53:45.926   7.6 ACQ | 5.0 GUI | Critical:10 Caution: 9
 OBSID = 11937 at 2009:339:16:35:42.174   7.6 ACQ | 5.0 GUI | Critical: 9 Caution: 8
-OBSID = 11938 at 2009:339:17:00:52.174   6.5 ACQ | 5.0 GUI | Critical:11 Caution:12
+OBSID = 11938 at 2009:339:17:00:52.174   6.5 ACQ | 5.0 GUI | Critical:10 Caution:12
 OBSID = 11939 at 2009:339:17:26:02.174   7.7 ACQ | 5.0 GUI | Critical:10 Caution: 9
 OBSID = 12037 at 2009:339:17:51:13.756   7.3 ACQ | 5.0 GUI | Critical:11 Caution:11
          ------  2009:340:03:45:00.000   OBC Load Segment Begins     CL340:0304 
@@ -1905,7 +1905,6 @@
 >> CRITICAL: [ 4] Readout Size. 6x6 Should be 8x8
 >> CRITICAL: [ 5] Readout Size. 6x6 Should be 8x8
 >> CRITICAL: [ 6] Readout Size. 6x6 Should be 8x8
->> CRITICAL: [ 7] Nearby ACA bad pixel.  row, col (-319, -299), dy, dz (5, 44) 
 >> CRITICAL: [ 7] Readout Size. 6x6 Should be 8x8
 >> CRITICAL: [ 8] Readout Size. 6x6 Should be 8x8
 >> CRITICAL: [ 9] Readout Size. 6x6 Should be 8x8
********************* 2009/OCT0509/oflsa ********************
********************* 2009/DEC2109/oflsb ********************
********************* 2010/APR1110/oflsb ********************
********************* 2010/APR1210/oflsa ********************
********************* 2010/JAN1110/oflsa ********************
********************* 2010/JUL0510/oflsb ********************
********************* 2010/OCT1110/oflsb ********************
********************* 2010/OCT2510/oflsb ********************
********************* 2011/JAN1711/oflsa ********************
********************* 2011/MAR1411/oflsa ********************
********************* 2011/APR0411/oflsa ********************
********************* 2011/DEC1211/oflsa ********************
********************* 2012/JAN3012/oflsa ********************
********************* 2013/JUL2913/oflsa ********************
********************* 2014/JAN2514/oflsa ********************
********************* 2015/JAN1215/oflsb ********************
--- /home/jeanconn/git/starcheck_noinline/test_regress/release/2015/JAN1215/oflsb/starcheck.txt 2023-01-05 18:45:24.075190000 -0500
+++ /home/jeanconn/git/starcheck_noinline/test_regress/fido.cfa.harvard.edu_48d754734f76c6d488de6d18f96188ba4380c1e4/2015/JAN1215/oflsb/starcheck.txt   2023-01-05 23:52:48.696908000 -0500
@@ -1063,7 +1063,7 @@
 [ 7]  6   185210664   BOT  8x8   0.955   9.205  10.703   1782    555  20   1  120          
 [ 8]  7   185339360   BOT  8x8   0.973   8.591  10.094   -407  -2098  20   1  120          

->> CRITICAL: [ 6] Nearby ACA bad pixel.  row, col (-318, -296), dy, dz (2, 25) 
+>> CRITICAL: [ 6] Nearby ACA bad pixel.  row, col (-317, -296), dy, dz (2, 25) 
 >> CAUTION : [1] Guide sum mag diff from agasc mag   0.08276
 >> CAUTION : [2] Guide sum mag diff from agasc mag   0.28450
 >> CAUTION : [3] Guide sum mag diff from agasc mag   0.02349
********************* 2015/DEC1115/oflsa ********************
********************* 2015/DEC1115/oflsb ********************

Performance

This branch runs about 50% slower. For DEC1922A this means about 90 seconds in this branch vs. 60 seconds in master.

jeanconn commented 1 year ago

Great and quick idea. It seems to me that the concerns are speed and security. Speed seems reasonable. For security, I like that it "only allow functions in the public API of starcheck module". Do we also need to either limit access to the server with key or obscurity? And/or run the Python in a sandbox or container, or as unprivileged user?

taldcroft commented 1 year ago

Maybe I'm being naive, but I think an attack would need this:

It also just occurred to me that the starcheck script could generate a random key and pass that to the server on creation and then require that key for commands, so that might pretty much make this lock tight.

jeanconn commented 1 year ago

Yeah, that's what I meant by "key" as the first choice. It wouldn't need to be SSL (it would just be a startup generated "password") . And you'd add obscurity by using an available port instead of the hardcoded one (which you mention as the third box in the todo items anyway).

jeanconn commented 1 year ago

I also wasn't sure about IPC vs TCP for this kind of app.

jeanconn commented 1 year ago

And despite my "It wouldn't need to be SSL" comment, I don't know how much slower this would be if it used standard key security via SSL/TLS.

taldcroft commented 1 year ago

OK, now the communication between client and server is secured by a shared random key that generated by starcheck.pl at the beginning. I think this addresses any credible security issues.

Also the port is now randomly selected by the OS.

taldcroft commented 1 year ago

@jeanconn - apart from details like logging, this is ready for real review. In order to improve performance I made some code changes that you need to review carefully.

Along the way I reformatted sections of code that I was touching since I just couldn't see the logic otherwise. It might just be worth running perltidy at some point.

jeanconn commented 1 year ago

I think this is really great, but common usage for me with starcheck has been to "ctrl-c" all the time when I realize I've run with the wrong options or whatever. I just had 8 of these python servers going. So it seems like the python server needs to check regularly if the parent process is alive and also have some kind of reasonable timeout. And/or the perl could handle this signal and do the cleanup?

jeanconn commented 1 year ago

(well, and also tell the user not to ctrl-c but it should be robust to that).

taldcroft commented 1 year ago

I spent time playing around this morning as well. My first realization is that using this fork trick is not apparently required. There was a point in the development when I had to do that, but I think the implementation has changed in the meantime.

With 79c351f, the processes all seem to get cleaned up in good order. This included running to completion, Ctrl-C, and kill -9 of the sandbox.

@jeanconn - can you try checking out 2502469 and see if you can manage to start up starcheck and get orphaned child processes hanging around? I could not (on MacOS of course).

jeanconn commented 1 year ago

So I guess you didn't see my comments before starting your work. I'd hoped to save time but. Your current implementation matches what I had except I'd moved to using signal alarm instead of trying to figure out how to actually have handle_timeout do something.

taldcroft commented 1 year ago

Your comments did not indicate having solved the problem.

jeanconn commented 1 year ago

It solved the problem but removed the extra perl fork. I was trying to figure out why the extra perl fork was there but that seems lost to history.

taldcroft commented 1 year ago

Bummer, guess I wasted my entire morning. Your comments were off track, the suggestion would not have worked (applied literally), and you didn't push any commits or say that you had something which solved the problem.

jeanconn commented 1 year ago

My comments were asking why there was a perl fork before I removed it.

jeanconn commented 1 year ago

Sorry about the suggestions. I have not tried those before so I cut and paste I guess that's not how it works.

jeanconn commented 1 year ago

I also thought ball was in my court so had just reached out trying to make sure I didn't go down a path (removing perl fork) that had different problems.

taldcroft commented 1 year ago

Well anyway the commits I have pushed should work and finish this up. Actually I think the first commit alone is enough.

jeanconn commented 1 year ago

Oh well. My lesson learned is just to push the commits even if possibly controversial. And it was a little bit of a tricky problem so the independent solutions arriving at the same place is a good sign.

taldcroft commented 1 year ago

Update to what I said. When I start the sandbox you have processes like so:

 7848 ttys006    0:00.00 /bin/bash ./sandbox_starcheck -dir /Users/aldcroft/ska/data/mpcrit1/mplogs/2022/DEC1922/
 7854 ttys006    0:00.09 perl -I ./starcheck/src/lib ./starcheck/src/starcheck.pl -dir /Users/aldcroft/ska/data/m
 7855 ttys006    0:04.31 python -m starcheck.server

If I kill the perl (7854) then that leaves an orphaned python -m starcheck.server. If I will the wrapper 7848 then that cleans up the python.

taldcroft commented 1 year ago

Well at least say out loud that you have something which is working. Your questions and suggestion gave the impression you were still lost.

jeanconn commented 1 year ago

I also will try more github suggestion practice. I'd highlighted the whole code block to replace but apparently that didn't take.

taldcroft commented 1 year ago

And with the server timeout commit 11fe68c then the Python cleans up even with a kill -9 of the other perl processes. I had the timeout set short 2 seconds for testing but changed it to 15 seconds for this test.

jeanconn commented 1 year ago

But yes, my conclusion was that if the extra perl fork was not needed by something else then removing it had the two advantages of letting primary process kill the python server in "nominal operations" and having everything apparently die on ctrl-c.

taldcroft commented 1 year ago

Anyway you can do more testing and start down the road to making this a production PR. Push early and often. :smile: I'll step away.

jeanconn commented 1 year ago

With the handle_request / handle_timeout timeout strategy (so the timeout applies after N seconds of no requests to the server) is the 2sec timeout basically fine for production? Or should this be something like 2 minutes?

taldcroft commented 1 year ago

I think the timeout should be the latter, at least 2 minutes. Basically we don't want any chance it can time out for normal operation.

jeanconn commented 1 year ago

Sounds good. I couldn't think of anything that starcheck would do at this point that wouldn't need Python within a couple of seconds, but it is best to just set the timeout in minutes instead of doing a full assessment of the "time between python calls".

jeanconn commented 1 year ago

For the action item of " Proper logging not print statements" it is hard to know what makes sense. The Perl side is a mess of print statements. The new python code could have better logging, but would we really want to use utils.config_logging for this? One can't use it to start logging the server startup, for example, because one needs to start up the Python server to do things like configure a logger. I'm not sure how much value it would add here.

taldcroft commented 1 year ago

About logging, just whatever fits with the existing standards for starcheck is fine. I didn't really look at what the rest of the Perl code is doing. For the Python code you could just make a logger with ska_helpers.logger.basic_logger, with a default log level of INFO and then all logger.debug(..) statements. The main point is have code in there to make it easy to debug in a dev environment if something goes wrong. In theory you could have the main command line logging level control this but probably not worth it initially.

javierggt commented 1 year ago

I am not going into the details of this PR, just following the comments, and I wanted to say that I just used it with this weeks loads without trouble. And by that I mean:

jeanconn commented 1 year ago

My testing is not done, but I've updated the PR with some linux functional and regression test notes.

taldcroft commented 1 year ago

The testing looks quite thorough, thanks!

taldcroft commented 1 year ago

About the decode's, good catch. Did you search for any outstanding decode() calls?

Was that code actually being run? I would assume that it would fail with an exception always if it ran.

jeanconn commented 1 year ago

Yes, I searched for other decodes and found none. The characteristics date code with the decode that was hanging around is only called for some historical schedules (so I fixed it so more regression weeks would run to completion). We could perhaps cut it at this point and define a will-run-only-on-schedules-after date for starcheck and document. But I figured not for this release.

jeanconn commented 1 year ago

As noted above I just found a tiny defect in a change of mine, but I still think this is fine to go. I just don't want to merge as some of this is my code with only (recorded) self-review.