Closed maxwnewcomer closed 3 years ago
Hey @maxwnewcomer - Im no expert with tracking and killing subprocesses (and children thereof) but it looks there are some options for this situation:
https://stackoverflow.com/questions/2638909/killing-a-subprocess-including-its-children-from-python
Ive done something similar in c++ using the standard signal kill() command by passing the pid as negative. So I think if we sort out how to get python to under-the-hood make the same call to the c runtime kill() command, that might be the trick?
Here is how I did it in c++ to deal with linux/macos and windows differences:
Yeah, I think that might be the solution, the only issue is catching the timeout error. Maybe in objective.py instead of
@timeout(timelim)
def runModel(pathtonwt, initnwt):
global cwd
global namefile
copyfile(pathtonwt, os.path.join(cwd, initnwt))
print('[INFO] Starting run.sh out of ', cwd)
call(['./run.sh'], cwd = cwd, shell = False)
return True`
it could be
@timeout(timelim)
def runModel(pathtonwt, initnwt):
global cwd
global namefile
copyfile(pathtonwt, os.path.join(cwd, initnwt))
print('[INFO] Starting run.sh out of ', cwd)
try:
p = call(['./run.sh'], cwd = cwd, shell = False)
except TimeoutError:
p.kill() # or something with similar functionality
return True
I'll give it a shot
Im not expert here but I think the easiest path would be to call modflow directly with subprocess.Popen()
(not thru run.sh
) which should give you an asynchronous handle to the modflow pid. Then in a while loop, you, sleep for bit, check that the subprocess return, or once the timeout threshold is reached, break out of the loop and call os.kill()
on that pid. I think the issue is that run.sh
is spawning modflow as a child subprocess and python is struggling to recognize that...
@jtwhite79 Apparently, the issue with the timeout could be fixed with changing the subprocess.call() function to a subprocess.check_output(), which as of python 3.3 has built in timeout capabilities. So, the timeout is now currently working in my test branch, but still having issues with the run.sh file. Sometimes it works fine and MODFLOW is started and ran to completion. Sometimes, almost arbitrarily, it throws a 127 error code when [./run.sh] is ran from the check_output() function. 127 stands for command not found at it's path. So kinda confused, but I'll be whipping up some local test cases tonight to try to iron it out. Once that's done I'll put a merge in and it should be pretty good to go. I just need to focus on the master script and some better documentation.
Also using the timeout through check_output reduces the wrapt_timeout_decorator dependency, so I'll start an issue for that to make sure I get to removing that from the requirements files.
great news Max, sounds like you are pretty close!
HWR
From: Max Newcomer @.> Sent: Wednesday, September 15, 2021 5:01 PM To: maxwnewcomer/NWTOPT @.> Cc: Subscribed @.***> Subject: [EXTERNAL] Re: [maxwnewcomer/NWTOPT] timeout isn't working as intended (#6)
This email has been received from outside of DOI - Use caution before clicking on links, opening attachments, or responding.
@jtwhite79https://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fjtwhite79&data=04%7C01%7Chwreeves%40usgs.gov%7C937e275f923e4ab6d2ae08d9788c022f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637673365006628531%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=gEecKT0iqmv9%2Bwk69jSV3EKbTk5sYnOO9wAJmVaXoWU%3D&reserved=0 Apparently, the issue with the timeout could be fixed with changing the subprocess.call() function to a subprocess.check_output(), which as of python 3.3 has built in timeout capabilities. So, the timeout is now currently working in my test branch, but still having issues with the run.sh file. Sometimes it works fine and MODFLOW is started and ran to completion. Sometimes, almost arbitrarily, it throws a 127 error code when [./run.sh] is ran from the check_output() function. 127 stands for command not found at it's path. So kinda confused, but I'll be whipping up some local test cases tonight to try to iron it out. Once that's done I'll put a merge in and it should be pretty good to go. I just need to focus on the master script and some better documentation.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmaxwnewcomer%2FNWTOPT%2Fissues%2F6%23issuecomment-920372553&data=04%7C01%7Chwreeves%40usgs.gov%7C937e275f923e4ab6d2ae08d9788c022f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637673365006638487%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=jRkRJp18AjK3H1AF3R7F3vUdC1PkDPDzSDWwOob8D60%3D&reserved=0, or unsubscribehttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABMHEL6RPBHJPUK4TDJICH3UCECSJANCNFSM5DVLGAYQ&data=04%7C01%7Chwreeves%40usgs.gov%7C937e275f923e4ab6d2ae08d9788c022f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637673365006638487%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=BqJ%2BpUAP%2BqXAvbw3b0alY%2F3P40ZqmwJ9RBzZGxQpYsI%3D&reserved=0. Triage notifications on the go with GitHub Mobile for iOShttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapps.apple.com%2Fapp%2Fapple-store%2Fid1477376905%3Fct%3Dnotification-email%26mt%3D8%26pt%3D524675&data=04%7C01%7Chwreeves%40usgs.gov%7C937e275f923e4ab6d2ae08d9788c022f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637673365006638487%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=st6m%2BwwWWO1l2QBMN3Iym%2BHrR4pVZkxOGU6HJ2Gnyyw%3D&reserved=0 or Androidhttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fplay.google.com%2Fstore%2Fapps%2Fdetails%3Fid%3Dcom.github.android%26referrer%3Dutm_campaign%253Dnotification-email%2526utm_medium%253Demail%2526utm_source%253Dgithub&data=04%7C01%7Chwreeves%40usgs.gov%7C937e275f923e4ab6d2ae08d9788c022f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637673365006648445%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Yp7zHjqoiTiXiOm8BHWMdJ6rYrRXWBd3mAbSZkw3exE%3D&reserved=0.
Ok, change of plan. Turns out check_output is kind of weird. So, the new, and currently working, solution involves a threading.Timer() that sends a SIGKILL to the modflow process started by subprocess.Popen(). Seems to work. Will push changes once I get a little more testing done tonight to make sure everything is good.
Ok, just an update, timeout isn't working as intended, but the fix has been thought out. Essentially the initial fix forgot to turn off the timer once the Popen process was completed, resulting in all runs defaulting to timeout. With a simple process.poll() loop we can negate this unintended effect and finally it should be working. Good news, with new method a "no timeout" run works as intended. Going to try to implement tonight. Sorry for the long fix.
Nice! sounds great Max - looking forward to giving it a shot.
HWR
Howard W. Reeves Acting Chief Science Officer, Groundwater and Watershed Modeling Team Lead USGS Upper Midwest Water Science Center 5840 Enterprise Drive Lansing, Michigan 48911 517-887-8914
From: Max Newcomer @.> Sent: Monday, September 20, 2021 4:55 PM To: maxwnewcomer/NWTOPT @.> Cc: Reeves, Howard W @.>; Comment @.> Subject: [EXTERNAL] Re: [maxwnewcomer/NWTOPT] timeout isn't working as intended (#6)
This email has been received from outside of DOI - Use caution before clicking on links, opening attachments, or responding.
Ok, just an update, timeout isn't working as intended, but the fix has been thought out. Essentially the initial fix forgot to turn off the timer once the Popen process was completed, resulting in all runs defaulting to timeout. With a simple process.poll() loop we can negate this unintended effect and finally it should be working. Good news, with new method a "no timeout" run works as intended. Going to try to implement tonight. Sorry for the long fix.
— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmaxwnewcomer%2FNWTOPT%2Fissues%2F6%23issuecomment-923296524&data=04%7C01%7Chwreeves%40usgs.gov%7C31d99f93c87e4bc1571d08d97c791b9f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637677681869820704%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=nsCGkHK7BIgw1mirXiPXpduENyO51oaFAdLIBRrRTJk%3D&reserved=0, or unsubscribehttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABMHELZHCWBPQN7AOHXZW6LUC6NVTANCNFSM5DVLGAYQ&data=04%7C01%7Chwreeves%40usgs.gov%7C31d99f93c87e4bc1571d08d97c791b9f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637677681869830659%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=DGUWnu0KJMbH604Gm8IgFdcR%2BEQrOHTYZxuP1lLQrSk%3D&reserved=0. Triage notifications on the go with GitHub Mobile for iOShttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapps.apple.com%2Fapp%2Fapple-store%2Fid1477376905%3Fct%3Dnotification-email%26mt%3D8%26pt%3D524675&data=04%7C01%7Chwreeves%40usgs.gov%7C31d99f93c87e4bc1571d08d97c791b9f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637677681869830659%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=tPY46c%2BDmw%2FTiVeWJJaZeoo%2Fjct6EPJkPz%2Fz5KWQoNU%3D&reserved=0 or Androidhttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fplay.google.com%2Fstore%2Fapps%2Fdetails%3Fid%3Dcom.github.android%26referrer%3Dutm_campaign%253Dnotification-email%2526utm_medium%253Demail%2526utm_source%253Dgithub&data=04%7C01%7Chwreeves%40usgs.gov%7C31d99f93c87e4bc1571d08d97c791b9f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637677681869830659%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=BvxiIZvN6Q4J0Grd86eMnbT%2Bq15Kcrukx5A%2F%2FP%2BOa%2Bw%3D&reserved=0.
@maxwnewcomer nice job working problem - enjoy the process!
Currently the timeout functionality isn't working and is actually actively ruining runs, possible relation to issue #4 with reading in the run.sh file, possibly not