nambar12 / matlab_netbatch_scheduler

Concerning the issue posted earlier about asking for 140 workers but getting 143... #6

Status: Closed (pasafier closed this issue 2 years ago)

pasafier commented 2 years ago

As I mentioned, MATLAB appeared to hang while starting the parpool. All the while, Netbatch showed 144 machines in use.

Pressing Ctrl-C killed the parpool initialization, as expected, and the errors then reported "Failed to start parpool". However, the machines still showed as allocated in Netbatch. Is the pool allocated or not? I then tried to run a script, and errors popped up indicating that the parpool was not working properly:

Simple_Parallel_Code
Running parallel
Starting parallel pool (parpool) using the 'netbatch' profile ...
Warning: Failed to cancel the following jobs on the cluster:
Job ID: 38 Reason: nbjob: Service on host: 8 is not responding.

In cancelJobFcn (line 58)
In deleteJobFcn (line 9)
In parallel.cluster/Generic/deleteJobOrTask (line 708)
In parallel.cluster/Generic/hDestroyJob (line 485)
In parallel.internal.cluster/CJSJobMethods/destroyOneJob (line 71)
In parallel.job.CJSCommunicatingJob>@(job)CJSJobMethods.destroyOneJob(job.Parent,job,job.Support,job.SupportID) (line 100)
In parallel.job/CJSCommunicatingJob/destroyJob (line 100)
In parallel.Job>iDeleteJobs (line 1538)
In parallel.internal.cluster.hetfun (line 57)
In parallel/Job/delete (line 1335)
In parallel/Cluster/hDeleteOneJob (line 1023)
In parallel.internal.pool.AbstractInteractiveClient>iDeleteJobs (line 505)
In parallel.internal.pool/AbstractInteractiveClient/pStopLabsAndDisconnect (line 289)
In parallel.internal.pool.AbstractInteractiveClient>iCleanupIfStartupFailed (line 575)
In parallel.internal.pool.AbstractInteractiveClient>@()iCleanupIfStartupFailed(obj) (line 96)
In parallel.internal.general/DisarmableOncleanup/delete (line 25)
In parallel.internal.pool/AbstractInteractiveClient/start (line 77)
In parallel.internal.pool.AbstractClusterPool>iStartClient (line 816)
In parallel.internal.pool/AbstractClusterPool/hBuildPool (line 582)
In parallel.internal.pool.doParpool (line 22)
In parpool (line 128)
In parallel.internal.pool/PoolArrayManager/getOrAutoCreateWithCleanup (line 58)
In pctTryCreatePoolIfNecessary (line 28)
In parallel_function (line 418)
In Simple_Parallel_Code (line 15)

Hovering the mouse over the parpool icon at the bottom left of the MATLAB window shows "Failed to start the parallel pool".

HOWEVER, it seems that the script may have actually run. This needs to be verified. So, is it possible for the pool to be started and functioning while MATLAB thinks that it is not?
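One way this could be checked from the MATLAB side is sketched below. It is only a rough sketch: it assumes the 'netbatch' cluster profile named in the output above is available in the session, and it reports only what the MATLAB client and its cluster object believe. Whether the machines are still allocated would have to be confirmed with Netbatch's own tools.

```matlab
% Ask the client whether it currently holds a pool, without starting a new one.
pool = gcp('nocreate');
if isempty(pool)
    disp('Client reports no parallel pool.');
else
    fprintf('Client reports a pool with %d workers (Connected = %d).\n', ...
        pool.NumWorkers, pool.Connected);
end

% List the jobs the cluster object still tracks, with their states.
% Assumption: the profile is called 'netbatch', as in the output above.
c = parcluster('netbatch');
jobs = c.Jobs;
for k = 1:numel(jobs)
    fprintf('Job %d: %s\n', jobs(k).ID, jobs(k).State);
end
```

If the client reports no pool while a job such as Job 38 is still listed as running or queued, that would be consistent with the cancel step having failed and left the workers allocated.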

nambar12 commented 2 years ago

It seems the cancel command is not working properly. We're working on a fix.

ravindraiitg commented 2 years ago

You are right, Noam. I have fixed the cancelJob command issue, so this should be fixed as well.