Closed meliache closed 2 years ago
I've also seen this at times. Could it also be caused by an expired AFS token?
@nrad That's a possibility, and something you should be able to check. I don't know a command for it by heart (though there should be one), but I always see garbled file permissions when my AFS token has expired, i.e. a lot of `???` when I do `ls -l ~`. I had asked Alina to check that after her crash and she saw nothing unusual.
I have now taken the time to google the error message `SECMAN:2007:Failed to end classad message.` and found several mailing list posts about it, so I assume it is an issue with HTCondor, in particular with `condor_q` not responding for a while. See Re: [HTCondor-users] SECMAN:2007:Failed to end classad message.:
> [...] the `job_queue.log` file holds the transaction log for the schedd, and it is read by the schedd on startup in order to recover the state of the jobs. While the schedd is reading the transaction log on startup, it will not respond to `condor_q` queries. [...]
I think this is positive, since it means a "wait and try" fix should work.
Jobs on HTCondor sometimes fail because `condor_q` occasionally returns an error code. A colleague reported the stack trace below to me, and I had also once seen the error myself when running with over 1000 workers. I don't know the cause yet. When I saw it, my guess was that with so many workers HTCondor couldn't handle the sheer number of requests, but this is just a guess. @welschma, have you ever seen this, and do you have a suggestion as to the cause?
Even if it is not our error, I think we could handle it more gracefully, for example by giving a better error message or by retrying to obtain the job status a couple of times until some maximum number of retries or a timeout is reached (I did something similar for gbasf2 in #108). But for that, knowing the cause would be useful.
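To make the retry idea a bit more concrete, here is a rough sketch of what I have in mind. It is not the existing implementation: the function name `check_output_with_retries` and the parameters `max_retries`, `wait_seconds` and `sleep` are made up for illustration, and it retries on any non-zero exit code rather than only on the SECMAN error.

```python
import subprocess
import time


def check_output_with_retries(cmd, max_retries=3, wait_seconds=10.0, sleep=time.sleep):
    """Run ``cmd`` and return its stdout, retrying if it exits with a non-zero code.

    Hypothetical helper: the name and all parameters are assumptions for this sketch.
    ``sleep`` is injectable so that the waiting can be replaced in tests.
    """
    last_error = None
    # One initial attempt plus up to ``max_retries`` retries.
    for attempt in range(1 + max_retries):
        try:
            return subprocess.check_output(cmd, stderr=subprocess.PIPE, text=True)
        except subprocess.CalledProcessError as error:
            last_error = error
            # Only wait if another attempt is still going to follow.
            if attempt < max_retries:
                sleep(wait_seconds)
    # All attempts failed: raise with a clearer message, keeping the original error as cause.
    raise RuntimeError(
        f"{cmd!r} still failed after {max_retries} retries: {last_error.stderr}"
    ) from last_error
```

The status query could then go through something like `check_output_with_retries(["condor_q"], max_retries=3, wait_seconds=10)`, with the exact `condor_q` arguments left as they are now. A fixed wait keeps it simple. If the schedd really is busy reading its transaction log, a longer or growing wait might be more appropriate, but that depends on the cause.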