Closed activeeon-bot closed 8 years ago
Original comment posted by Youri Bonnaffe on 04, Mar 2015 at 10:06 AM
In the Scheduler, when a task terminates, if an error occurs in org.ow2.proactive.scheduler.core.TerminationData#handleTermination (for instance, a DB call fails), the node might not be released.
This seems related to a DB issue; it might be better now that the DB handles concurrent requests.
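To illustrate the failure mode described in this comment, here is a minimal sketch of one way to make sure a node is released even when the DB call made during termination fails. It is not the actual TerminationData code: NodePool, TaskStore, and this handleTermination signature are hypothetical stand-ins invented for the example.

```java
import java.util.HashSet;
import java.util.Set;

public class TerminationSketch {

    /** Hypothetical stand-in for the node pool; not the ProActive API. */
    static class NodePool {
        private final Set<String> busy = new HashSet<>();
        void acquire(String node) { busy.add(node); }
        void release(String node) { busy.remove(node); }
        boolean isBusy(String node) { return busy.contains(node); }
    }

    /** Hypothetical stand-in for the DB update done when a task terminates. */
    interface TaskStore {
        void markTerminated(String taskId) throws Exception;
    }

    /**
     * Persists the termination, then releases the node in a finally block,
     * so a failing DB call cannot leave the node busy.
     * Returns true if the DB update succeeded.
     */
    static boolean handleTermination(NodePool pool, TaskStore store, String taskId, String node) {
        try {
            store.markTerminated(taskId);
            return true;
        } catch (Exception e) {
            // The DB call failed: report it, the finally block still runs.
            System.err.println("Could not persist termination of " + taskId + ": " + e.getMessage());
            return false;
        } finally {
            pool.release(node);
        }
    }

    public static void main(String[] args) {
        NodePool pool = new NodePool();
        pool.acquire("node-1");
        // Simulate a DB failure during termination.
        TaskStore failing = taskId -> { throw new Exception("simulated DB lock timeout"); };
        boolean persisted = handleTermination(pool, failing, "task-42", "node-1");
        System.out.println("persisted=" + persisted + " busy=" + pool.isBusy("node-1"));
        // prints "persisted=false busy=false"
    }
}
```

The point of the sketch is only the ordering: the release does not depend on the DB update succeeding. Whether the real fix should release the node first, retry the DB call, or do both is a separate design question.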
Original issue created by Youri Bonnaffe on 03, Sep 2014 at 10:28 AM - SCHEDULING-2143
Observed during stress tests and while fixing SCHEDULING-2106. The Scheduler is using PNP, on Linux, with around 30 nodes (Windows & Linux).
I ran a few jobs (replicate native tasks x 10, simple script task), submitting thousands of them, one every 200 to 500 ms.
Some of the nodes stay busy forever even though there are no running tasks anymore.
I suspect that some locking happened in the DB, freezing the Scheduler for a while, and that the nodes were not released properly. The locking is expected and is handled in our code (but maybe it should not freeze for that long).
From the logs, it looks like it happened when tasks produced "Task start timeout for task" and were restarted somewhere else.
I checked with the REST API and the native CLI; in both cases the nodes have been busy for a long time.