Nodes can stay busy when the Scheduler processes a lot of jobs

activeeon-bot commented 10 years ago

Original issue created by Youri Bonnaffe on 03, Sep 2014 at 10:28 AM - SCHEDULING-2143

Observed during stress tests and while fixing SCHEDULING-2106. The Scheduler is using PNP, on Linux, with around 30 nodes (Windows & Linux).

I ran a few kinds of jobs (replicated native tasks x 10, a simple script task), submitting thousands of them, one every 200/500 ms.

Some of the nodes stay busy forever even though there are no running tasks anymore.

I suspect that some locking happened in the DB, freezing the Scheduler for a while, and that the nodes were not released properly. The locking is expected and is handled in our code (but maybe it should not freeze for that long).

From the logs, it looks like it happened when tasks produced a "Task start timeout for task" message and were restarted somewhere else.

I checked with both the REST API and the native CLI; in both cases the nodes have been busy for a long time.

> listnodes( )
     SOURCE NAME   HOSTNAME                  STATE   SINCE                          URL                                                   PROVIDER   USED BY    

     Default       192.168.1.163             Busy    9/2/14 5:13 PM (17h5mn ago)    pnp://192.168.1.163:2464/grimm_3060                   rm         scheduler  
     Default       buddy.activeeon.com       Free    9/3/14 10:06 AM (12mn ago)     pnp://buddy.activeeon.com:58555/opnestack_3258        rm                    
     Default       chocolate.activeeon.com   Free    9/3/14 10:06 AM (12mn ago)     pnp://chocolate.activeeon.com:42081/chocolate_12343   rm                    
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:40082/youriblog_18004             rm                    
     Default       buddy.activeeon.com       Free    9/3/14 10:06 AM (12mn ago)     pnp://buddy.activeeon.com:54487/opnestack_3256        rm                    
     Default       chocolate.activeeon.com   Free    9/3/14 10:06 AM (12mn ago)     pnp://chocolate.activeeon.com:60312/chocolate_12342   rm                    
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:55577/youriblog_18003             rm                    
     Default       buddy.activeeon.com       Free    9/3/14 10:06 AM (12mn ago)     pnp://buddy.activeeon.com:57712/opnestack_3257        rm                    
     Default       chocolate.activeeon.com   Busy    9/3/14 9:42 AM (36mn ago)      pnp://chocolate.activeeon.com:38029/chocolate_12345   rm         scheduler  
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:42477/youriblog_18002             rm                    
     Default       192.168.1.155             Busy    9/2/14 6:35 PM (15h42mn ago)   pnp://192.168.1.155:48703/youriblog_942               rm         scheduler  
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:54510/youriblog_18007             rm                    
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:51451/youriblog_938               rm                    
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:58453/youriblog_945               rm                    
     Default       chocolate.activeeon.com   Free    9/3/14 10:06 AM (12mn ago)     pnp://chocolate.activeeon.com:43302/chocolate_12341   rm                    
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:59845/youriblog_944               rm                    
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:47751/youriblog_941               rm                    
     Default       tartopum.activeeon.com.   Free    9/3/14 10:06 AM (12mn ago)     pnp://tartopum.activeeon.com.:60346/tartopum_1836     rm                    
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:46348/youriblog_18009             rm                    
     Default       tartopum.activeeon.com.   Free    9/3/14 10:06 AM (12mn ago)     pnp://tartopum.activeeon.com.:60319/tartopum_3264     rm                    
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:47137/youriblog_937               rm                    
     Default       chocolate.activeeon.com   Free    9/3/14 10:06 AM (12mn ago)     pnp://chocolate.activeeon.com:42262/chocolate_12344   rm                    
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:47831/youriblog_946               rm                    
     Default       192.168.1.155             Busy    9/2/14 6:12 PM (16h6mn ago)    pnp://192.168.1.155:53398/youriblog_18006             rm         scheduler  
     Default       192.168.1.163             Busy    9/2/14 5:17 PM (17h0mn ago)    pnp://192.168.1.163:2442/grimm_936                    rm         scheduler  
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:48474/youriblog_18013             rm                    
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:33145/youriblog_939               rm                    
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:41276/youriblog_18018             rm                    
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:36005/youriblog_943               rm                    
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:50978/youriblog_18008             rm                    
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:51915/youriblog_18005             rm                    
     Default       192.168.1.155             Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:49018/youriblog_940               rm                    
     Default       buddy.activeeon.com       Free    9/3/14 10:06 AM (12mn ago)     pnp://buddy.activeeon.com:57702/opnestack_3259        rm
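With this many rows, the stuck nodes are easy to miss. A small helper can pull the Busy entries out of a pasted listing. This is only a sketch: the class and method names are made up for illustration, and it assumes the column layout shown above (columns separated by two or more spaces, STATE in the third column, URL in the fifth). It is not part of the ProActive CLI.

```java
import java.util.ArrayList;
import java.util.List;

public class BusyNodeFilter {

    /**
     * Returns the URL column of every row whose STATE column is "Busy".
     * Assumes columns are separated by runs of two or more spaces, as in
     * the listnodes output pasted in this issue.
     */
    static List<String> busyNodeUrls(String listing) {
        List<String> urls = new ArrayList<>();
        for (String line : listing.split("\\R")) {
            // Single spaces inside a column (e.g. the SINCE text) are kept,
            // because only runs of 2+ whitespace characters act as separators.
            String[] cols = line.trim().split("\\s{2,}");
            if (cols.length >= 5 && cols[2].equals("Busy")) {
                urls.add(cols[4]);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String listing = String.join("\n",
            "     SOURCE NAME   HOSTNAME        STATE   SINCE                          URL                                     PROVIDER   USED BY",
            "     Default       192.168.1.163   Busy    9/2/14 5:13 PM (17h5mn ago)    pnp://192.168.1.163:2464/grimm_3060     rm         scheduler",
            "     Default       192.168.1.155   Free    9/3/14 10:06 AM (12mn ago)     pnp://192.168.1.155:40082/youriblog_18004   rm");
        for (String url : busyNodeUrls(listing)) {
            System.out.println(url);
        }
    }
}
```

Applied to the listing above, this would single out the six nodes still marked Busy and used by the scheduler.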

activeeon-bot commented 9 years ago

Original comment posted by Youri Bonnaffe on 04, Mar 2015 at 10:06 AM

In the Scheduler, when a task is terminated, if an error occurs in org.ow2.proactive.scheduler.core.TerminationData#handleTermination (for instance, a DB call fails), the node might not be released.
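To illustrate the suspected failure mode, here is a minimal, self-contained sketch. It is not the actual TerminationData code: `FakeNodePool`, `TaskStore`, `terminateUnsafe` and `terminateSafe` are hypothetical names invented for this example. It only shows how an exception thrown by a DB call before the release step leaves a node busy, and how a `finally` block would avoid that.

```java
import java.util.HashSet;
import java.util.Set;

public class ReleaseSketch {

    /** Hypothetical stand-in for the resource manager's set of busy nodes. */
    static class FakeNodePool {
        private final Set<String> busy = new HashSet<>();
        void acquire(String nodeUrl) { busy.add(nodeUrl); }
        void release(String nodeUrl) { busy.remove(nodeUrl); }
        boolean isBusy(String nodeUrl) { return busy.contains(nodeUrl); }
    }

    /** Hypothetical stand-in for the DB update done when a task terminates. */
    interface TaskStore {
        void markTerminated(String taskId) throws Exception;
    }

    /** Release happens after the DB call, so a DB failure skips it. */
    static void terminateUnsafe(String taskId, String nodeUrl,
                                TaskStore store, FakeNodePool pool) throws Exception {
        store.markTerminated(taskId);
        pool.release(nodeUrl);
    }

    /** Release happens in finally, so it runs even if the DB call throws. */
    static void terminateSafe(String taskId, String nodeUrl,
                              TaskStore store, FakeNodePool pool) throws Exception {
        try {
            store.markTerminated(taskId);
        } finally {
            pool.release(nodeUrl);
        }
    }

    public static void main(String[] args) {
        // Simulated DB failure, e.g. a lock timeout under heavy load.
        TaskStore failingStore = taskId -> {
            throw new IllegalStateException("simulated DB failure for " + taskId);
        };
        FakeNodePool pool = new FakeNodePool();

        pool.acquire("node-a");
        try {
            terminateUnsafe("task-1", "node-a", failingStore, pool);
        } catch (Exception e) {
            // The error is reported, but the node was never released.
        }
        System.out.println("unsafe path, node-a busy: " + pool.isBusy("node-a"));

        pool.acquire("node-b");
        try {
            terminateSafe("task-2", "node-b", failingStore, pool);
        } catch (Exception e) {
            // The error is still reported, and the node has been released.
        }
        System.out.println("safe path, node-b busy: " + pool.isBusy("node-b"));
    }
}
```

Whether releasing the node unconditionally is correct for the real Scheduler depends on what else handleTermination must guarantee; the sketch only shows the ordering problem.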

fviale commented 8 years ago

Seems related to a DB issue; it might be better now that the DB handles concurrent requests.

ow2-proactive / scheduling

Nodes can stay busy when the Scheduler processes a lot of jobs #1910