Lag Time Waiting for Monitoring Measurements

wittling commented 6 years ago

Right now, when scaling triggers the instantiation of a new VM, there is a lag between the VM being instantiated and the VM connecting to Zabbix. While the engine waits for this to happen, it throws an exception because the number of measurements requested does not equal the number of measurements received. No processing happens when this exception occurs; the cycle is skipped until the next interval.

I see several issues with this. Many VMs can be sacrificed on account of a single new instance, since no scaling decision is made for any of them while the engine is waiting on it. And what if that VM has a problem and never comes online, while the other VMs are pegged and desperately need to scale?

I could be wrong and maybe it needs more design thought, but I am thinking that once a VM is instantiated, the engine needs to subscribe to a queue. And when that VM is connected to Zabbix, it can send a message to the queue that the engine receives, and a flag or state condition is set to mark that VM as "eligible to participate" for purposes of counters and measurements towards scaling policies.
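
To make the idea concrete, here is a minimal sketch in Python. It is not the engine's actual code: the class, the method names, and the in-process queue (standing in for a real message broker) are all hypothetical, and it assumes something publishes the VM's id once the VM is visible in Zabbix.

```python
import queue
from typing import Dict, Optional, Set


class ScalingEngine:
    """Hypothetical engine state: which VMs may count towards scaling policies."""

    def __init__(self) -> None:
        # Stand-in for a broker queue; a message is the id of a VM
        # that has connected to Zabbix (assumed message format).
        self.ready_queue: "queue.Queue[str]" = queue.Queue()
        self.pending: Set[str] = set()   # instantiated, not yet reporting
        self.eligible: Set[str] = set()  # "eligible to participate"

    def on_vm_instantiated(self, vm_id: str) -> None:
        # A new VM is known to the engine but does not count yet.
        self.pending.add(vm_id)

    def drain_ready_messages(self) -> None:
        # Non-blocking: consume every message currently on the queue.
        while True:
            try:
                vm_id = self.ready_queue.get_nowait()
            except queue.Empty:
                return
            if vm_id in self.pending:
                self.pending.discard(vm_id)
                self.eligible.add(vm_id)

    def average_load(self, measurements: Dict[str, float]) -> Optional[float]:
        # Only eligible VMs are evaluated, so a pending VM cannot
        # cause the whole cycle to be skipped.
        self.drain_ready_messages()
        values = [measurements[vm] for vm in self.eligible if vm in measurements]
        return sum(values) / len(values) if values else None


engine = ScalingEngine()
engine.eligible.update({"vm-1", "vm-2"})
engine.on_vm_instantiated("vm-3")
print(engine.average_load({"vm-1": 90.0, "vm-2": 80.0}))  # 85.0, vm-3 ignored
engine.ready_queue.put("vm-3")  # vm-3 has connected to Zabbix
print(engine.average_load({"vm-1": 90.0, "vm-2": 80.0, "vm-3": 10.0}))  # 60.0
```

This sketch leaves a VM that never comes online in the pending set forever; it never blocks scaling for the others, but some timeout or cleanup would still need to be designed.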

wittling commented 6 years ago

I was checking on the status of this and see no activity on it. Has anyone reviewed it, and does anyone agree that this is a problem? I think the fix could be more complicated than it might seem, but I may decide to dig in, address it locally, and then discuss a pull request. I would need to put a development system in place to do that.

mpauls commented 5 years ago

Totally agree with you, and obviously there is a lot of room for improvement. The current version of the autoscaling engine is quite basic. It recognises the need to scale and scales by the number of instances defined in the descriptor. Only once the whole VNF (and also the NS) goes back to ACTIVE (so the EMS and the Zabbix agent are up and running) may it scale again. Sometimes it also takes a while until measurement results are visible in Zabbix.

Potential ideas might be 1) to keep a pool of prepared or standby VMs (like the fault management system uses), or 2) to extend the FMS to cope with improperly instantiated VNFCs.
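
As a rough illustration of the first idea only, a standby pool could look something like the Python sketch below. Everything in it is hypothetical (the class, the `boot_vm` callable, the pool size), and it assumes a pooled VM is already booted and reporting measurements before it is handed out.

```python
from collections import deque


class StandbyPool:
    """Hypothetical pool of pre-booted VMs that are already being monitored."""

    def __init__(self, boot_vm, target_size: int):
        self._boot_vm = boot_vm  # assumed callable returning the id of a ready VM
        self._target_size = target_size
        self._ready = deque(boot_vm() for _ in range(target_size))

    def acquire(self) -> str:
        # Scale-out takes a warm VM if one is available,
        # otherwise it falls back to a cold boot with the usual lag.
        return self._ready.popleft() if self._ready else self._boot_vm()

    def refill(self) -> int:
        # Top the pool back up to its target size and report how many VMs
        # were booted; a real system would do this in the background.
        booted = 0
        while len(self._ready) < self._target_size:
            self._ready.append(self._boot_vm())
            booted += 1
        return booted
```

The trade-off is that standby VMs consume resources while idle, so the pool size would have to be weighed against how much lag is acceptable.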

I am closing this issue for the moment, since it seems like an extension/improvement rather than a real issue. I'd be happy to see a PR that deals with it.

openbaton / autoscaling-engine

Lag Time Waiting for Monitoring Measurements #21