domsj closed this issue 7 years ago
The maintenance process distribution was designed as "foreach backend, run a maintenance process on all storage nodes until the required number of maintenance processes is reached". The code doesn't know about datacenters, so it doesn't take placement into account.
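Roughly, the current behaviour could be sketched like this (all names are hypothetical and only illustrate the datacenter-unaware iteration described above, not the actual framework code):

```python
def deploy_maintenance_processes(backend, storage_nodes, required_amount):
    """Naive placement: walk over all storage nodes in arbitrary order and
    'deploy' until the required number of processes is reached.
    No datacenter/domain information is consulted at any point."""
    deployed = []
    for node in storage_nodes:
        if len(deployed) >= required_amount:
            break
        deployed.append(node)  # stand-in for actually starting a process here
    return deployed
```

Because the node order carries no locality information, all processes for a backend can end up in a single datacenter.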
A solution would be to prefer nodes that already have disks claimed by the backend.
When do you deploy maintenance processes? Probably immediately when setting up a backend, at which point it doesn't have any disks yet. So you would either have to delay deploying the maintenance processes, or move them around in some sort of periodic checkup once disks have been claimed.
Valid point. We currently have a function that validates the current maintenance processes and removes/adds some if required. We can extend it so it can basically move them around, and call it more frequently (now it's only called when you set up a backend or add a new node), e.g. every day.
Then we could have an implementation like this:
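A minimal sketch of such a periodic checkup, assuming hypothetical node dicts with a `claimed_by` list (the real framework models this differently): it prefers nodes whose disks are claimed by the backend and reports which processes to add and which to remove so the placement converges.

```python
def checkup_maintenance(backend, storage_nodes, current, required_amount):
    """Periodically re-evaluate maintenance process placement for a backend.
    Nodes with disks claimed by the backend are preferred; processes on
    non-preferred nodes are moved once a preferred node becomes available.
    All names here are illustrative, not the actual framework API."""
    preferred = [n for n in storage_nodes if backend in n.get('claimed_by', [])]
    others = [n for n in storage_nodes if n not in preferred]
    # Desired placement: fill the required slots with preferred nodes first.
    desired = (preferred + others)[:required_amount]
    to_add = [n for n in desired if n not in current]
    to_remove = [n for n in current if n not in desired]
    return to_add, to_remove
```

Running this on a schedule (e.g. daily) would let the initial, disk-less deployment be corrected automatically after disks have been claimed.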
After some discussion with @domsj: we could take it further and start attaching meaning to the domain tags assigned to a backend. We can add these domains to nodes as well, and prefer to run maintenance processes in the same domain. We could even start assigning roles to nodes.
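The domain-tag preference above could be sketched as a simple ranking, assuming hypothetical node dicts with a `domains` list (the real tagging model may differ): nodes sharing more domain tags with the backend sort first, so maintenance processes land in the same datacenter as the backend's data.

```python
def rank_nodes(backend_domains, nodes):
    """Rank candidate nodes for maintenance processes by domain affinity.
    A node sharing more domain tags with the backend scores higher and is
    chosen first. Node structure here is a hypothetical sketch."""
    def score(node):
        shared = len(set(node.get('domains', [])) & set(backend_domains))
        return -shared  # more shared domains sorts earlier
    return sorted(nodes, key=score)
```

Picking the first `required_amount` entries of this ranking would then replace the arbitrary node order used today.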
@wimpers, this ticket was suddenly qualified; which option do you want implemented?
After discussion with @khenderick:
As a side effect https://github.com/openvstorage/framework/issues/569 should be fixed.
For a global backend using only remote backends this is an issue, as there are no local nodes on which to run the maintenance processes.
Fixed by #202, packaged in openvstorage-backend-1.7.3-rev.694.f9f958c
Will verify this next week during the reinstall/upgrade of the OVH environment.
On OVH we can indeed see that the maintenance agents are placed in their respective datacenters, although we have only been able to observe this across two sites.
I noticed on the OVH environment that maintenance for the hdd-grav backend was running on some of the roubaix nodes (and no maintenance process on any node of the gravelin datacenter). With such a setup you lose local repair: all data has to travel back and forth between the datacenters...