Closed DifferentialOrange closed 5 months ago
The problem @filonenko-mikhail had pointed out: error is lost on restart, but inconsistency is not.
For now, I don't see any perfect solution to this one, for two reasons:

- Migrations may be started via `migrator.up` from any node. If we start to check for inconsistencies on instance start, it may break the cluster in case of a new cluster start / full cluster restart / half cluster restart / etc., since it would be N^2 again.
- Persisting an error on the `up` caller also doesn't seem like a good solution, since one may start migrations from an RO instance.
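To make the N^2 concern concrete: issues are collected from every one of the N instances, and a consistency check on any single instance is itself a full-cluster map-reduce touching all N instances. A minimal back-of-the-envelope sketch (Python for brevity; `consistency_check_fanout` is a hypothetical model, not anything from the module):

```python
def consistency_check_fanout(n_instances):
    """Network requests if every instance verifies migration consistency.

    Hypothetical model: issue collection polls all n instances, and each
    polled instance runs its own full-cluster map-reduce (one request
    per node) to check whether migrations are consistent.
    """
    requests_per_check = n_instances  # one map-reduce hits every node
    return n_instances * requests_per_check

# Quadratic growth: doubling the cluster quadruples the traffic.
assert consistency_check_fanout(4) == 16
assert consistency_check_fanout(8) == 64
```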
Nonetheless, this solution is broken even without restarts -- errors reset on each `up`. If an error has been caught on instance 1 and one then calls `up` on instance 2, everything will be consistent after the second `up`, yet the issue will still be there, since it is cached per-instance.
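The per-instance caching problem can be sketched as follows (Python for brevity; the real module is Lua, and `Instance`, `up`, and `get_issues` here are hypothetical stand-ins, not the module's API):

```python
# Hypothetical model of per-instance error caching, not the real migrations API.
class Instance:
    def __init__(self, name):
        self.name = name
        self.last_error = None  # cached per-instance, reset on each up()

    def up(self, fail=False):
        # Mirrors the behavior described above: each up() call resets
        # the cached error on THIS instance only.
        self.last_error = None
        if fail:
            self.last_error = "migration failed"

    def get_issues(self):
        return [self.last_error] if self.last_error else []

i1, i2 = Instance("instance-1"), Instance("instance-2")
i1.up(fail=True)  # error is cached on instance 1 only
i2.up()           # a successful up() on instance 2 ...
# ... does not clear instance 1's stale cached issue:
assert i1.get_issues() == ["migration failed"]
assert i2.get_issues() == []
```

The stale issue on instance 1 survives even though the cluster is consistent again after the second `up`.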
> The problem @filonenko-mikhail had pointed out: error is lost on restart, but inconsistency is not.
We have several issues like this in Cartridge. I propose just adding a note that the issue stays until restart.
> Nonetheless, this solution is broken even without restarts -- errors reset on each `up`. If an error has been caught on instance 1 and one then calls `up` on instance 2, everything will be consistent after the second `up`, yet the issue will still be there, since it is cached per-instance.
Maybe we could add a "clear cached issues" button in Cartridge? Users can check the actual status of migrations with the migrations tab, can't they?
Expose last operation error to Cartridge issues.
(Issues are also exposed to default Grafana dashboard, as well as default alerts.)
(The error message could be improved, but it has always been like this: I haven't changed anything here in this patch.)
The original issue was about exposing migrations inconsistency from the new migrations tab to Cartridge issues as well. But the straightforward approach is rather bad: checking inconsistency is a full-cluster map-reduce operation, and, if exposed to `get_issues`, it would emit N^2 network requests, since issues are collected from each instance, there is no way to check whether migrations are consistent without a cluster map-reduce, and there is no distinct migration provider -- any instance is a migration provider. And, since `get_issues` may trigger rather often, having such a feature may make the cluster unhealthy (we already had similar things with metrics [1]). Last error is reset on each operation call.

Closes #73