cartridge: expose operation last error to issues

DifferentialOrange commented 5 months ago

Expose last operation error to Cartridge issues.

(Issues are also exposed to the default Grafana dashboard, as well as to the default alerts.)

(The error message could be improved, but it has always been like this: I haven't changed anything here in this patch.)

The original issue was about exposing migrations inconsistency from the new migrations tab to Cartridge issues as well. But the straightforward approach is rather bad: checking for inconsistency is a full-cluster map-reduce operation, and, if exposed to get_issues, it will emit N^2 network requests, since issues are collected from each instance, there is no way to check whether migrations are consistent without a cluster map-reduce, and there is no distinct migrations provider -- any instance is a migrations provider. And, since get_issues may be triggered rather often, having such a feature may make the cluster unhealthy (we already had similar things with metrics [1]). The last error is reset on each operation call.

[1] https://github.com/tarantool/metrics/pull/243
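To make the N^2 estimate concrete, here is a rough back-of-the-envelope sketch in Python. It is not code from this patch: the function name is hypothetical, and it assumes that one consistency check means one request from the checking instance to every other instance.

```python
def requests_per_issues_poll(instances: int) -> int:
    """Count network requests for one round of issue collection,
    assuming every instance runs its own cluster-wide consistency check."""
    # Assumption: a single check contacts every other instance once.
    per_instance = instances - 1
    # Issues are collected from each instance, so each one repeats the check.
    return instances * per_instance


if __name__ == "__main__":
    for n in (3, 10, 50):
        print(n, requests_per_issues_poll(n))
```

Under these assumptions the count grows as N * (N - 1), so every get_issues round on a large cluster would cost on the order of N^2 requests.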

Closes #73

DifferentialOrange commented 5 months ago

The problem @filonenko-mikhail had pointed out: error is lost on restart, but inconsistency is not.

DifferentialOrange commented 5 months ago

For now, I don't see any perfect solution to this one, for two reasons:

- checking for inconsistency is always a full-cluster operation,
- the module does not have a single entrypoint in terms of Cartridge roles -- a user can trigger migrator.up from any node.

If we start to check for inconsistencies on instance start, it may break the cluster on a new cluster start, a full cluster restart, a half cluster restart, etc., since it would be N^2 requests again.

DifferentialOrange commented 5 months ago

Persisting the error on the instance that called up also doesn't seem like a good solution, since one may start migrations from an RO instance.

DifferentialOrange commented 5 months ago

Nonetheless, this solution is broken even without restarts. Errors are reset on each up, but if an error has been caught on instance 1, and then one calls up on instance 2 and everything is consistent after the second up, the issue will still be there, since it is cached per instance.
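The two failure modes discussed in this thread can be reproduced with a toy model. This is only a sketch: the Instance class and its methods are hypothetical stand-ins written in Python, not the Cartridge or migrations API, and they assume the last error lives purely in instance memory.

```python
class Instance:
    """Hypothetical stand-in for a cluster instance (not the real API)."""

    def __init__(self, name):
        self.name = name
        self.last_error = None  # assumption: kept in memory only

    def up(self, succeeds):
        # The last error is reset on each operation call on this instance.
        self.last_error = None
        if not succeeds:
            self.last_error = "migrations failed"

    def get_issues(self):
        if self.last_error is None:
            return []
        return [f"{self.name}: {self.last_error}"]

    def restart(self):
        # Nothing is persisted, so the cached error disappears.
        self.last_error = None


i1, i2 = Instance("instance-1"), Instance("instance-2")
i1.up(succeeds=False)  # error is cached on instance 1
i2.up(succeeds=True)   # cluster is now consistent
stale_issues = i1.get_issues() + i2.get_issues()

i1.restart()
issues_after_restart = i1.get_issues()
```

In this model stale_issues still contains the instance-1 entry even though the second up succeeded, and issues_after_restart is empty regardless of whether the cluster is actually consistent.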

yngvar-antonsson commented 4 months ago

> The problem @filonenko-mikhail had pointed out: error is lost on restart, but inconsistency is not.

We have several similar issues in Cartridge. I propose just adding a note that the issue stays until restart.

yngvar-antonsson commented 4 months ago

> Nonetheless, this solution is broken even without restarts. Errors are reset on each up, but if an error has been caught on instance 1, and then one calls up on instance 2 and everything is consistent after the second up, the issue will still be there, since it is cached per instance.

Maybe we could add some "clear cached issues" button in Cartridge? Users can check the actual status of migrations with the migrations tab, can't they?

tarantool / migrations

cartridge: expose operation last error to issues #74