Closed fridex closed 4 years ago
In adviser we didn't set podGC https://github.com/argoproj/argo/blob/master/examples/pod-gc-strategy.yaml because we have to look at the pod for some time and also
ttlStrategy:
# Keep adviser 1h to let users browse the logs.
secondsAfterSuccess: 3600
# secondsAfterFailure kept unset to keep failed advisers
we don't want the failed workflows to be deleted.
But in general, why don't we make adviser fail only for operational issues (e.g. OOM), while if the solver is not there or a package is not there, we don't make adviser failing, so the workflow won't fail. In any case when there is an error related to knowledge in adviser we have a report created. For example, when we answer there are no observations for this software stack
, we don't make adviser fail. I think we should not make adviser fail also when solver is not known or a package in the software stack is not known. These are not operational failures of adviser, because it is working correctly, in those cases knowledge is missing and we take care for that.. we are creating unresolved-package-handler, for packages not solved yet for example. WDYT? @fridex
In adviser we didn't set podGC https://github.com/argoproj/argo/blob/master/examples/pod-gc-strategy.yaml because we have to look at the pod for some time and also
ttlStrategy: # Keep adviser 1h to let users browse the logs. secondsAfterSuccess: 3600 # secondsAfterFailure kept unset to keep failed advisers
we don't want the failed workflows to be deleted.
Yes, I kept it for debug purposes - if something fails, let's keep it in the cluster for now so we can debug it. However, logs are deleted after some time in openshift anyway so we store objects that keep info about parameters supplied.
But in general, why don't we make adviser fail only for operational issues (e.g. OOM), while if the solver is not there or a package is not there, we don't make adviser failing, so the workflow won't fail. In any case when there is an error related to knowledge in adviser we have a report created. For example, when we answer
there are no observations for this software stack
, we don't make adviser fail. I think we should not make adviser fail also when solver is not known or a package in the software stack is not known. These are not operational failures of adviser, because it is working correctly, in those cases knowledge is missing and we take care for that.. we are creating unresolved-package-handler, for packages not solved yet for example. WDYT? @fridex
Long term, I think it might be good idea to clean resources - even if components fail, we should clean them. A cluster where we run Thoth offers us computational power to satisfy user requests or prepare for them. The more consumers we will have of our services, the more failed objects we will possibly store on the cluster side which might lead to cluster operational issues (also possible security implications for denial of service). If there is anything bad, we should have a record in monitoring services with all the relevant info to reproduce the issue (as we have now) and clear the cluster for Thoth customers so we can serve incoming requests.
High level regarding the failed adviser runs - if the adviser does not compute recommendations, it did not satisfy user request and hence it was not successful. The return code of adviser can notify about error type so we can easily integrate it in metrics-exporter (see how many advisers failed due to missing packages, wrong user input, ...) without actually diving into log/result aggregation (simply by checking error code that is available via openshift api when querying directly for an object).
Build succeeded.
recheck
Build succeeded.
Build succeeded (gate pipeline).
Successfully Merged. Thanks for the contribution.
Build succeeded.