thoth-station / cleanup-job

A job for cleaning up old OpenShift objects created by one-time analyses
https://thoth-station.github.io/
GNU General Public License v3.0
1 stars 5 forks source link

Clean up old Argo workflows #296

Closed fridex closed 4 years ago

ghost commented 4 years ago

Build succeeded.

pacospace commented 4 years ago

In adviser we didn't set podGC https://github.com/argoproj/argo/blob/master/examples/pod-gc-strategy.yaml because we have to look at the pod for some time and also

      ttlStrategy:
        # Keep adviser 1h to let users browse the logs.
        secondsAfterSuccess: 3600
        # secondsAfterFailure kept unset to keep failed advisers

we don't want the failed workflows to be deleted.

But in general, why don't we make adviser fail only for operational issues (e.g. OOM), while if the solver is not there or a package is not there, we don't make adviser failing, so the workflow won't fail. In any case when there is an error related to knowledge in adviser we have a report created. For example, when we answer there are no observations for this software stack, we don't make adviser fail. I think we should not make adviser fail also when solver is not known or a package in the software stack is not known. These are not operational failures of adviser, because it is working correctly, in those cases knowledge is missing and we take care for that.. we are creating unresolved-package-handler, for packages not solved yet for example. WDYT? @fridex

fridex commented 4 years ago

In adviser we didn't set podGC https://github.com/argoproj/argo/blob/master/examples/pod-gc-strategy.yaml because we have to look at the pod for some time and also

      ttlStrategy:
        # Keep adviser 1h to let users browse the logs.
        secondsAfterSuccess: 3600
        # secondsAfterFailure kept unset to keep failed advisers

we don't want the failed workflows to be deleted.

Yes, I kept it for debug purposes - if something fails, let's keep it in the cluster for now so we can debug it. However, logs are deleted after some time in openshift anyway so we store objects that keep info about parameters supplied.

But in general, why don't we make adviser fail only for operational issues (e.g. OOM), while if the solver is not there or a package is not there, we don't make adviser failing, so the workflow won't fail. In any case when there is an error related to knowledge in adviser we have a report created. For example, when we answer there are no observations for this software stack, we don't make adviser fail. I think we should not make adviser fail also when solver is not known or a package in the software stack is not known. These are not operational failures of adviser, because it is working correctly, in those cases knowledge is missing and we take care for that.. we are creating unresolved-package-handler, for packages not solved yet for example. WDYT? @fridex

Long term, I think it might be good idea to clean resources - even if components fail, we should clean them. A cluster where we run Thoth offers us computational power to satisfy user requests or prepare for them. The more consumers we will have of our services, the more failed objects we will possibly store on the cluster side which might lead to cluster operational issues (also possible security implications for denial of service). If there is anything bad, we should have a record in monitoring services with all the relevant info to reproduce the issue (as we have now) and clear the cluster for Thoth customers so we can serve incoming requests.

High level regarding the failed adviser runs - if the adviser does not compute recommendations, it did not satisfy user request and hence it was not successful. The return code of adviser can notify about error type so we can easily integrate it in metrics-exporter (see how many advisers failed due to missing packages, wrong user input, ...) without actually diving into log/result aggregation (simply by checking error code that is available via openshift api when querying directly for an object).

ghost commented 4 years ago

Build succeeded.

goern commented 4 years ago

recheck

ghost commented 4 years ago

Build succeeded.

ghost commented 4 years ago

Build succeeded (gate pipeline).

sesheta commented 4 years ago

Successfully Merged. Thanks for the contribution.