terascope / teraslice

Scalable data processing pipelines in JavaScript
https://terascope.github.io/teraslice/
Apache License 2.0

Add job delete to Teraslice API #3680

Open godber opened 1 month ago

godber commented 1 month ago

One problem we've often had in running lots of Teraslice jobs is keeping track of jobs that are meant to be running over longer periods of time. Sometimes jobs are replaced with newer versions or jobs are altogether abandoned. These unused jobs accumulate on the cluster and it can become confusing as to which jobs are really meant to be running. I tried to sneak the active/inactive property onto the jobs to address that, but it might be best to simply allow job deletion.

We briefly discussed the merits of just deleting the job alone and leaving the associated executions, slices, and analytics "orphaned" in the state cluster. My gut reaction was that orphaning those resources was probably better than dealing with deleting jobs and making ES deal with deleted records in those indices. Ultimately the state indices are all timeseries and will get curated away.

kstaken commented 1 month ago

You can deal with the orphaning problem by not really deleting the record, just label it as deleted and then exclude it by default from any job listing views. Then add a parameter that can show deleted records if needed.

It should also not be possible to start a job that is marked as deleted and it should not be possible to delete a job that is currently running.

Deleted record cleanup can then become a batch process that runs periodically outside of a reasonable window for job history.
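A minimal sketch of the rules above, using an in-memory stand-in rather than the real state storage. The `SoftDeleteJobStore` class, its method names, and the `showDeleted` flag are hypothetical and only illustrate the idea; `_deleted` is the property name used later in this thread.

```typescript
// Sketch only: an in-memory stand-in, not Teraslice's actual job storage.
interface SoftDeleteJobRecord {
    job_id: string;
    _deleted: boolean;
}

class SoftDeleteJobStore {
    private jobs = new Map<string, SoftDeleteJobRecord>();
    private running = new Set<string>();

    create(job_id: string): void {
        this.jobs.set(job_id, { job_id, _deleted: false });
    }

    // A job that is marked as deleted must not be started.
    start(job_id: string): void {
        const job = this.find(job_id);
        if (job._deleted) {
            throw new Error(`job ${job_id} is deleted and cannot be started`);
        }
        this.running.add(job_id);
    }

    stop(job_id: string): void {
        this.running.delete(job_id);
    }

    // Soft delete: flag the record instead of removing it.
    // A job that is currently running must not be deleted.
    markDeleted(job_id: string): void {
        const job = this.find(job_id);
        if (this.running.has(job_id)) {
            throw new Error(`job ${job_id} is running and cannot be deleted`);
        }
        job._deleted = true;
    }

    // Deleted jobs are excluded by default and shown only on request.
    list(showDeleted = false): SoftDeleteJobRecord[] {
        return Array.from(this.jobs.values())
            .filter((job) => showDeleted || !job._deleted);
    }

    private find(job_id: string): SoftDeleteJobRecord {
        const job = this.jobs.get(job_id);
        if (job === undefined) throw new Error(`job ${job_id} not found`);
        return job;
    }
}
```

Because the record is only flagged, executions that reference the job still resolve to it, which is what avoids the orphaning problem.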

godber commented 1 month ago

Yeah, all of that makes sense, perhaps we can extend:

> Deleted record cleanup can then become a batch process that runs periodically outside of a reasonable window for job history.

We could store a date after which it is safe to delete the record, perhaps in a deleted_on property or a delete_after property.
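As a rough sketch of the `delete_after` variant: the soft delete stamps the record with a purge date, and the periodic batch process only removes records past that date. The function names, the `retentionDays` parameter, and the ISO 8601 string format are assumptions for illustration, not anything decided in this issue.

```typescript
// Sketch only: record shape and helpers are hypothetical.
interface PurgeableJobRecord {
    job_id: string;
    _deleted: boolean;
    // ISO 8601 timestamp after which the record may be purged; absent on live jobs.
    delete_after?: string;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Flag a job as deleted and record when it becomes safe to purge.
function softDeleteWithRetention(
    job: PurgeableJobRecord,
    now: Date,
    retentionDays: number
): PurgeableJobRecord {
    const deleteAfter = new Date(now.getTime() + retentionDays * DAY_MS);
    return { ...job, _deleted: true, delete_after: deleteAfter.toISOString() };
}

// Batch step: split records into those to keep and those past their delete_after date.
function partitionForPurge(
    jobs: PurgeableJobRecord[],
    now: Date
): { keep: PurgeableJobRecord[]; purge: PurgeableJobRecord[] } {
    const keep: PurgeableJobRecord[] = [];
    const purge: PurgeableJobRecord[] = [];
    for (const job of jobs) {
        const expired = job._deleted
            && job.delete_after !== undefined
            && new Date(job.delete_after).getTime() <= now.getTime();
        if (expired) {
            purge.push(job);
        } else {
            keep.push(job);
        }
    }
    return { keep, purge };
}
```

Storing `delete_after` rather than `deleted_on` keeps the retention window with the record, so the batch process needs no configuration of its own; a `deleted_on` property would instead let the window be changed later without rewriting records.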

busma13 commented 1 month ago

> Then add a parameter that can show deleted records if needed.

Would we want this parameter to show only the jobs marked deleted or to show all records (with deleted records labeled as such)? Or maybe this will be dependent on the listing in question?

EDIT: we will add options for both.

EDIT2: Having this third option made things a bit confusing, so we switched to true|false. If a user wants the whole list, they can combine the true list and the false list.
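A small sketch of how the true|false filter might behave. The query parameter name `deleted`, its default of false, and the helper functions are assumptions for illustration only; the thread settles only on the true|false values.

```typescript
// Sketch only: the parameter name "deleted" and these helpers are hypothetical.
interface ListedJobRecord {
    job_id: string;
    _deleted: boolean;
}

// Accept only "true" or "false"; a missing value is treated as false
// so that deleted jobs stay hidden by default.
function parseDeletedParam(value?: string): boolean {
    if (value === undefined || value === 'false') return false;
    if (value === 'true') return true;
    throw new Error(`deleted must be "true" or "false", got "${value}"`);
}

// deleted=false returns only live jobs, deleted=true returns only deleted jobs.
function listJobs(jobs: ListedJobRecord[], deletedParam?: string): ListedJobRecord[] {
    const deleted = parseDeletedParam(deletedParam);
    return jobs.filter((job) => job._deleted === deleted);
}

// The whole list is the false list followed by the true list.
function listAllJobs(jobs: ListedJobRecord[]): ListedJobRecord[] {
    return listJobs(jobs, 'false').concat(listJobs(jobs, 'true'));
}
```

With only two values, each request returns a homogeneous list, so callers never need to inspect `_deleted` on individual records unless they combine the lists themselves.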

godber commented 4 days ago

We have decided that only Jobs should have a _deleted property. Executions should not have a _deleted property. This means the following things:

godber commented 4 days ago

> We have decided that only Jobs should have a _deleted property. Executions should not have a _deleted property.

I am rescinding this comment. Doing this makes filtering /txt/ex too expensive. I am overthinking it, I guess.

busma13 commented 2 days ago

I am also going to mark the _active and _inactive endpoints as deprecated.