hummingbird-project / swift-jobs

Offload work your server would be doing to another server
Apache License 2.0

Scheduled Jobs #13

Closed adam-fowler closed 2 weeks ago

adam-fowler commented 3 weeks ago

Currently there is an unmerged PR #5 that adds a JobScheduler service. This is a service, run on one machine, that adds jobs to a job queue at scheduled times (to be processed by this or other instances). Job schedules are repeating and can be every minute, hourly, daily, on a day of the week, monthly, or yearly. The schedule is built in code; there is no permanent storage of jobs outside of your executable.

A Job schedule might look like this

let jobSchedule = JobSchedule([
    (job: SundayCleanupJob(), schedule: .weekly(day: .sunday, hour: 4)),
    (job: BirthdayJob(), schedule: .yearly(month: .march, date: 4)),
])
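As a sketch of how a schedule entry like `.weekly(day: .sunday, hour: 4)` could compute its next fire date (the helper below is illustrative only, not part of swift-jobs), Foundation's `Calendar` can do the date arithmetic:

```swift
import Foundation

// Hypothetical sketch (not the swift-jobs API): computing the next fire
// date for a .weekly(day: .sunday, hour: 4) style schedule using
// Foundation's Calendar.
func nextWeeklyRun(after date: Date, weekday: Int, hour: Int) -> Date? {
    var components = DateComponents()
    components.weekday = weekday    // 1 = Sunday in the Gregorian calendar
    components.hour = hour
    var calendar = Calendar(identifier: .gregorian)
    calendar.timeZone = TimeZone(identifier: "UTC")!
    return calendar.nextDate(after: date, matching: components,
                             matchingPolicy: .nextTime)
}
```

For example, starting from Monday 2024-01-01 00:00 UTC, this returns the following Sunday at 04:00 UTC.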

I've opened this issue to discuss whether this could be considered enough for a scheduled job service. If not, what else would you expect to be included?

adam-fowler commented 3 weeks ago

@Joannis @thoven87

thoven87 commented 3 weeks ago

@adam-fowler I think we would want to persist whether an instance of a job is already running or has already run. This would make a scheduled job idempotent.

Without persistent storage, I think the following can happen:

1. More than one job can be picked up by different workers.
2. The same job could re-run after a restart.

The first issue could be solved with a scheduler service, but that'll complicate things further. One would need a central scheduler service running somewhere.

For the second issue: if I schedule a job to run every month at 12:00 AM UTC and I deploy some change at 12:05 UTC, would the job re-run after the server restart?

I looked at the code again; I think we should allow options so that only one instance of a job type runs at a given schedule, even without persistent storage. Say I have a scheduled job that generates weekly newsletters and another job that generates stats every day at 12 AM UTC; could there be a chance that the stats job has more than one instance running?

thoven87 commented 3 weeks ago

Overall, the scheduled job implementation looks great. I think it would be nice to add a small random delay before each scheduled job kicks in, say between 1 and 5 seconds. Is it best practice to have just one job controller and register all jobs to said controller?
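The jitter suggestion could be sketched as follows; the helper name and the 1–5 second default are assumptions for illustration, not part of swift-jobs:

```swift
import Foundation

// Sketch of the jitter suggestion above: pick a random delay between
// 1 and 5 seconds before a scheduled job kicks in, so multiple
// instances don't all fire at the same instant. Names are illustrative.
func jitterNanoseconds(minSeconds: UInt64 = 1, maxSeconds: UInt64 = 5) -> UInt64 {
    UInt64.random(in: minSeconds...maxSeconds) * 1_000_000_000
}

// Usage inside a job runner:
// try await Task.sleep(nanoseconds: jitterNanoseconds())
```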

Joannis commented 3 weeks ago

My main issue with this implementation is that the schedule isn't persisted. If the scheduler is offline, even for a brief period, any jobs due during the offline period will not be run.

This could be resolved fairly simply, by being able to get a Date representing when the last job was run. That way, the scheduler can queue every job that needs to run after that Date. We could probably get this date from the driver.

There's a caveat to this approach though: if the job queue is struggling to keep up, you might spawn more work than it can handle. How I envision that being resolved is by allowing each schedule to define behaviour for these scenarios: whether to catch up, or limit it to N instances (1, for example).

We'd need to educate users on the advantages and caveats of both approaches, and make sure they choose the right one and implement it correctly though.

adam-fowler commented 3 weeks ago

This could be resolved fairly simply, by being able to get a Date representing when the last job was run. That way, the scheduler can queue every job that needs to run after that Date. We could probably get this date from the driver.

That was what I was thinking. I could use the persist framework to abstract this.

There's a caveat to this approach though

I can see us adding an .all vs .latest parameter, which indicates whether all scheduled jobs should be run or just the latest.
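The proposed .all vs .latest behaviour could be sketched like this; given the persisted date of the last queued run and the schedule's fire dates, decide which missed runs to push. All names here are assumptions, not the swift-jobs API:

```swift
import Foundation

// Sketch of the proposed .all vs .latest option (names are
// illustrative). Given the persisted last-run date and the schedule's
// fire dates, return the missed runs to queue.
enum CatchUpOption { case all, latest }

func runsToPush(since lastRun: Date, fireDates: [Date],
                option: CatchUpOption) -> [Date] {
    let missed = fireDates.filter { $0 > lastRun }.sorted()
    switch option {
    case .all:    return missed            // queue every missed run
    case .latest: return missed.last.map { [$0] } ?? []  // only the newest
    }
}
```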

thoven87 commented 3 weeks ago

This could be resolved fairly simply, by being able to get a Date representing when the last job was run. That way, the scheduler can queue every job that needs to run after that Date. We could probably get this date from the driver.

That was what I was thinking. I could use the persist framework to abstract this.

There's a caveat to this approach though

I can see us adding an .all vs .latest parameter, which indicates whether all scheduled jobs should be run or just the latest.

I think the .all approach here could be very useful. It would allow further options in the future to build a dependency graph for jobs. I could have an hourly job which computes an order summary, then a daily job that requires all 24 hourly partitions (hours 00–23).

I know the current implementation does not allow jobs to chain each other:

Ingest (daily) --> Analyze (daily) --> Check Integrity(daily) --> Email Errors(daily) --> Save Report(daily)

Alongside this, it would be great to have a separate job that cleans up previous entries. We could allow completed partitions to expire after n days.
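The partition-dependency idea above, a daily job requiring all 24 hourly partitions of an upstream job, could be checked with a helper like this; the "day-hour" partition-key format is an assumption for illustration:

```swift
// Sketch of the dependency idea above: a daily job only runs once all
// 24 hourly partitions of an upstream job are complete. The
// "YYYY-MM-DD-HH" partition-key format is an illustrative assumption.
func allHourlyPartitionsReady(completed: Set<String>, day: String) -> Bool {
    (0..<24).allSatisfy { hour in
        // zero-pad the hour so keys read "2024-03-04-05", "2024-03-04-23", etc.
        let paddedHour = hour < 10 ? "0\(hour)" : "\(hour)"
        return completed.contains("\(day)-\(paddedHour)")
    }
}
```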

adam-fowler commented 2 weeks ago

Ok, I have implemented storage of the date the last job was queued. I realised I can't use the persist framework for storing metadata, as it brings in hummingbird as a dependency. So I have added a requirement that job queue drivers provide a simple key/value store.
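The driver key/value requirement could look something like the sketch below; the protocol and method names are illustrative, not the actual swift-jobs API, and the in-memory store stands in for what a real driver (Redis, Postgres, …) would persist:

```swift
import Foundation

// Sketch of a "simple key/value store" requirement on job queue
// drivers, used to persist metadata such as the last date a job was
// queued. Protocol and method names are illustrative assumptions.
protocol JobMetadataStore {
    func getMetadata(_ key: String) async throws -> Data?
    func setMetadata(_ key: String, value: Data) async throws
}

// An in-memory stand-in for what a real driver would persist.
actor InMemoryMetadataStore: JobMetadataStore {
    private var storage: [String: Data] = [:]
    func getMetadata(_ key: String) async throws -> Data? { storage[key] }
    func setMetadata(_ key: String, value: Data) async throws {
        storage[key] = value
    }
}
```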

I know the current implementation does not allow jobs to chain each other:

There is nothing stopping you queuing one job from inside the closure of another job.

thoven87 commented 2 weeks ago

Ok, I have implemented storage of the date the last job was queued. I realised I can't use the persist framework for storing metadata, as it brings in hummingbird as a dependency. So I have added a requirement that job queue drivers provide a simple key/value store.

I think there is a possibility that no job starts, depending on the persist driver that's used. I just tested both open PRs and noticed the Postgres persist driver blocks with:

: [PostgresNIO] Trying to lease connection from `PostgresClient`, but `PostgresClient.run()` hasn't been called yet.

I suppose either get or set metadata is being called before the service is ready.

I know the current implementation does not allow jobs to chain each other:

There is nothing stopping you queuing one job from inside the closure of another job.

This is indeed possible currently. Should a UI be introduced to manage jobs, it would be great to see the DAG (directed acyclic graph) and, later on, data lineage.

Instead of queuing a job within another job, say JobB depends on JobA being completed.

We could define such requirements via configuration, as such:

JobA (Runs hourly) <- JobB(Runs hourly with an offset of 5 minutes)

In this scenario, JobB would check the metadata for JobA before it runs.

I think this path can be explored later.

adam-fowler commented 2 weeks ago
: [PostgresNIO] Trying to lease connection from `PostgresClient`, but `PostgresClient.run()` hasn't been called yet.

I suppose either get or set Metadata is being called before the service is ready.

Yes that is correct. I have a fix for this. See https://github.com/hummingbird-project/swift-jobs/pull/5#discussion_r1730418379

Instead of queuing a job within another job, say JobB depends on JobA being completed.

We could define such requirements via configuration, as such:

I was just saying what was possible now.

thoven87 commented 2 weeks ago
: [PostgresNIO] Trying to lease connection from `PostgresClient`, but `PostgresClient.run()` hasn't been called yet.

I suppose either get or set Metadata is being called before the service is ready.

Yes that is correct. I have a fix for this. See #5 (comment)

I saw that, thank you!

Instead of queuing a job within another job, say JobB depends on JobA being completed. We could define such requirements via configuration, as such:

I was just saying what was possible now.

I suppose I am looking in the wrong places. I do not see any options to configure the following: Job1 should be done before Job2 runs, or Job2 requires the hourly|daily|monthly partition of Job1 to be ready before kicking off.

adam-fowler commented 2 weeks ago

I meant you could do the following:

jobQueue.registerJob { (params: Job1Parameters, _) in
    // do job1
    ...
    // queue job2
    try await jobQueue.push(job2Parameters)
}

There isn't any other support for dependencies currently.