nguyencaothangtp / google-crawler

Small demo project
0 stars 0 forks source link

Why two tables to manage jobs? #17

Open olivierobert opened 4 years ago

olivierobert commented 4 years ago

There are two tables to manage jobs: jobs and failed_jobs. What are the benefits for the separation?

nguyencaothangtp commented 4 years ago

To store jobs, i could easily use one table 'jobs' and 'status' (success/failed) column to indicate whether a job runs successfully or not. But the problem with that design is after quite some time, the failed jobs is piled up months after months with unprocessed jobs (with failed status) and could reduce the query performance.

In the current design (2 tables), when there is a background job to be run, a record will be inserted in 'jobs' table. As long as the job completed (failed or succeeded), that record will be removed from the 'jobs' table. If it fails and has exceeded a number amount of attempts (3 attempts configured in this case) , the record will be inserted into the failed_jobs database table. The query worker will have an easy time and faster query performance when looking for new jobs to run.

olivierobert commented 4 years ago

Don't you think that having a database index on the column status (which stores "success/failed") would negate the performance issues you are describing? Also, what if there is an error when the successful job is removed and copied over to the other failed job table? This job could be lost forever.

Similarly, for auditing purposes, it's better to store all jobs forever (successful ones too with the datetime when it was completed). But there can be a process to prune old jobs (> 2 weeks for instances) with soft-delete (using a column deleted_at).

nguyencaothangtp commented 4 years ago

I agree that if it is for auditing purposes then it's would be better to store more information like completed_at, deleted_at, may be retried_at .. and make sense to merge them into one single table