oban-bg / oban

💎 Robust job processing in Elixir, backed by modern PostgreSQL and SQLite3
https://oban.pro
Apache License 2.0

Jobs stuck in "available" state #493

Closed 0x6a68 closed 3 years ago

0x6a68 commented 3 years ago

Environment

Current Behavior

Jobs are stuck in the available state; only a manual drain_queue/2 moves these jobs.

DEBUG

SELECT worker,attempt,scheduled_at,inserted_at FROM oban_jobs WHERE state = 'available';

               worker                 | attempt |        scheduled_at        |        inserted_at
--------------------------------------+---------+----------------------------+----------------------------
 MyApp.Workers.DeliverLaterMailWorker |       0 | 2021-05-28 03:00:02.544466 | 2021-05-28 03:00:02.544466
SELECT queue,meta FROM oban_producers;

        queue         |                                                                                meta
----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 mailers              | {"paused": false, "rate_limit": {"period": 1, "allowed": 40, "curr_count": 20, "prev_count": 0, "window_time": "21:27:11"}, "local_limit": 10, "global_limit": 10}
 mailers              | {"paused": false, "rate_limit": {"period": 1, "allowed": 40, "curr_count": 20, "prev_count": 0, "window_time": "21:27:11"}, "local_limit": 10, "global_limit": 10}
jaimeiniesta commented 3 years ago

I had this problem too in Oban 2.6; it looked like it was caused by cancel_job, which I use a lot. But I believe this was already fixed in 2.7?

sorentwo commented 3 years ago

@0x6a68 The oban_producers output shows that you're using rate limiting. The curr_count for each node is 20, and the allowed limit is 40. Based on that, I'd expect that it wasn't running more jobs at that particular moment.
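
For reference, this is roughly the Smart Engine queue configuration implied by that meta (a sketch reconstructed from the values above rather than your actual config; the app and repo names are placeholders):

config :my_app, Oban,
  engine: Oban.Pro.Queue.SmartEngine,
  repo: MyApp.Repo,
  queues: [
    mailers: [
      local_limit: 10,
      global_limit: 10,
      # period is in seconds, matching the "period": 1 shown in the meta
      rate_limit: [allowed: 40, period: 1]
    ]
  ]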

Is this an ongoing problem? Do jobs run initially and then stop?

angelikatyborska commented 3 years ago

Hello! I'm on the same project as @0x6a68.

Is this an ongoing problem?

Yes, it keeps happening. We're migrating our email sending from Rihanna to Oban right now. Every morning we send out a whole batch of emails from our staging environment, and almost every day somebody doesn't receive all of them.

Do jobs run initially and then stop?

Yes, a few jobs in the queue do get executed and then the queue stops for no apparent reason. For example, yesterday when this happened, the queue only got unstuck after the application restarted itself during a deployment.

This morning, for example, our "mailers" queue is stuck with 112 jobs available and none executing.

[Screenshot from 2021-05-29 at 07:32:41]

Debug queries:

# select id, state, attempt, max_attempts, errors, inserted_at, scheduled_at, attempted_at, completed_at, discarded_at from oban_jobs where queue='mailers' and state='available';
#
# query run around 2021-05-29 05:30:00
# 

  id  |   state   | attempt | max_attempts | errors |        inserted_at         |        scheduled_at        | attempted_at | completed_at | discarded_at
------+-----------+---------+--------------+--------+----------------------------+----------------------------+--------------+--------------+--------------
 1899 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.387879 | 2021-05-29 03:00:03.387879 |              |              |
 1923 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.645091 | 2021-05-29 03:00:03.645091 |              |              |
 1844 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.86531  | 2021-05-29 03:00:02.86531  |              |              |
 1892 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.385977 | 2021-05-29 03:00:03.385977 |              |              |
 1905 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.482801 | 2021-05-29 03:00:03.482801 |              |              |
 1853 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.893853 | 2021-05-29 03:00:02.893853 |              |              |
 1893 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.400265 | 2021-05-29 03:00:03.400265 |              |              |
 1906 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.501134 | 2021-05-29 03:00:03.501134 |              |              |
 1925 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.661465 | 2021-05-29 03:00:03.661465 |              |              |
 1841 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.803508 | 2021-05-29 03:00:02.803508 |              |              |
 1858 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.007793 | 2021-05-29 03:00:03.007793 |              |              |
 1884 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.233995 | 2021-05-29 03:00:03.233995 |              |              |
 1912 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.529712 | 2021-05-29 03:00:03.529712 |              |              |
 1840 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.729769 | 2021-05-29 03:00:02.729769 |              |              |
 1876 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.084729 | 2021-05-29 03:00:03.084729 |              |              |
 1894 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.392712 | 2021-05-29 03:00:03.392712 |              |              |
 1854 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.888188 | 2021-05-29 03:00:02.888188 |              |              |
 1859 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.007753 | 2021-05-29 03:00:03.007753 |              |              |
 1909 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.522436 | 2021-05-29 03:00:03.522436 |              |              |
 1934 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.780278 | 2021-05-29 03:00:03.780278 |              |              |
 1867 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.00419  | 2021-05-29 03:00:03.00419  |              |              |
 1926 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.746323 | 2021-05-29 03:00:03.746323 |              |              |
 1939 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.803615 | 2021-05-29 03:00:03.803615 |              |              |
 1868 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.011569 | 2021-05-29 03:00:03.011569 |              |              |
 1890 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.255833 | 2021-05-29 03:00:03.255833 |              |              |
 1938 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.792717 | 2021-05-29 03:00:03.792717 |              |              |
 1839 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.816816 | 2021-05-29 03:00:02.816816 |              |              |
 1864 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.004659 | 2021-05-29 03:00:03.004659 |              |              |
 1887 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.242201 | 2021-05-29 03:00:03.242201 |              |              |
 1889 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.307856 | 2021-05-29 03:00:03.307856 |              |              |
 1935 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.767265 | 2021-05-29 03:00:03.767265 |              |              |
 1835 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.811579 | 2021-05-29 03:00:02.811579 |              |              |
 1855 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.908493 | 2021-05-29 03:00:02.908493 |              |              |
 1862 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.005065 | 2021-05-29 03:00:03.005065 |              |              |
 1891 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.398024 | 2021-05-29 03:00:03.398024 |              |              |
 1910 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.525756 | 2021-05-29 03:00:03.525756 |              |              |
 1928 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.739964 | 2021-05-29 03:00:03.739964 |              |              |
 1831 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.802003 | 2021-05-29 03:00:02.802003 |              |              |
 1866 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.010031 | 2021-05-29 03:00:03.010031 |              |              |
 1907 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.521986 | 2021-05-29 03:00:03.521986 |              |              |
 1927 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.747317 | 2021-05-29 03:00:03.747317 |              |              |
 1863 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.008401 | 2021-05-29 03:00:03.008401 |              |              |
 1931 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.746731 | 2021-05-29 03:00:03.746731 |              |              |
 1832 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.806329 | 2021-05-29 03:00:02.806329 |              |              |
 1860 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.004569 | 2021-05-29 03:00:03.004569 |              |              |
 1883 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.21724  | 2021-05-29 03:00:03.21724  |              |              |
 1936 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.789209 | 2021-05-29 03:00:03.789209 |              |              |
 1875 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.028315 | 2021-05-29 03:00:03.028315 |              |              |
 1886 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.250968 | 2021-05-29 03:00:03.250968 |              |              |
 1911 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.528702 | 2021-05-29 03:00:03.528702 |              |              |
 1930 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.741011 | 2021-05-29 03:00:03.741011 |              |              |
 1856 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.996979 | 2021-05-29 03:00:02.996979 |              |              |
 1885 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.216847 | 2021-05-29 03:00:03.216847 |              |              |
 1932 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.758266 | 2021-05-29 03:00:03.758266 |              |              |
 1857 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.003571 | 2021-05-29 03:00:03.003571 |              |              |
 1908 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.522313 | 2021-05-29 03:00:03.522313 |              |              |
 1933 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.744056 | 2021-05-29 03:00:03.744056 |              |              |
 1824 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.707833 | 2021-05-29 03:00:02.707833 |              |              |
 1842 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.773292 | 2021-05-29 03:00:02.773292 |              |              |
 1873 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.068901 | 2021-05-29 03:00:03.068901 |              |              |
 1877 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.180253 | 2021-05-29 03:00:03.180253 |              |              |
 1901 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.411405 | 2021-05-29 03:00:03.411405 |              |              |
 1916 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.550459 | 2021-05-29 03:00:03.550459 |              |              |
 1845 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.884677 | 2021-05-29 03:00:02.884677 |              |              |
 1865 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.004595 | 2021-05-29 03:00:03.004595 |              |              |
 1882 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.19536  | 2021-05-29 03:00:03.19536  |              |              |
 1929 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.744759 | 2021-05-29 03:00:03.744759 |              |              |
 1830 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.730173 | 2021-05-29 03:00:02.730173 |              |              |
 1847 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.871844 | 2021-05-29 03:00:02.871844 |              |              |
 1879 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.183359 | 2021-05-29 03:00:03.183359 |              |              |
 1902 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.402377 | 2021-05-29 03:00:03.402377 |              |              |
 1913 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.544816 | 2021-05-29 03:00:03.544816 |              |              |
 1833 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.758248 | 2021-05-29 03:00:02.758248 |              |              |
 1871 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.060362 | 2021-05-29 03:00:03.060362 |              |              |
 1895 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.388636 | 2021-05-29 03:00:03.388636 |              |              |
 1917 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.583812 | 2021-05-29 03:00:03.583812 |              |              |
 1843 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.775372 | 2021-05-29 03:00:02.775372 |              |              |
 1869 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.047597 | 2021-05-29 03:00:03.047597 |              |              |
 1900 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.387556 | 2021-05-29 03:00:03.387556 |              |              |
 1921 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.64075  | 2021-05-29 03:00:03.64075  |              |              |
 1838 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.753987 | 2021-05-29 03:00:02.753987 |              |              |
 1874 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.061616 | 2021-05-29 03:00:03.061616 |              |              |
 1878 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.182445 | 2021-05-29 03:00:03.182445 |              |              |
 1896 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.394886 | 2021-05-29 03:00:03.394886 |              |              |
 1915 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.552317 | 2021-05-29 03:00:03.552317 |              |              |
 1820 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.689917 | 2021-05-29 03:00:02.689917 |              |              |
 1834 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.808906 | 2021-05-29 03:00:02.808906 |              |              |
 1851 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.892257 | 2021-05-29 03:00:02.892257 |              |              |
 1861 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.002391 | 2021-05-29 03:00:03.002391 |              |              |
 1881 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.195865 | 2021-05-29 03:00:03.195865 |              |              |
 1937 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.785719 | 2021-05-29 03:00:03.785719 |              |              |
 1852 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.867351 | 2021-05-29 03:00:02.867351 |              |              |
 1904 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.461122 | 2021-05-29 03:00:03.461122 |              |              |
 1920 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.641171 | 2021-05-29 03:00:03.641171 |              |              |
 1821 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.689906 | 2021-05-29 03:00:02.689906 |              |              |
 1880 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.166848 | 2021-05-29 03:00:03.166848 |              |              |
 1888 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.308632 | 2021-05-29 03:00:03.308632 |              |              |
 1837 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.753371 | 2021-05-29 03:00:02.753371 |              |              |
 1849 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.86261  | 2021-05-29 03:00:02.86261  |              |              |
 1872 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.07127  | 2021-05-29 03:00:03.07127  |              |              |
 1919 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.636488 | 2021-05-29 03:00:03.636488 |              |              |
 1850 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.87714  | 2021-05-29 03:00:02.87714  |              |              |
 1898 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.393738 | 2021-05-29 03:00:03.393738 |              |              |
 1903 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.437043 | 2021-05-29 03:00:03.437043 |              |              |
 1922 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.643544 | 2021-05-29 03:00:03.643544 |              |              |
 1848 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.869857 | 2021-05-29 03:00:02.869857 |              |              |
 1918 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.576678 | 2021-05-29 03:00:03.576678 |              |              |
 1846 | available |       0 |            3 | {}     | 2021-05-29 03:00:02.858468 | 2021-05-29 03:00:02.858468 |              |              |
 1870 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.062113 | 2021-05-29 03:00:03.062113 |              |              |
 1897 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.412495 | 2021-05-29 03:00:03.412495 |              |              |
 1914 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.542791 | 2021-05-29 03:00:03.542791 |              |              |
 1924 | available |       0 |            3 | {}     | 2021-05-29 03:00:03.64091  | 2021-05-29 03:00:03.64091  |              |              |
(112 rows)
# select * from oban_producers where queue='mailers';

                 uuid                 | name |            node            |  queue  |                                                                                meta                                                                                | running_ids |         started_at         |         updated_at
--------------------------------------+------+----------------------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+----------------------------+----------------------------
 f89ac10f-dcf6-4871-97d3-79abc95e3d1e | Oban | ypsilon-staging@app02-prod | mailers | {"paused": false, "rate_limit": {"period": 1, "allowed": 40, "curr_count": 24, "prev_count": 0, "window_time": "19:50:06"}, "local_limit": 10, "global_limit": 10} | {}          | 2021-05-28 13:37:31.459809 | 2021-05-29 05:30:21.840363
 8d41bfe7-9efb-4835-bf9b-147664f84760 | Oban | ypsilon-staging@app01-prod | mailers | {"paused": false, "rate_limit": {"period": 1, "allowed": 40, "curr_count": 16, "prev_count": 0, "window_time": "19:50:06"}, "local_limit": 10, "global_limit": 10} | {}          | 2021-05-28 13:36:34.792854 | 2021-05-29 05:30:21.847764
(2 rows)
sorentwo commented 3 years ago

@0x6a68 @angelikatyborska Thanks for the rate limit details. That proved to be the key! The issue stemmed from an incorrect window calculation when the queue was inactive for a while, particularly when the inactive period spanned midnight.

Expect a new patch release soon 💛

jaimeiniesta commented 3 years ago

I had stuck jobs in available state again today. Running on:

Oban v2.7.1, Oban.Web v2.7.0, Oban.Pro v0.8.0, Elixir 1.12.1

After restarting the app, jobs were picked up.

jaimeiniesta commented 3 years ago

Stuck again. This is how the producers table is now, with 922 stuck jobs on the same queue (3 workers).

select * from oban_producers where queue='page_creation';
                 uuid                 | name | node  |     queue     |                meta                 | running_ids |         started_at         |        updated_at
--------------------------------------+------+-------+---------------+-------------------------------------+-------------+----------------------------+---------------------------
 6b7a6400-4042-488f-b4da-501d2a260260 | Oban | web.1 | page_creation | {"paused": false, "local_limit": 3} | {}          | 2021-06-01 12:13:27.106366 | 2021-06-01 12:38:29.03955

In my case the queue hasn't been inactive; there was a lot of movement going on there.

  use Oban.Worker,
    queue: "page_creation",
    max_attempts: 25,
    unique: [
      fields: [:args, :worker],
      keys: [:sitemap_id, :url],
      states: [:scheduled, :available, :executing, :retryable],
      period: :infinity
    ]

Restarting the server makes it pick up the jobs.

0x6a68 commented 3 years ago

Yeah, same here :)

The patch hasn't been released yet.

jaimeiniesta commented 3 years ago

OK, in that case I'll put a monitor in place to check for stuck jobs and run drain_queue if that situation is detected (see the sketch below)...
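
Something along these lines could work as that monitor. This is only a sketch of the idea, not an Oban feature: the module name, repo, queue, and thresholds are all made up for illustration, and drain_queue runs the jobs inline in the calling process.

defmodule MyApp.StuckJobMonitor do
  @moduledoc "Periodically drains a queue when jobs sit in the available state too long."
  use GenServer

  import Ecto.Query

  @queue :page_creation
  @check_every :timer.minutes(5)
  @stuck_after_seconds 10 * 60

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl GenServer
  def init(state) do
    schedule_check()
    {:ok, state}
  end

  @impl GenServer
  def handle_info(:check, state) do
    cutoff = DateTime.add(DateTime.utc_now(), -@stuck_after_seconds)

    stuck_count =
      MyApp.Repo.aggregate(
        from(j in Oban.Job,
          where: j.queue == ^to_string(@queue),
          where: j.state == "available",
          where: j.scheduled_at < ^cutoff
        ),
        :count
      )

    # drain_queue runs the stuck jobs synchronously in this process,
    # which is acceptable as a stopgap until the underlying bug is fixed.
    if stuck_count > 0, do: Oban.drain_queue(queue: @queue)

    schedule_check()
    {:noreply, state}
  end

  defp schedule_check, do: Process.send_after(self(), :check, @check_every)
end

It would need to be added to the application's supervision tree alongside Oban.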

0x6a68 commented 3 years ago

In our case we used Oban.Plugins.Repeater: https://hexdocs.pm/oban/Oban.Plugins.Repeater.html
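
Roughly, that just means adding the plugin to the Oban config (a sketch; the app, repo, and queue names here are placeholders). The Repeater periodically tells the queue producers to fetch available jobs, which works around missed insert notifications:

config :my_app, Oban,
  repo: MyApp.Repo,
  plugins: [Oban.Plugins.Repeater],
  queues: [mailers: 10]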

jaimeiniesta commented 3 years ago

Thanks, I tried the Repeater, but it did not prevent jobs from getting stuck again.

Just updated to Oban Web 2.7.1 and Oban Pro 0.8.1, I hope that fixes the issue 🤞

janwillemvd commented 2 years ago

We ran into the same issue last night (and the night before).

We're using: Oban 2.9.2, Oban Pro 0.9.2 and Oban Web 2.7.4.

Should this issue be fixed now or is this still an open issue?

sorentwo commented 2 years ago

Should this issue be fixed now or is this still an open issue?

This isn't still an open issue, as far as I've heard. If you share more about your config, which queue is stuck (one or all), etc., then I can diagnose the issue and possibly give some advice.

matthijsqd commented 2 years ago

@sorentwo I am @janwillemvd's colleague.

Config:

[
  engine: Oban.Pro.Queue.SmartEngine,
  repo: MyApp.Repo,
  plugins: [
    Oban.Plugins.Gossip,
    {Oban.Pro.Plugins.DynamicPruner, queue_overrides: [my_queue: {:max_age, {2, :weeks}}]},
    Oban.Pro.Plugins.Lifeline,
    Oban.Web.Plugins.Stats
  ],
  queues: [
    my_queue: [
      global_limit: 10,
      rate_limit: [allowed: 1, period: {1, :second}]
    ]
  ]
]

Worker:

...

use Oban.Worker,
    queue: :my_queue,
    max_attempts: 3,
    unique: [states: [:available, :scheduled, :retryable]]

  # Custom backoff: attempt^4 seconds plus random jitter proportional to the attempt.
  @impl true
  def backoff(%Oban.Job{attempt: attempt}) do
    trunc(:math.pow(attempt, 4) + :rand.uniform(20) * attempt)
  end

...

We restored the database to the point in time just before we paused and resumed Oban. At that time, the oban_producers table was empty.

sorentwo commented 2 years ago

@janwillemvd @matthijsqd Do you have a record of any errors, e.g. the producer crashing?

janwillemvd commented 2 years ago

Hi @sorentwo, we've checked the logs. Apart from some job errors, we see no errors of the producer crashing.

sorentwo commented 2 years ago

Without more details, I don't have much to go off of. A few more questions:

Any other details will really help me to diagnose and advise.

matthijsqd commented 2 years ago

We've only just deployed this project on 2022-01-04; since then it has happened twice:

There is only one queue in this project. We have since disabled the rate limiting on this queue and it has not occurred since.

matthijsqd commented 2 years ago

Also, is it normal that the oban_producers table gets cleared? If so, when does that happen?

sorentwo commented 2 years ago

There is only one queue in this project. We have since disabled the rate-limiting on this queue, and it has not occurred since.

Ok, that seems telling. My hunch is that the producer crashed during a rate-limit update. I'll investigate and try to recreate that situation.

Also, is it normal that the oban_producers table gets cleared? If so, when does that happen?

Yes, that's entirely normal. Producer records are ephemeral and only need to live as long as the queue is alive, so any old producer that hasn't refreshed (pinged) recently is a candidate for deletion.

janwillemvd commented 2 years ago

My hunch is that the producer crashed during a rate-limit update. I'll investigate and try to recreate that situation.

Thanks @sorentwo, please let us know if we can help or you want us to test something.

kenzhengguan commented 2 years ago

@janwillemvd I had the same issue and just figured out why. I ran Oban.check_queue/1 and got this error:

** (exit) exited in: GenServer.call({:via, Registry, {Oban.Registry, {Oban, {:producer, "default"}}}}, :check, 5000)
    ** (EXIT) an exception was raised:
        ** (Postgrex.Error) ERROR 23514 (check_violation) new row for relation "oban_jobs" violates check constraint "attempt_range"

    table: oban_jobs
    constraint: attempt_range 

Failing row contains (2208032, default, App.Jobs.SyncInvoiceJob, {"invoice_id": "<invoice_id>"}, {"{\"at\": \"2022-02-23T02:38:22.859813Z\", \"error\": \"** (Und..., 3, 2, 2022-02-23 02:38:20.344931, 2022-02-23 02:38:38.859775, 2022-02-23 07:54:20.42105, null, {app@127.0.0.1,6eacdd69-adc6-470a-ab4f-75161237d360}, 2022-02-23 02:38:39.350624, 0, {}, {}, null, executing).

Then I checked this record (2208032): attempt and max_attempts are both 2, and the state is available. But it should be discarded or completed, not available. I manually updated the state to discarded (see the example below) and it worked.
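
For reference, the manual fix was something like this one-off (a hypothetical example using the job id from the error above; MyApp.Repo stands in for the application's repo):

Ecto.Adapters.SQL.query!(
  MyApp.Repo,
  "UPDATE oban_jobs SET state = 'discarded', discarded_at = now() WHERE id = $1",
  [2208032]
)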

Hope this helps!

janwillemvd commented 2 years ago

Thanks @kenzhengguan, but I think your example differs from my issue.

@sorentwo Did you maybe find some time to investigate this issue further? Can we help with anything?

sorentwo commented 2 years ago

@janwillemvd Sorry, I missed your previous comment. Since this conversation I diagnosed a sneaky cause of stuck jobs. It came down to a lingering transactional advisory lock that stopped any job availability announcements. In most cases a redeploy fixed the lock, and then jobs would run again.

In v2.11 the stager doesn’t use transactional locks at all. Have you tried running that version?

If jobs still don't run for you with rate limiting enabled, I'll ask for more info and try to replicate it.

janwillemvd commented 2 years ago

No worries @sorentwo, thanks!

We'll upgrade to v2.11 in the upcoming weeks and I'll definitely let you know whether this solved our problem (of course, we'll enable the rate limiting again in v2.11).

mikeover commented 2 years ago

I'm seeing this same issue with Oban v2.12.0 and Oban.Web v2.9.1. We have a few daily jobs that kick off at the same time each day; for the past 4 straight days they have gone into the available state and just sat there (nothing executing). Running drain_queue on them works fine and they run successfully (as does restarting the server), but otherwise they seem to just be stuck. We're not using any kind of rate limiting.

Config:

config :my_app, Oban,
  repo: MyApp.Repo,
  plugins: [
    # retain cancelled, completed, and discarded jobs for 60 days (in seconds)
    {Oban.Plugins.Pruner, max_age: 60 * 24 * 60 * 60},
    {Oban.Plugins.Cron,
     crontab: [
       {"0 6 * * *", MyWorker1,  queue: :first_queue},
       {"0 6 * * *", MyWorker2,  queue: :second_queue},
       {"0 6 * * *", MyWorker3,  queue: :third_queue}
     ],
     timezone: "America/Los_Angeles"}
  ],
  queues: [
    first_queue: 1,
    second_queue: 1,
    third_queue: 1
  ]

Worker config (each of the 3 worker types is very similar except for the queue name):

  use Oban.Worker,
    queue: :first_queue,
    max_attempts: 5,
    unique: [period: :infinity, states: [:available, :scheduled, :executing, :retryable]]

  @backoff_seconds 60 * 60

I ran Oban.check_queue on them and verified the queues were not paused. The jobs run successfully when I iex in and manually run Oban.drain_queue(queue: :first_queue), etc.

sorentwo commented 2 years ago

@mikeover

  1. How many nodes are you running?
  2. Do cron jobs, pruning, etc. work as expected?
  3. Are you using pgbouncer?
  4. Do you typically have more queue activity?
mikeover commented 2 years ago

@sorentwo Thanks for the quick response!

  1. Single node
  2. Cron jobs work and get put into the "available" state at the expected time. Pruner also works.
  3. No (at least I don't believe so)
  4. There's not a lot of queue activity: these daily jobs, some monthly jobs, and other low-frequency ad hoc jobs
kafaichoi commented 2 years ago

Hi @sorentwo. Thank you for this great library.

Mike's colleague here. We are actually running 2 Oban nodes, one for each released application (2 different applications in total). The first one is the core application (app A) that runs Oban with queues, while the other is more of an internal app (app B) where we also run Oban, but with queues disabled, plus Oban Web. (We don't want to run Oban Web in app A for security reasons, since we don't want to expose the Oban Web dashboard through app A.)

We have managed to "fix" the bug by setting peer: false in app B. I believe the issue we had previously (available jobs getting "stuck") is that the Stager was also running in app B, where queues: [] is set. (We cannot set plugins: false since we need 2 plugins to make Oban Web work.) So essentially both Stagers (from app A and app B) were competing with each other, and if the Stager from app B won, it would transition jobs to available, but no queue producer was notified since queues: false is set in app B on purpose.

So by setting peer: false in app B, the Stager there no longer does its work (searching for scheduled jobs that are ready to be transitioned to available), and thus we don't have the previous issue anymore.

However, I feel like it might not be the optimal solution and is a bit hacky. It would be awesome to hear your opinion on our approach.

sorentwo commented 2 years ago

@bruteforcecat Considering your setup, your solution isn't hacky. However, your analysis of how the stager works is slightly off. The stager plugin runs on any node that doesn't have plugins: false, but it will only stage and notify about jobs on the leader node. We'll add a section about this to the troubleshooting guide, since it bites people with more advanced setups like yours.

Also, the node running Oban Web only needs to run the Stats plugin if it isn't running any queues. The Gossip plugin only needs to run on a node that runs queues. The relationship is that Gossip broadcasts information about the queues and Stats receives it, essentially.
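
A minimal sketch of that split (the app and repo names are illustrative, not from your project): the queue-running node keeps Gossip, and the web-only node keeps only Stats.

# App A: runs the queues and broadcasts queue information via Gossip.
config :app_a, Oban,
  repo: AppA.Repo,
  plugins: [Oban.Plugins.Gossip],
  queues: [my_queue: 10]

# App B: mounts Oban Web, runs no queues, and only receives stats.
# AppB.Repo points at the same database as AppA.Repo.
config :app_b, Oban,
  repo: AppB.Repo,
  queues: false,
  plugins: [Oban.Web.Plugins.Stats]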

kafaichoi commented 2 years ago

@sorentwo Thank you for your explanation :) Unfortunately, we are still having the issue even when running only one node (app A), after unmounting oban_web from app B, and we are NOT using pgbouncer. We are now trying to switch the notifier to pg2 to see if the stuck job issue is resolved after that. Does that mean it's 100% related to Postgres notify?

Btw, we were having some random DB disconnection issues (before integrating with Oban). After briefly looking into the Oban.Notifier.Postgres source code, it seems the process should be terminated and restarted if a disconnection happens, so it shouldn't cause the stuck available-job issue. Or is there something I might have overlooked?

sorentwo commented 2 years ago

Does it means it's 100% related to postgres notify?

That seems most likely. It would be helpful if you could add instrumentation to verify that the stager is executing. Even some simple logging like this would do:

# Log each Stager run so we can confirm the plugin is executing.
require Logger

tracer = fn _event, _measure, meta, _conf ->
  if meta.plugin == Oban.Plugins.Stager do
    Logger.info(meta)
  end
end

:telemetry.attach("oban-plugins", [:oban, :plugin, :stop], tracer, [])

After briefly looking into Oban.Notifier.Postgres source code, the process should be terminated and restarted if disconnection happen so it shouldn't cause the stuck available job issue or is there sth that I might overlook?

That's correct. It will automatically reconnect after a random disconnection.

maynewong commented 2 years ago

@bruteforcecat

We also encountered the same problem.

By debugging Oban and Postgrex, we found that the problem was caused by a broken TCP connection on the Postgres LISTEN side.

Laurenz Albe wrote an article on using TCP keepalive to detect dead connections https://www.cybertec-postgresql.com/en/tcp-keepalive-for-a-better-postgresql-experience/

We modified the configuration along the lines of that article:

config :my_app, MyApp.Repo,
  parameters: [
    tcp_keepalives_idle: "60",
    tcp_keepalives_interval: "5",
    tcp_keepalives_count: "3"
  ],
  socket_options: [keepalive: true]

The production environment has been running for a few weeks now and has not encountered this problem again.

jotaviobiondo commented 2 years ago

Hey @sorentwo! First of all, thanks for the amazing work on this lib! It helps us a lot at the company :)

I'm having the same problem: our jobs become stuck in the available state after a while. Our setup:

We are running on umbrella applications, so it's more or less like this:

We are running the app only in one node (both core and importer are in the same node)

We were using Oban 2.7 and it was all running fine, but then we upgraded to version 2.13 and the jobs in the importer app get stuck after a while. When we restart the app, the jobs go back to being processed.

We did some debugging:

Any thoughts on what might be the problem here?

sorentwo commented 2 years ago

@jotaviobiondo It sounds like you're experiencing the broken TCP connection issue mentioned above. The stager is still running but can't broadcast job availability to the appropriate queues. That wasn't an issue back on Oban 2.7 because every queue polled for available jobs itself.

An open issue aims to remedy this problem automatically: https://github.com/sorentwo/oban/issues/769.

In the meantime, you have a couple of options:

  1. Tune the TCP keepalive as described above
  2. Use the PG notifier (only available if you're running with Distributed Erlang); a config sketch follows below
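
A sketch of option 2, assuming the nodes are clustered with Distributed Erlang (app and repo names are placeholders):

config :my_app, Oban,
  repo: MyApp.Repo,
  notifier: Oban.Notifiers.PG
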
jotaviobiondo commented 2 years ago

Thanks for the quick reply! Yes, that makes sense; we noticed that the queue stops when it doesn't process anything for a while.

We will use the keepalive solution for now, thanks for the help!