I had this problem too in Oban 2.6; it looked like it was caused by cancel_job, which I use a lot. But I believe this was already fixed in 2.7?
@0x6a68 The oban_producers output shows that you're using rate limiting. The curr_count for each node is 20, and the allowed limit is 40. Based on that, I'd expect that it wasn't running more jobs at that particular moment.
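For reference, queue settings along these lines would produce that meta. This is only a hedged sketch: the numbers are read from the output above, and rate limiting requires Oban Pro's SmartEngine.

queues: [
  mailers: [
    # ten jobs per node, ten across the cluster
    local_limit: 10,
    global_limit: 10,
    # at most 40 jobs per one-second window
    rate_limit: [allowed: 40, period: {1, :second}]
  ]
]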
Is this an ongoing problem? Do jobs run initially and then stop?
Hello! I'm in the same project as @0x6a68
Is this an ongoing problem?
Yes, it keeps happening. We're migrating our email sending from Rihanna to Oban right now. Every morning we send out a whole bunch of emails from our staging environment, and almost every day somebody doesn't receive all of the emails.
Do jobs run initially and then stop?
Yes, a few jobs in the queue do get executed and then the queue stops for no apparent reason. For example, yesterday when this happened, the queue got unstuck when the application restarted itself after a deployment.
This morning, for example, our "mailers" queue is stuck with 112 jobs available and none executing.
Debug queries:
# select id, state, attempt, max_attempts, errors, inserted_at, scheduled_at, attempted_at, completed_at, discarded_at from oban_jobs where queue='mailers' and state='available';
#
# query run around 2021-05-29 05:30:00
#
id | state | attempt | max_attempts | errors | inserted_at | scheduled_at | attempted_at | completed_at | discarded_at
------+-----------+---------+--------------+--------+----------------------------+----------------------------+--------------+--------------+--------------
1899 | available | 0 | 3 | {} | 2021-05-29 03:00:03.387879 | 2021-05-29 03:00:03.387879 | | |
1923 | available | 0 | 3 | {} | 2021-05-29 03:00:03.645091 | 2021-05-29 03:00:03.645091 | | |
1844 | available | 0 | 3 | {} | 2021-05-29 03:00:02.86531 | 2021-05-29 03:00:02.86531 | | |
1892 | available | 0 | 3 | {} | 2021-05-29 03:00:03.385977 | 2021-05-29 03:00:03.385977 | | |
1905 | available | 0 | 3 | {} | 2021-05-29 03:00:03.482801 | 2021-05-29 03:00:03.482801 | | |
1853 | available | 0 | 3 | {} | 2021-05-29 03:00:02.893853 | 2021-05-29 03:00:02.893853 | | |
1893 | available | 0 | 3 | {} | 2021-05-29 03:00:03.400265 | 2021-05-29 03:00:03.400265 | | |
1906 | available | 0 | 3 | {} | 2021-05-29 03:00:03.501134 | 2021-05-29 03:00:03.501134 | | |
1925 | available | 0 | 3 | {} | 2021-05-29 03:00:03.661465 | 2021-05-29 03:00:03.661465 | | |
1841 | available | 0 | 3 | {} | 2021-05-29 03:00:02.803508 | 2021-05-29 03:00:02.803508 | | |
1858 | available | 0 | 3 | {} | 2021-05-29 03:00:03.007793 | 2021-05-29 03:00:03.007793 | | |
1884 | available | 0 | 3 | {} | 2021-05-29 03:00:03.233995 | 2021-05-29 03:00:03.233995 | | |
1912 | available | 0 | 3 | {} | 2021-05-29 03:00:03.529712 | 2021-05-29 03:00:03.529712 | | |
1840 | available | 0 | 3 | {} | 2021-05-29 03:00:02.729769 | 2021-05-29 03:00:02.729769 | | |
1876 | available | 0 | 3 | {} | 2021-05-29 03:00:03.084729 | 2021-05-29 03:00:03.084729 | | |
1894 | available | 0 | 3 | {} | 2021-05-29 03:00:03.392712 | 2021-05-29 03:00:03.392712 | | |
1854 | available | 0 | 3 | {} | 2021-05-29 03:00:02.888188 | 2021-05-29 03:00:02.888188 | | |
1859 | available | 0 | 3 | {} | 2021-05-29 03:00:03.007753 | 2021-05-29 03:00:03.007753 | | |
1909 | available | 0 | 3 | {} | 2021-05-29 03:00:03.522436 | 2021-05-29 03:00:03.522436 | | |
1934 | available | 0 | 3 | {} | 2021-05-29 03:00:03.780278 | 2021-05-29 03:00:03.780278 | | |
1867 | available | 0 | 3 | {} | 2021-05-29 03:00:03.00419 | 2021-05-29 03:00:03.00419 | | |
1926 | available | 0 | 3 | {} | 2021-05-29 03:00:03.746323 | 2021-05-29 03:00:03.746323 | | |
1939 | available | 0 | 3 | {} | 2021-05-29 03:00:03.803615 | 2021-05-29 03:00:03.803615 | | |
1868 | available | 0 | 3 | {} | 2021-05-29 03:00:03.011569 | 2021-05-29 03:00:03.011569 | | |
1890 | available | 0 | 3 | {} | 2021-05-29 03:00:03.255833 | 2021-05-29 03:00:03.255833 | | |
1938 | available | 0 | 3 | {} | 2021-05-29 03:00:03.792717 | 2021-05-29 03:00:03.792717 | | |
1839 | available | 0 | 3 | {} | 2021-05-29 03:00:02.816816 | 2021-05-29 03:00:02.816816 | | |
1864 | available | 0 | 3 | {} | 2021-05-29 03:00:03.004659 | 2021-05-29 03:00:03.004659 | | |
1887 | available | 0 | 3 | {} | 2021-05-29 03:00:03.242201 | 2021-05-29 03:00:03.242201 | | |
1889 | available | 0 | 3 | {} | 2021-05-29 03:00:03.307856 | 2021-05-29 03:00:03.307856 | | |
1935 | available | 0 | 3 | {} | 2021-05-29 03:00:03.767265 | 2021-05-29 03:00:03.767265 | | |
1835 | available | 0 | 3 | {} | 2021-05-29 03:00:02.811579 | 2021-05-29 03:00:02.811579 | | |
1855 | available | 0 | 3 | {} | 2021-05-29 03:00:02.908493 | 2021-05-29 03:00:02.908493 | | |
1862 | available | 0 | 3 | {} | 2021-05-29 03:00:03.005065 | 2021-05-29 03:00:03.005065 | | |
1891 | available | 0 | 3 | {} | 2021-05-29 03:00:03.398024 | 2021-05-29 03:00:03.398024 | | |
1910 | available | 0 | 3 | {} | 2021-05-29 03:00:03.525756 | 2021-05-29 03:00:03.525756 | | |
1928 | available | 0 | 3 | {} | 2021-05-29 03:00:03.739964 | 2021-05-29 03:00:03.739964 | | |
1831 | available | 0 | 3 | {} | 2021-05-29 03:00:02.802003 | 2021-05-29 03:00:02.802003 | | |
1866 | available | 0 | 3 | {} | 2021-05-29 03:00:03.010031 | 2021-05-29 03:00:03.010031 | | |
1907 | available | 0 | 3 | {} | 2021-05-29 03:00:03.521986 | 2021-05-29 03:00:03.521986 | | |
1927 | available | 0 | 3 | {} | 2021-05-29 03:00:03.747317 | 2021-05-29 03:00:03.747317 | | |
1863 | available | 0 | 3 | {} | 2021-05-29 03:00:03.008401 | 2021-05-29 03:00:03.008401 | | |
1931 | available | 0 | 3 | {} | 2021-05-29 03:00:03.746731 | 2021-05-29 03:00:03.746731 | | |
1832 | available | 0 | 3 | {} | 2021-05-29 03:00:02.806329 | 2021-05-29 03:00:02.806329 | | |
1860 | available | 0 | 3 | {} | 2021-05-29 03:00:03.004569 | 2021-05-29 03:00:03.004569 | | |
1883 | available | 0 | 3 | {} | 2021-05-29 03:00:03.21724 | 2021-05-29 03:00:03.21724 | | |
1936 | available | 0 | 3 | {} | 2021-05-29 03:00:03.789209 | 2021-05-29 03:00:03.789209 | | |
1875 | available | 0 | 3 | {} | 2021-05-29 03:00:03.028315 | 2021-05-29 03:00:03.028315 | | |
1886 | available | 0 | 3 | {} | 2021-05-29 03:00:03.250968 | 2021-05-29 03:00:03.250968 | | |
1911 | available | 0 | 3 | {} | 2021-05-29 03:00:03.528702 | 2021-05-29 03:00:03.528702 | | |
1930 | available | 0 | 3 | {} | 2021-05-29 03:00:03.741011 | 2021-05-29 03:00:03.741011 | | |
1856 | available | 0 | 3 | {} | 2021-05-29 03:00:02.996979 | 2021-05-29 03:00:02.996979 | | |
1885 | available | 0 | 3 | {} | 2021-05-29 03:00:03.216847 | 2021-05-29 03:00:03.216847 | | |
1932 | available | 0 | 3 | {} | 2021-05-29 03:00:03.758266 | 2021-05-29 03:00:03.758266 | | |
1857 | available | 0 | 3 | {} | 2021-05-29 03:00:03.003571 | 2021-05-29 03:00:03.003571 | | |
1908 | available | 0 | 3 | {} | 2021-05-29 03:00:03.522313 | 2021-05-29 03:00:03.522313 | | |
1933 | available | 0 | 3 | {} | 2021-05-29 03:00:03.744056 | 2021-05-29 03:00:03.744056 | | |
1824 | available | 0 | 3 | {} | 2021-05-29 03:00:02.707833 | 2021-05-29 03:00:02.707833 | | |
1842 | available | 0 | 3 | {} | 2021-05-29 03:00:02.773292 | 2021-05-29 03:00:02.773292 | | |
1873 | available | 0 | 3 | {} | 2021-05-29 03:00:03.068901 | 2021-05-29 03:00:03.068901 | | |
1877 | available | 0 | 3 | {} | 2021-05-29 03:00:03.180253 | 2021-05-29 03:00:03.180253 | | |
1901 | available | 0 | 3 | {} | 2021-05-29 03:00:03.411405 | 2021-05-29 03:00:03.411405 | | |
1916 | available | 0 | 3 | {} | 2021-05-29 03:00:03.550459 | 2021-05-29 03:00:03.550459 | | |
1845 | available | 0 | 3 | {} | 2021-05-29 03:00:02.884677 | 2021-05-29 03:00:02.884677 | | |
1865 | available | 0 | 3 | {} | 2021-05-29 03:00:03.004595 | 2021-05-29 03:00:03.004595 | | |
1882 | available | 0 | 3 | {} | 2021-05-29 03:00:03.19536 | 2021-05-29 03:00:03.19536 | | |
1929 | available | 0 | 3 | {} | 2021-05-29 03:00:03.744759 | 2021-05-29 03:00:03.744759 | | |
1830 | available | 0 | 3 | {} | 2021-05-29 03:00:02.730173 | 2021-05-29 03:00:02.730173 | | |
1847 | available | 0 | 3 | {} | 2021-05-29 03:00:02.871844 | 2021-05-29 03:00:02.871844 | | |
1879 | available | 0 | 3 | {} | 2021-05-29 03:00:03.183359 | 2021-05-29 03:00:03.183359 | | |
1902 | available | 0 | 3 | {} | 2021-05-29 03:00:03.402377 | 2021-05-29 03:00:03.402377 | | |
1913 | available | 0 | 3 | {} | 2021-05-29 03:00:03.544816 | 2021-05-29 03:00:03.544816 | | |
1833 | available | 0 | 3 | {} | 2021-05-29 03:00:02.758248 | 2021-05-29 03:00:02.758248 | | |
1871 | available | 0 | 3 | {} | 2021-05-29 03:00:03.060362 | 2021-05-29 03:00:03.060362 | | |
1895 | available | 0 | 3 | {} | 2021-05-29 03:00:03.388636 | 2021-05-29 03:00:03.388636 | | |
1917 | available | 0 | 3 | {} | 2021-05-29 03:00:03.583812 | 2021-05-29 03:00:03.583812 | | |
1843 | available | 0 | 3 | {} | 2021-05-29 03:00:02.775372 | 2021-05-29 03:00:02.775372 | | |
1869 | available | 0 | 3 | {} | 2021-05-29 03:00:03.047597 | 2021-05-29 03:00:03.047597 | | |
1900 | available | 0 | 3 | {} | 2021-05-29 03:00:03.387556 | 2021-05-29 03:00:03.387556 | | |
1921 | available | 0 | 3 | {} | 2021-05-29 03:00:03.64075 | 2021-05-29 03:00:03.64075 | | |
1838 | available | 0 | 3 | {} | 2021-05-29 03:00:02.753987 | 2021-05-29 03:00:02.753987 | | |
1874 | available | 0 | 3 | {} | 2021-05-29 03:00:03.061616 | 2021-05-29 03:00:03.061616 | | |
1878 | available | 0 | 3 | {} | 2021-05-29 03:00:03.182445 | 2021-05-29 03:00:03.182445 | | |
1896 | available | 0 | 3 | {} | 2021-05-29 03:00:03.394886 | 2021-05-29 03:00:03.394886 | | |
1915 | available | 0 | 3 | {} | 2021-05-29 03:00:03.552317 | 2021-05-29 03:00:03.552317 | | |
1820 | available | 0 | 3 | {} | 2021-05-29 03:00:02.689917 | 2021-05-29 03:00:02.689917 | | |
1834 | available | 0 | 3 | {} | 2021-05-29 03:00:02.808906 | 2021-05-29 03:00:02.808906 | | |
1851 | available | 0 | 3 | {} | 2021-05-29 03:00:02.892257 | 2021-05-29 03:00:02.892257 | | |
1861 | available | 0 | 3 | {} | 2021-05-29 03:00:03.002391 | 2021-05-29 03:00:03.002391 | | |
1881 | available | 0 | 3 | {} | 2021-05-29 03:00:03.195865 | 2021-05-29 03:00:03.195865 | | |
1937 | available | 0 | 3 | {} | 2021-05-29 03:00:03.785719 | 2021-05-29 03:00:03.785719 | | |
1852 | available | 0 | 3 | {} | 2021-05-29 03:00:02.867351 | 2021-05-29 03:00:02.867351 | | |
1904 | available | 0 | 3 | {} | 2021-05-29 03:00:03.461122 | 2021-05-29 03:00:03.461122 | | |
1920 | available | 0 | 3 | {} | 2021-05-29 03:00:03.641171 | 2021-05-29 03:00:03.641171 | | |
1821 | available | 0 | 3 | {} | 2021-05-29 03:00:02.689906 | 2021-05-29 03:00:02.689906 | | |
1880 | available | 0 | 3 | {} | 2021-05-29 03:00:03.166848 | 2021-05-29 03:00:03.166848 | | |
1888 | available | 0 | 3 | {} | 2021-05-29 03:00:03.308632 | 2021-05-29 03:00:03.308632 | | |
1837 | available | 0 | 3 | {} | 2021-05-29 03:00:02.753371 | 2021-05-29 03:00:02.753371 | | |
1849 | available | 0 | 3 | {} | 2021-05-29 03:00:02.86261 | 2021-05-29 03:00:02.86261 | | |
1872 | available | 0 | 3 | {} | 2021-05-29 03:00:03.07127 | 2021-05-29 03:00:03.07127 | | |
1919 | available | 0 | 3 | {} | 2021-05-29 03:00:03.636488 | 2021-05-29 03:00:03.636488 | | |
1850 | available | 0 | 3 | {} | 2021-05-29 03:00:02.87714 | 2021-05-29 03:00:02.87714 | | |
1898 | available | 0 | 3 | {} | 2021-05-29 03:00:03.393738 | 2021-05-29 03:00:03.393738 | | |
1903 | available | 0 | 3 | {} | 2021-05-29 03:00:03.437043 | 2021-05-29 03:00:03.437043 | | |
1922 | available | 0 | 3 | {} | 2021-05-29 03:00:03.643544 | 2021-05-29 03:00:03.643544 | | |
1848 | available | 0 | 3 | {} | 2021-05-29 03:00:02.869857 | 2021-05-29 03:00:02.869857 | | |
1918 | available | 0 | 3 | {} | 2021-05-29 03:00:03.576678 | 2021-05-29 03:00:03.576678 | | |
1846 | available | 0 | 3 | {} | 2021-05-29 03:00:02.858468 | 2021-05-29 03:00:02.858468 | | |
1870 | available | 0 | 3 | {} | 2021-05-29 03:00:03.062113 | 2021-05-29 03:00:03.062113 | | |
1897 | available | 0 | 3 | {} | 2021-05-29 03:00:03.412495 | 2021-05-29 03:00:03.412495 | | |
1914 | available | 0 | 3 | {} | 2021-05-29 03:00:03.542791 | 2021-05-29 03:00:03.542791 | | |
1924 | available | 0 | 3 | {} | 2021-05-29 03:00:03.64091 | 2021-05-29 03:00:03.64091 | | |
(112 rows)
# select * from oban_producers where queue='mailers';
uuid | name | node | queue | meta | running_ids | started_at | updated_at
--------------------------------------+------+----------------------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+----------------------------+----------------------------
f89ac10f-dcf6-4871-97d3-79abc95e3d1e | Oban | ypsilon-staging@app02-prod | mailers | {"paused": false, "rate_limit": {"period": 1, "allowed": 40, "curr_count": 24, "prev_count": 0, "window_time": "19:50:06"}, "local_limit": 10, "global_limit": 10} | {} | 2021-05-28 13:37:31.459809 | 2021-05-29 05:30:21.840363
8d41bfe7-9efb-4835-bf9b-147664f84760 | Oban | ypsilon-staging@app01-prod | mailers | {"paused": false, "rate_limit": {"period": 1, "allowed": 40, "curr_count": 16, "prev_count": 0, "window_time": "19:50:06"}, "local_limit": 10, "global_limit": 10} | {} | 2021-05-28 13:36:34.792854 | 2021-05-29 05:30:21.847764
(2 rows)
@0x6a68 @angelikatyborska Thanks for the rate limit details. That proved to be the key! The issue stemmed from an incorrect window calculation when the queue was inactive for a while, particularly when the inactivity spanned midnight.
Expect a new patch release soon 💛
I had stuck jobs in available state again today. Running on:
Oban v2.7.1, Oban.Web v2.7.0, Oban.Pro v0.8.0, Elixir 1.12.1
After restarting the app, jobs were picked up.
Stuck again. This is how the producers table is now, with 922 stuck jobs on the same queue (3 workers).
select * from oban_producers where queue='page_creation';
uuid | name | node | queue | meta | running_ids | started_at | updated_at
--------------------------------------+------+-------+---------------+-------------------------------------+-------------+----------------------------+---------------------------
6b7a6400-4042-488f-b4da-501d2a260260 | Oban | web.1 | page_creation | {"paused": false, "local_limit": 3} | {} | 2021-06-01 12:13:27.106366 | 2021-06-01 12:38:29.03955
In my case the queue hasn't been inactive, there was a lot of movement going on there.
use Oban.Worker,
queue: "page_creation",
max_attempts: 25,
unique: [
fields: [:args, :worker],
keys: [:sitemap_id, :url],
states: [:scheduled, :available, :executing, :retryable],
period: :infinity
]
Restarting the server makes it pick up the jobs.
Yeah, same here :)
The patch hasn't been released yet.
OK, in that case I'll put a monitor in place to check for stuck jobs and run drain_queue if that situation is detected...
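For what it's worth, a minimal sketch of such a watchdog, assuming a GenServer that polls every minute; the repo, queue name, and threshold are placeholders, and drain_queue is primarily meant for testing, so treat this as an emergency workaround rather than a fix:

defmodule MyApp.StuckJobMonitor do
  use GenServer

  import Ecto.Query

  @check_interval :timer.minutes(1)
  # Treat jobs as stuck once they have been available for longer than this.
  @stuck_threshold_seconds 300

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(state) do
    schedule_check()
    {:ok, state}
  end

  @impl true
  def handle_info(:check, state) do
    if stuck_jobs?("mailers") do
      # Forcefully run whatever is sitting in the queue.
      Oban.drain_queue(queue: :mailers)
    end

    schedule_check()
    {:noreply, state}
  end

  defp stuck_jobs?(queue) do
    cutoff = DateTime.add(DateTime.utc_now(), -@stuck_threshold_seconds, :second)

    Oban.Job
    |> where([j], j.queue == ^queue and j.state == "available")
    |> where([j], j.scheduled_at < ^cutoff)
    |> MyApp.Repo.exists?()
  end

  defp schedule_check, do: Process.send_after(self(), :check, @check_interval)
end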
In our case we used the Oban.Plugins.Repeater plugin: https://hexdocs.pm/oban/Oban.Plugins.Repeater.html
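Enabling it only takes one extra entry in the plugin list. A hedged sketch, with placeholder app, repo, and queue names:

config :my_app, Oban,
  repo: MyApp.Repo,
  plugins: [
    # Repeater periodically re-broadcasts "insert" notifications so producers
    # still fetch available jobs when PubSub notifications go missing.
    Oban.Plugins.Repeater
  ],
  queues: [mailers: 10]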
Thanks, I tried the Repeater but it did not prevent jobs from getting stuck again.
Just updated to Oban Web 2.7.1 and Oban Pro 0.8.1, I hope that fixes the issue 🤞
We ran into the same issue last night (and the night before).
We're using: Oban 2.9.2, Oban Pro 0.9.2 and Oban Web 2.7.4.
Should this issue be fixed now or is this still an open issue?
Should this issue be fixed now or is this still an open issue?
As far as I've heard, this is no longer an open issue. If you share more about your config, which queue is stuck (or whether all of them are), etc., then I can diagnose the issue and possibly give some advice.
@sorentwo I am @janwillemvd's colleague.
Config:
[
engine: Oban.Pro.Queue.SmartEngine,
repo: MyApp.Repo,
plugins: [
Oban.Plugins.Gossip,
{Oban.Pro.Plugins.DynamicPruner, queue_overrides: [my_queue: {:max_age, {2, :weeks}}]},
Oban.Pro.Plugins.Lifeline,
Oban.Web.Plugins.Stats
],
queues: [
my_queue: [
global_limit: 10,
rate_limit: [allowed: 1, period: {1, :second}]
]
]
]
Worker:
...
use Oban.Worker,
queue: :my_queue,
max_attempts: 3,
unique: [states: [:available, :scheduled, :retryable]]
@impl true
def backoff(%Oban.Job{attempt: attempt}) do
trunc(:math.pow(attempt, 4) + :rand.uniform(20) * attempt)
end
...
We restored the database to the point in time just before we paused and resumed Oban. At that time, the oban_producers table was empty.
@janwillemvd @matthijsqd Do you have a record of any errors, e.g. the producer crashing?
Hi @sorentwo, we've checked the logs. Apart from some job errors, we see no errors indicating that the producer crashed.
Without more details, I don't have much to go on. A few more questions:
Any other details will really help me to diagnose and advise.
We've only just deployed this project on 2022-01-04; since then it has happened twice:
- 2022-01-05 00:35:21.847318, the last job inserted before that was inserted at 2022-01-04 22:55:29.952330
- 2022-01-06 02:16:28.343094, the last job inserted before that was inserted at 2022-01-05 22:34:27.118306
There is only one queue in this project. We have since disabled the rate limiting on this queue and it has not occurred since.
Also, is it normal that the oban_producers table gets cleared? If so, when does that happen?
There is only one queue in this project. We have since disabled the rate-limiting on this queue, and it has not occurred since.
Ok, that seems telling. My hunch is that the producer crashed during a rate-limit update. I'll investigate and try to recreate that situation.
Also, is it normal that the oban_producers table gets cleared? If so, when does that happen?
Yes, that's entirely normal. Producer records are ephemeral and only need to live as long as the queue is alive, so any producer that hasn't refreshed (pinged) recently is a candidate for deletion.
My hunch is that the producer crashed during a rate-limit update. I'll investigate and try to recreate that situation.
Thanks @sorentwo, please let us know if we can help or you want us to test something.
@janwillemvd I had the same issue. Just figured out why.
I ran Oban.check_queue/1 and got this error:
** (exit) exited in: GenServer.call({:via, Registry, {Oban.Registry, {Oban, {:producer, "default"}}}}, :check, 5000)
** (EXIT) an exception was raised:
** (Postgrex.Error) ERROR 23514 (check_violation) new row for relation "oban_jobs" violates check constraint "attempt_range"
table: oban_jobs
constraint: attempt_range
Failing row contains (2208032, default, App.Jobs.SyncInvoiceJob, {"invoice_id": "<invoice_id>"}, {"{\"at\": \"2022-02-23T02:38:22.859813Z\", \"error\": \"** (Und..., 3, 2, 2022-02-23 02:38:20.344931, 2022-02-23 02:38:38.859775, 2022-02-23 07:54:20.42105, null, {app@127.0.0.1,6eacdd69-adc6-470a-ab4f-75161237d360}, 2022-02-23 02:38:39.350624, 0, {}, {}, null, executing).
Then I checked this record (2208032): the attempt and max_attempts are both 2 and the state is available. It should be discarded or completed, not available. I manually updated the state to discarded and it worked.
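For anyone wanting to do the same from iex, a hedged sketch of that manual fix using Ecto; the job id comes from the error above and MyApp.Repo is a placeholder:

import Ecto.Query

# Move the offending job out of the available state, mirroring the manual fix
# described above.
MyApp.Repo.update_all(
  from(j in Oban.Job, where: j.id == 2_208_032 and j.state == "available"),
  set: [state: "discarded", discarded_at: DateTime.utc_now()]
)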
Hope this helps!
Thanks @kenzhengguan, but I think your example differs from my issue.
@sorentwo Did you maybe find some time to investigate this issue further? Can we help with anything?
@janwillemvd Sorry, I missed your previous comment. Since this conversation I diagnosed a sneaky cause of stuck jobs. It came down to a lingering transactional advisory lock that stopped any job availability announcements. In most cases a redeploy fixed the lock, and then jobs would run again.
In v2.11 the stager doesn't use transactional locks at all. Have you tried running that version?
If jobs still don't run for you on that version with rate limiting enabled, I'll ask for more info and try to replicate it.
No worries @sorentwo, thanks!
We'll upgrade to v2.11 in the upcoming weeks and I'll definitely let you know whether this solved our problem (of course, we'll enable the rate limiting again in v2.11).
I'm seeing this same issue with Oban v2.12.0 and Oban.Web v2.9.1. We have a few daily jobs that kick off at the same time each day; for the past 4 straight days they go into the Available state and sit there (nothing executing). Running drain_queue on them works fine and they run successfully (as does restarting the server), but otherwise they seem to just be stuck. Not using any kind of rate limiting.
Config:
config :my_app, Oban,
repo: MyApp.Repo,
plugins: [
# retain cancelled, completed, and discarded jobs for 60 days (max_age is in seconds)
{Oban.Plugins.Pruner, max_age: 60 * 24 * 60 * 60},
{Oban.Plugins.Cron,
crontab: [
{"0 6 * * *", MyWorker1, queue: :first_queue},
{"0 6 * * *", MyWorker2, queue: :second_queue},
{"0 6 * * *", MyWorker3, queue: :third_queue}
],
timezone: "America/Los_Angeles"}
],
queues: [
first_queue: 1,
second_queue: 1,
third_queue: 1
]
Worker config (each of the 3 worker types is very similar except for the queue name):
use Oban.Worker,
queue: :first_queue,
max_attempts: 5,
unique: [period: :infinity, states: [:available, :scheduled, :executing, :retryable]]
@backoff_seconds 60 * 60
I ran Oban.check_queue on them and verified the queues were not paused. They run successfully when I iex in and manually run Oban.drain_queue(queue: :first_queue), etc.
@mikeover
@sorentwo Thanks for the quick response!
Hi @sorentwo. Thank you for this great library.
Mike's colleague here. We are actually running two Oban nodes, one in each released application (two different applications in total). The first one is the core application (app A), which runs Oban with queues, while the other is more of an internal app (app B), where we run Oban as well but with queues disabled, plus Oban Web. (We don't want to run Oban Web in app A for security reasons, because we don't want to expose the Oban Web dashboard from app A.)
We have managed to "fix" the bug by setting peer: false in app B. I believe the issue we had previously (available jobs getting "stuck") is that the Stager also runs in app B, where queues: [] is set. (We cannot set plugins: false since we need two plugins to make Oban Web work.) So essentially both Stagers (from app A and app B) are competing with each other, and if the Stager from app B wins, it transitions jobs to available, but no queue producer is notified, since queues: false is set in app B on purpose.
So by setting peer: false in app B, the Stager no longer does its work (searching for scheduled jobs that are ready to be transitioned to available), and we don't have the previous issue anymore.
However, I feel like it might not be the optimal solution and a bit hacky. It would be awesome if we could hear your opinion on our approach.
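For clarity, a hedged sketch of roughly what that app B configuration looks like with the workaround applied; the module and app names are placeholders, not taken from this thread:

config :app_b, Oban,
  repo: AppB.Repo,
  # This node only hosts the Oban Web dashboard, so no queues run here.
  queues: false,
  # Opting out of leadership keeps this node's Stager from staging jobs.
  peer: false,
  plugins: [
    Oban.Plugins.Gossip,
    Oban.Web.Plugins.Stats
  ]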
@bruteforcecat Considering your setup, your solution isn't hacky. However, your analysis of how the stager works is slightly off. The stager plugin runs on any node that doesn't have plugins: false, but it will only stage and notify about jobs on the leader node. We'll add a section about this to the troubleshooting guide, since it bites people with more advanced setups like yours.
Also, the node running Oban Web only needs to run the Stats plugin if it isn't running any queues. The Gossip plugin only needs to run on a node that runs queues. The relationship is that Gossip broadcasts information about the queues and Stats receives it, essentially.
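In config terms, that split looks roughly like this (a hedged sketch with placeholder names):

# Node that runs queues: Gossip broadcasts queue state.
config :app_a, Oban,
  repo: MyApp.Repo,
  plugins: [Oban.Plugins.Gossip],
  queues: [default: 10]

# Dashboard-only node: Stats receives that state; no queues, no Gossip needed.
config :app_b, Oban,
  repo: MyApp.Repo,
  plugins: [Oban.Web.Plugins.Stats],
  queues: false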
@sorentwo Thank you for your explanation :) Unfortunately, we are still having the issue even when running only one node (app A), after unmounting Oban Web from app B, and we are NOT using PgBouncer. We are now trying to switch the notifier to PG and will see if the stuck-job issue is resolved after that. Does that mean it's 100% related to Postgres NOTIFY?
Btw, we have been having some random DB disconnection issues (since before integrating Oban). After briefly looking into the Oban.Notifier.Postgres source code, the process should be terminated and restarted if a disconnection happens, so it shouldn't cause the stuck available-job issue. Or is there something I might have overlooked?
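For reference, switching notifiers is a one-line config change. A hedged sketch, assuming an Oban version that ships Oban.Notifiers.PG and nodes connected through Distributed Erlang (which that notifier relies on):

config :my_app, Oban,
  repo: MyApp.Repo,
  # Use Erlang process groups instead of Postgres LISTEN/NOTIFY for pubsub.
  notifier: Oban.Notifiers.PG,
  queues: [default: 10]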
Does that mean it's 100% related to Postgres NOTIFY?
That seems most likely. It would be helpful if you could add instrumentation to verify that the stager is executing. Even some simple logging like this would do:
require Logger

# Log Stager telemetry so we can confirm the plugin keeps running.
tracer = fn _event, _measure, meta, _conf ->
  if meta.plugin == Oban.Plugins.Stager do
    Logger.info(meta)
  end
end

:telemetry.attach("oban-plugins", [:oban, :plugin, :stop], tracer, [])
After briefly looking into the Oban.Notifier.Postgres source code, the process should be terminated and restarted if a disconnection happens, so it shouldn't cause the stuck available-job issue. Or is there something I might have overlooked?
That's correct. It will automatically reconnect after a random disconnection.
@bruteforcecat We also encountered the same problem.
By debugging Oban and Postgrex, we found that the problem was caused by a broken TCP connection on the Postgres LISTEN side.
Laurenz Albe wrote an article on using TCP keepalive to detect dead connections: https://www.cybertec-postgresql.com/en/tcp-keepalive-for-a-better-postgresql-experience/
We modified the configuration along the lines of that article:
config :my_app, MyApp.Repo,
  # Server-side keepalive: probe connections idle for 60s, retry every 5s,
  # and drop the connection after 3 failed probes.
  parameters: [
    tcp_keepalives_idle: "60",
    tcp_keepalives_interval: "5",
    tcp_keepalives_count: "3"
  ],
  # Enable TCP keepalive on the client socket as well.
  socket_options: [keepalive: true]
The production environment has been running for a few weeks now and has not encountered this problem again.
Hey @sorentwo! First of all, thanks for the amazing work on this lib! It helps us a lot at the company :)
I'm having the same problem: our jobs become stuck in the available state after a while. Our setup:
We are running an umbrella application, so it's more or less like this:
- core: uses the public schema on Postgres. Only a few jobs, nothing intensive here.
- importer: uses the importer schema on Postgres. Here we have more jobs, but not that many, around 500 a day.
We are running the app on only one node (both core and importer are in the same node).
We were using Oban 2.7 and it was all running fine, but then we upgraded to 2.13 and the jobs in the importer app get stuck after a while. When we restart the app, the jobs go back to being processed.
We did some debugging:
- The Stager plugin kept working after the queue stopped (the logs didn't stop).
- We checked Oban.Peer, since it's the bigger change (I think, haha) from v2.7. But after the queue stopped, we saw that leadership was fine (the node continued to be the leader and the oban_peers table was being updated).
Any thoughts on what might be the problem here?
@jotaviobiondo It sounds like you're experiencing the broken TCP connection issue mentioned above. The stager is still running but can't broadcast job availability to the appropriate queues. That wasn't an issue back on Oban 2.7 because every queue polled for available jobs itself.
An open issue aims to remedy this problem automatically: https://github.com/sorentwo/oban/issues/769.
In the meantime, you have a couple of options:
Thanks for the quick reply! Yes, that makes sense; we noticed that the queue stops when it doesn't process anything for a while.
We will use the keepalive solution for now, thanks for the help!
Environment
Elixir & Erlang/OTP versions (elixir --version): Elixir 1.12.0 (compiled with Erlang/OTP 24). Also happens on previous versions.
Current Behavior
Jobs are stuck in the available state; only a manual drain_queue/2 moves these jobs.
DEBUG