joomla-projects / soc21_website-cronjob

GNU General Public License v2.0

Words of advice from someone who implemented this in 2010 and discontinued it in 2013 #39

Closed by nikosdion 2 years ago

nikosdion commented 3 years ago

I took a look at the plugin and you are doing exactly what I was doing eleven years ago with the Akeeba Backup Lazy Scheduling plugin, right down to injecting JavaScript to avoid the page load performance penalty and calling it “lazy scheduling”. I suppose one of you was too young to remember that, and the other may not remember it or may never have used it.

Here are the reasons why lazy scheduling does not work in the real world:

  1. You cannot run long running tasks without PHP timing out. How long is “long”? Well, that depends on the PHP maximum execution time. Sure, you can set an infinite time limit, if the host has not disabled it. Then you are limited by the maximum time the server will keep the connection to the PHP process open. If you are using PHP-FPM on Apache this is about 120 seconds. But you may still time out because of ulimits, namely the maximum CPU time a process can use before being killed by the kernel (SIGXCPU at the soft limit, SIGKILL at the hard limit). Realistically speaking, anything over 10 to 20 seconds is unlikely to survive.
  2. You depend on user traffic to execute CRON jobs. This is only reliable on sites with a steady stream of traffic throughout the day. In our experience the majority of sites trying to use lazy scheduling are those which receive very little and inconsistent traffic. A CRON job may not execute until hours or days past its scheduled time, daily tasks may only run a few times every week / month etc.
  3. If I understand this correctly, you are doing task recovery when the task is found to be locked. I have this tingling feeling that this opens you to the possibility of unlocking a locked task while it's still running. Also, since the lock is implemented in PHP and MySQL there is a window of opportunity of around 100 msec on cheap shared hosting where another client will also be able to acquire a lock because you are not locking the table for writing before trying to acquire a lock. See point 6 below.
  4. If you use caching your JavaScript becomes unreliable. This is probably a lesser concern with your design but I can't really tell because at this point in time it seems that some key bits are missing...
  5. If you are using this on a site behind a caching proxy / CDN it's not guaranteed that the response you get will be from the server having processed the next scheduled task as opposed to from the caching proxy or CDN. Yes, Joomla sends the right HTTP headers but I know of at least one third party cache (Litespeed Cache) which ignores them and breaks the lazy scheduling.
  6. Since your JavaScript is running potentially on dozens of clients simultaneously you have a concurrency issue. You cannot guarantee that two requests won't come close enough to each other that they will both end up running the same task. Solving the concurrency requires having an atomic FIFO queue. You cannot really fake this with MySQL, especially since you're not locking the tables to minimize the possibility of concurrent requests acquiring simultaneous locks. Having truly atomic FIFO queues requires something like Redis which is of course incompatible with the target audience of the lazy scheduling solution.
  7. If the browser process is closed while a CRON job is running the TCP/IP socket to the server is closed. This sends a user abort signal to PHP. Yes, you do use ignore_user_abort to work around that but this may not be enough if there is any output, e.g. a PHP notice. Remember that the PHP docs say that the User Abort signal is only honored until PHP tries to send anything to the output, at which point it discovers the pipe is broken and goes belly up. In this case the CRON task will be terminated prematurely with unpredictable results. As it happens, sites on craphosts are also the sites where you see atrocities like error reporting set to maximum.
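To make point 2 concrete, here is a minimal sketch of how visitor traffic dictates when a "daily" task really runs. It is a simplified model, not the plugin's logic: the function name `lazy_run_times`, the visit times and the rule that the next due time counts from the actual run are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def lazy_run_times(due, interval, visits):
    """Return (scheduled, actual) pairs for a task that only runs on a visit.

    Simplified, hypothetical model: a visit runs the task if it is due,
    and the next due time is counted from the moment it actually ran.
    """
    runs = []
    for visit in sorted(visits):
        if visit >= due:
            runs.append((due, visit))
            due = visit + interval
    return runs


start = datetime(2021, 8, 1, 0, 0)
# A low-traffic site: five page views spread over roughly five and a half days.
visits = [start + timedelta(hours=h) for h in (9, 10, 55, 56, 130)]
runs = lazy_run_times(start, timedelta(hours=24), visits)

for scheduled, actual in runs:
    print(f"due {scheduled:%d %H:%M}, ran {actual:%d %H:%M}, late by {actual - scheduled}")
```

With these made-up visits the daily task runs only three times in more than five days, and the last run is 51 hours late.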

Further to that I have concerns about the security of this solution. Take this with a grain of salt as there seem to be some bits missing and I am trying to speculate on the intended design, not the partial design I am seeing in the code. You are sending a static key to a potentially untrusted client. This static key is the only thing that prevents someone from running any scheduled task repeatedly. I cannot fully verify this since the JavaScript references a plugin that does not exist(?) so I'm unclear on how it will actually work.

In any case, there's no point using a key when this key is going to be communicated to the client. If it goes to the client it can be intercepted. If it's not a one–off key it can be reused. If it is a one–off key it cannot work with caching (Joomla, browser, server-side or CDN/proxy) any more than the anti–CSRF token does, i.e. all pages using it, and therefore the entire site, cannot have their output cached. While you do deal with the Joomla caching you can do nothing about all the other caching. Effectively, the only way to operate is either without a key or with a key that's communicated to the client, which is equivalent to having no key. This would allow a client to hammer the site with requests to run a scheduled task, which may cause a concurrency issue (see point 6 above).

I understand what you are trying to do and why you are trying to do it. I thought that it was a clever solution, too, eleven years ago. Unfortunately, it's not. It's an unmitigated disaster waiting to happen on real sites out there 😞

The only viable solution is to NOT have lazy scheduling AT ALL. Instead, have a scheduled task runner which can be triggered from a URL locked with a key. You can use redirections to keep running long–running tasks which do not report they are done, like I do with backups, but note that most pseudo–CRONs don't support them. The scheduling URL can be triggered by the pseudo–CRON that most servers without real CRON offer, which can access a URL on a schedule (ideally set up to run every minute since you're doing scheduling with PHP and MySQL), or by a third party pseudo–CRON service such as WebCRON, again like we advise our backup clients to do.

Of course real CRON jobs are preferable, but if someone can create CRON jobs why would they go through yet another layer of configuration in Joomla now that Joomla has a real CLI client? So, yeah, that'd be pointless. I still see you pursue this and I understand why, but note that it hinges on the assumption that lazy scheduling DOES work and developers WILL use it instead of offering real CLI commands. That is not gonna happen for anything non–trivial, from taking backups to sending newsletters and from importing large product stock CSVs to batch processing images and everything in between.

At best you have created more work for us 3PDs, who either have to explain these non–obvious points to our clients as to why we don't use a scheduling solution destined to be a disaster OR implement it anyway and let our clients figure out just how disastrous it is for themselves. I plan on the latter, with a BIG, FAT warning in the documentation and the plugin, and redirecting all scheduling issues to you guys to resolve, for I already did that eleven years ago and I'm not willing to waste my time again on something I know can't work 😉
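The key-locked runner URL with repeated calls for long tasks could look roughly like the sketch below. It is written in Python purely for illustration and is not Akeeba's or Joomla's implementation: `run_step`, its parameters and the three status strings are hypothetical names. The idea is that each request checks the key, does a bounded amount of work and tells the caller whether to hit the URL again.

```python
import hmac
import time


def run_step(supplied_key, secret_key, pending, handle, budget_seconds=5.0):
    """Process queued work items until a time budget is spent.

    Returns "forbidden" (bad key), "continue" (call the URL again)
    or "done". All names here are hypothetical.
    """
    # Constant-time comparison, so the key is not leaked through timing.
    if not hmac.compare_digest(supplied_key.encode(), secret_key.encode()):
        return "forbidden"
    deadline = time.monotonic() + budget_seconds
    # Stop well before PHP-style execution limits would kill the request.
    while pending and time.monotonic() < deadline:
        handle(pending.pop(0))
    return "continue" if pending else "done"


processed = []
queue = list(range(5))

rejected = run_step("wrong", "s3cret", queue, processed.append)
# A zero budget stands in for a request that ran out of time immediately.
first = run_step("s3cret", "s3cret", queue, processed.append, budget_seconds=0.0)
# The caller (redirect or web cron) simply calls again until the work is done.
second = run_step("s3cret", "s3cret", queue, processed.append)
print(rejected, first, second, processed)
```

Keeping each request short is what lets a long job survive the timeouts listed in point 1, at the cost of needing a caller that honours the "continue" answer.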

In any case, these are my words of advice and concern from someone who did implement this eleven years ago and discontinued the feature in 2013 after seeing what kind of disaster it was for anything but the shortest, most trivial tasks, which could have been handled far more easily by using a third party pseudo–CRON service to begin with.

I understand that Joomla is trying to implement this to compete with WordPress. The thing is, WordPress' CRON jobs are CRON jobs only in name. These are all minor tasks that take a second or two. The scheduling is also simply not present. You might get these tasks to run sometime or not, pretty much. Anything more serious does not use that CRON system. Joomla is incorrectly positioning lazy scheduling as a replacement for real CRON jobs which is IRRESPONSIBLE AND IMPOSSIBLE. Been there, done that, got the t-shirt.

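At the very least I'd do the following:

* Document the limitations of lazy scheduling, setting the expectation that it can only be used for TRIVIAL tasks and MAY cause unintended side–effects on sites which are infrequently accessed, too busy or have caching beyond Joomla's own cache applied to them.
* Get rid of the key and explain that security and lazy scheduling are incompatible notions so anyone who wants to use it should proceed with utmost care.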

Sorry for the long post, here's a potato. 🥔

bembelimen commented 3 years ago

Hello Nicholas,

thank you very much for sharing your experience and ideas, we really appreciate it. As always there is a lot of truth in your post. Let me clarify some points.

1 + 2

Both are issues we thought about in the past, too. Here I think the only solution is to communicate it better (as you suggested below).

> 3. If I understand this correctly, you are doing task recovery when the task is found to be locked. I have this tingling feeling that this opens you to the possibility of unlocking a locked task while it's still running.

Yep, we'll change this part by removing this recovery and adding a manual option to recover (the user can decide).

> Also, since the lock is implemented in PHP and MySQL there is a window of opportunity of around 100 msec on cheap shared hosting where another client will also be able to acquire a lock because you are not locking the table for writing before trying to acquire a lock.

We've put the select/write option in one query now: https://github.com/joomla-projects/soc21_website-cronjob/pull/41/files#diff-b9f75ae33fad36d7a8968d12a2b78ec972c5e6d2c6fd7ed0024d764191ec48ecR303-R311
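As an illustration of why folding the check and the write into one statement closes that window, here is a minimal sketch. It is not the code from the linked pull request: it uses Python with SQLite so it is self-contained, and the `tasks` table, the `locked_at` column and the `try_acquire` helper are hypothetical. The lock is claimed with a single conditional UPDATE, so there is no gap between reading the lock state and writing it.

```python
import sqlite3


def try_acquire(conn, task_id, now):
    """Claim a task with one conditional UPDATE; True if this caller won.

    Hypothetical schema: tasks(id, locked_at), NULL meaning "not locked".
    """
    cur = conn.execute(
        "UPDATE tasks SET locked_at = ? WHERE id = ? AND locked_at IS NULL",
        (now, task_id),
    )
    conn.commit()
    # Exactly one affected row means we took the lock; zero means someone else holds it.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, locked_at TEXT)")
conn.execute("INSERT INTO tasks (id, locked_at) VALUES (1, NULL)")
conn.commit()

first = try_acquire(conn, 1, "2021-08-01 10:00:00")
second = try_acquire(conn, 1, "2021-08-01 10:00:00")
print(first, second)
```

The second caller sees zero affected rows and backs off instead of running the same task.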

> 5. If you are using this on a site behind a caching proxy / CDN it's not guaranteed that the response you get will be from the server having processed the next scheduled task as opposed to from the caching proxy or CDN. Yes, Joomla sends the right HTTP headers but I know of at least one third party cache (Litespeed Cache) which ignores them and breaks the lazy scheduling.

Yes, I think there is not much we can do here but to communicate it.

> 6. Since your JavaScript is running potentially on dozens of clients simultaneously you have a concurrency issue. You cannot guarantee that two requests won't come close enough to each other that they will both end up running the same task. Solving the concurrency requires having an atomic FIFO queue. You cannot really fake this with MySQL, especially since you're not locking the tables to minimize the possibility of concurrent requests acquiring simultaneous locks. Having truly atomic FIFO queues requires something like Redis which is of course incompatible with the target audience of the lazy scheduling solution.

We tried to limit the possibility of this with the code mentioned above, and now also with table locking: https://github.com/joomla-projects/soc21_website-cronjob/pull/41/files#diff-b9f75ae33fad36d7a8968d12a2b78ec972c5e6d2c6fd7ed0024d764191ec48ecR328-R332

> Further to that I have concerns about the security of this solution. Take this with a grain of salt as there seem to be some bits missing and I am trying to speculate on the intended design, not the partial design I am seeing in the code. You are sending a static key to a potentially untrusted client. This static key is the only thing that prevents someone from running any scheduled task repeatedly. I cannot fully verify this since the JavaScript references a plugin that does not exist(?) so I'm unclear on how it will actually work.

There is no key for the lazyScheduler call, which is triggered via Ajax, as the user can't target any task directly from here, so it's more a request: "Hey Joomla! run, if you want". The key is used for webcron, and that key is not sent to the user. Ofc it's "visible" to the supplier offering the webcron service, but not to site visitors in general.

> [...] I'm not willing to waste my time again on something I know can't work

Fair enough.

> At the very least I'd do the following:
>
> * Document the limitations of lazy scheduling, setting the expectation that it can only be used for TRIVIAL tasks and MAY cause unintended side–effects on sites which are infrequently accessed, too busy or have caching beyond Joomla's own cache applied to them.

Yes, that we'll do.

> * Get rid of the key and explain that security and lazy scheduling are incompatible notions so anyone who wants to use it should proceed with utmost care.

See comment above, no key there for the AJAX scheduler call.

> Sorry for the long post, here's a potato. 🥔

Thanks again, it was good to discuss these different areas.

nikosdion commented 3 years ago

Thank you for the detailed reply! I am very happy that you are addressing all these concerns.

1, 2: Awesome! Not a lot of people will read the documentation but if it's there at least we can point them to it if they have questions about why component X doing something long and complicated doesn't use it or if they do end up having a timeout why it happens.

3 + 6 (locking): Perfect! Table locking does solve these problems.

3 (recovery): Thank you, that's a much better idea. In my experience the task recovery was the reason for a good third of the problems we were seeing.

5: Yes, it should be in a “Caveats” documentation page along with 1 + 2.

> There is no key for the lazyScheduler call, which is triggered via Ajax, as the user can't target any task directly from here, so it's more a request: "Hey Joomla! run, if you want". The key is used for webcron, and that key is not sent to the user. Ofc it's "visible" to the supplier offering the webcron service, but not to site visitors in general.

Ah! It makes much more sense now. That was the part I couldn't understand because the implementation of the AJAX call was referencing methods I could not find.

I guess the only major problem which could arise is simultaneous requests but you are addressing that with table locking so that should be dealt with. So, all good.

> > I'm not willing to waste my time again on something I know can't work
>
> Fair enough.

In retrospect, there are some use cases for me. Definitely not Akeeba Backup but maybe in Akeeba Ticket System because I'm already implementing the “webcron” approach with custom code.

Just a question / feature request: will it be possible for the developer to say “this task DOES NOT support pseudo–CRON” (or, more generally, indicate which scheduling methods are supported)? If that's possible I could let the user schedule longer running tasks like taking backups or fetching email to create helpdesk tickets using the CLI scheduling. I had written a VERY rough prototype a couple of years ago which required a CRON job to run every few minutes. I put it to the side as I needed the time to improve my extensions for Joomla 4. What you're doing is more refined than my rough prototype and I'd rather use core Joomla code than resurrecting the prototype and reinventing the wheel. Of course that would only work for me if I could say that this backup task is only compatible with CLI CRON jobs, this email fetch task is compatible with webcron and CLI CRON jobs, this task for auto–closing old tickets is compatible with lazy scheduling, webcron and CLI CRON jobs. In other words, based on the expected duration of a work quantum and whether the task can be quantized I'd be able to disable the scheduling methods which wouldn't work with it. You get the idea? If not, I can explain further.
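The request above amounts to each task declaring which triggers it tolerates. A minimal sketch of that idea, assuming hypothetical names throughout (`Trigger`, `SUPPORTED`, `can_run` and the task keys are illustrative and not part of Joomla's scheduler API):

```python
from enum import Flag, auto


class Trigger(Flag):
    """Ways a scheduled task can be started (hypothetical names)."""
    LAZY = auto()
    WEBCRON = auto()
    CLI = auto()


# Each task declares the triggers it can safely run under,
# mirroring the three examples given in the comment above.
SUPPORTED = {
    "backup": Trigger.CLI,
    "fetch_email": Trigger.WEBCRON | Trigger.CLI,
    "close_old_tickets": Trigger.LAZY | Trigger.WEBCRON | Trigger.CLI,
}


def can_run(task, trigger):
    """True if the task has declared support for this trigger."""
    return bool(SUPPORTED[task] & trigger)


print(can_run("backup", Trigger.LAZY))
print(can_run("fetch_email", Trigger.WEBCRON))
```

A scheduler built this way would skip a task whose declared triggers do not include the one that woke it, leaving the long-running jobs to the CLI.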

After this discussion I am feeling much more positive about this feature!

bembelimen commented 3 years ago

Hello Nicholas,

> Just a question / feature request: will it be possible for the developer to say “this task DOES NOT support pseudo–CRON”

Not (yet), but I really like the idea. At the very beginning @ditsuke implemented a "trigger" field to define how the task should (could) be executed, as a parameter the users themselves had to set. But I suggested removing it back then, as it was hard to explain. As we have a few days left, let me see what I can do regarding your suggestion.

nikosdion commented 3 years ago

Awesome! Thank you very much! This has the potential to be a very powerful feature for power users like me and many of my clients :)

bembelimen commented 2 years ago

Just a little update regarding the CLI only function: https://docs.joomla.org/J4.x:Task_Scheduler

nikosdion commented 2 years ago

@bembelimen Perfect! This is super easy to understand. Thank you!