Harden background job handling

bekarice commented 7 years ago

In our utilities classes, we commonly run into a couple issues:

Object caching overrides that improperly handle options (don't purge cache when an option is updated, so job progress is incorrect) -- this can cause duplicated rows in the export file, or extend the time for processing an import.
Hosts that block or rate-limit POST requests to the admin-ajax endpoint. This will cause the job to remain queued, as the maybe_handle() method in the async request class doesn't fire right away or at all.

To resolve these, we should:

[x] Use direct SQL queries for setting / getting job status because some hosts don't properly clear the option cache when requested. This will bypass caching.
[x] Investigate alternatives for dispatching a job to move it to processing that don't involve a POST request to the admin-ajax endpoint.
- [ ] With ZD 544840, it turned out the server couldn't find its own hostname, so wp remote post using cURL couldn't look up its own DNS and therefore execute that request. We could do some lower level checks to try to ensure the hostname is known, as there have been a couple instances that were similar that I think the same misconfiguration could have been happening.

Related issues

https://github.com/skyverge/wc-plugins/issues/1758
https://github.com/skyverge/wc-plugins/issues/1726
(CSV Import will need an issue for this too)

Related threads

ZD 527590
ZD 544840
ZD 544490
A few others that weren't tagged on our end.

bekarice commented 7 years ago

updated OP with notes to check that server hostname is properly configured

ragulka commented 7 years ago

Hosts that block or rate-limit POST requests to the admin-ajax endpoint. This will cause the job to remain queued, as the maybe_handle() method in the async request class doesn't fire right away or at all.

@bekarice wait... do they only limit POST requests? Could we get away by simply using wp_safe_remote_get() instead of wp_safe_remote_post()? Or am I missing something? They both support non-blocking cURL requests. I just tested this locally and the async part worked flawlessly.

Also, just found this article covering a few alternatives we could look into: https://segment.com/blog/how-to-make-async-requests-in-php/

With ZD 544840, it turned out the server couldn't find its own hostname, so wp remote post using cURL couldn't look up its own DNS and therefore execute that request. We could do some lower level checks to try to ensure the hostname is known, as there have been a couple instances that were similar that I think the same misconfiguration could have been happening

I'm curious how this happens? When we wp_remote_post(), we'll simply use admin_url() to create the endpoint to request. This, in turn uses get_option( 'siteurl' ) to get the base url from the database. The siteurl option is set during install, using wp_guess_url(). So if my assumptions are correct, then the site should not really function at all for that customer, since WP so heavily relies on the siteurl option being correctly set. However, it might be that I misunderstood the whole issue.

ragulka commented 7 years ago

Use direct SQL queries for setting / getting job status because some hosts don't properly clear the option cache when requested. This will bypass caching.

@bekarice do you think we simply want to have a simple SQL query or should we create our own versions of add/update/get_option functions that also trigger all the hooks and filters that are normally run?

EDIT: thinking more about this - I'm not sure how the Object Cache in GoDaddy/WPE works (do they supply their own version of WP_Object_Cache?), but would it be feasible to invalidate/flush the cache before add/update/get_option? For example:

wp_cache_delete( "{$this->identifier}_job_{$id}" );
$results = get_option( "{$this->identifier}_job_{$id}" );

Or am I completely off track here?

maxrice commented 7 years ago

AFAIK GoDaddy/WPE (and probably other hosts) provide their own version of the object cache so they can use other persistence layers like Redis, memcached, etc. When we were troubleshooting the tickets with these issues, I think we tried clearing the cache but it didn't have any effect.

For the purposes of a job processing system, you really don't want any caching involved at all. I think we'd want simple SQL queries without any hooks or filters.

re: the rate-limited POST requests, I don't know that there is a good solution. A GET requested might work better, but I suspect it's more likely to be cached, which is bad. I think what we would benefit from here is a health check that confirms the server can make POST requests to it's own admin-ajax.php endpoint, so that we have an easy way when doing support to know what's going wrong.

ragulka commented 7 years ago

re: the rate-limited POST requests, I don't know that there is a good solution. A GET requested might work better, but I suspect it's more likely to be cached, which is bad. I think what we would benefit from here is a health check that confirms the server can make POST requests to it's own admin-ajax.php endpoint, so that we have an easy way when doing support to know what's going wrong.

Hm, you're probably right about GET requests possibly being cached. But could we use cache-busting? If the URL is different each time, then the it should not be cached, right? For example, we can simply append a md5 of the job id and current timestamp to the url, something like this:

http://example.com/wp-admin/admin-ajax.php?action=run_background_process&stamp=jdshh8272318jsadas847

Given that query-strings are not cached on some CDNs/proxies, it should work in our favour. What do you think?

maxrice commented 7 years ago

Great point, we could definitely add a timestamp to the URLs which should work. I'm still not sure if it's the HTTP method that's the real issue with these sites or other weirdness (like the one where the server's DNS was busted). It stands to reason that if a hosting provider is blocking or rate limiting POST requests to the admin-ajax endpoint, they'd probably do that for any type? Before we make this kind of change, I'd like to be able to test on a server that has this specific issue, so we can see if changing the method makes any difference. @bekarice any recent support tickets that we could use here?

bekarice commented 7 years ago

@maxrice @ragulka ZD 544490 has picked back up again actually, that's with WPE so that's probably the best candidate since that seems to be the most problematic with rate limiting.

ragulka commented 7 years ago

@bekarice I've patched CSV Export with direct SQL queries and GET requests in https://github.com/skyverge/wc-plugins/pull/2110 - can this be tested on the customer's site to see if it fixes the issue for them?

bekarice commented 7 years ago

@ragulka getting FTP creds for that site -- WPE says object caching is off, but there's some duplication happening. will loop you into that thread once I have it.

maxrice commented 7 years ago

Following up on this with ZD 568347 -- this host blocks so-called "loopback connections" where the server is attempting to connect to itself, responding to both GET and POST requests with HTTP 403. Obviously our async process never starts, and exports are left in the queued state with only the column headers present.

Amusingly (and I guess expected), if you copy the generated URL that the dispatch method tries to connect to, and visit that in the browser, that kicks off all the background jobs.

I'd say that we need

1) a big red admin notice that indicates to the admin that their hosting provider is blocking loopback connections. Ideally we'll check this on plugin install/upgrade, and provide a system status tool for good measure. 2) A potential workaround where we make client-side Ajax requests to the dispatch URL in order to get the exports to process. Since they won't appear to be coming from the server, there shouldn't be any issues with these lame hosting providers in that case. The obviously caveat here is that the user won't be able to leave the page when the background job is in-progress, or else processing will stop.

bekarice commented 7 years ago

I'm not a huge fan of option 2, but I guess it's at least a workaround to use as a fallback if we detect that the host won't accept this connection. We'd just want to keep this in the job notice + admin notice so people know it will function differently for them, and you'd have to have a js alert likely if they try to leave the page.

Would we need to fail gracefully on the job somehow if they do (ie delete it or just stop progress on the current row)?

I think with Max's points and this

With ZD 544840, it turned out the server couldn't find its own hostname, so wp remote post using cURL couldn't look up its own DNS and therefore execute that request. We could do some lower level checks to try to ensure the hostname is known, as there have been a couple instances that were similar that I think the same misconfiguration could have been happening

we could likely close this out.

cc @mikejolley who may be interested in a couple of the above changes (I think $wpdb vs update_option() / get_option() in #209 has helped at least, but still a bit early to tell if it's solved the majority of caching issues).

ragulka commented 7 years ago

I think a fallback js version with an alert when trying to leave the page is a fine workaround. Although even doing so we should really encourage merchants to contact their hosts to enable loopback connections.

bekarice commented 7 years ago

one limitation to consider: auto-exports probably couldn't use a js / client side request system since it would require the browser to fire requests, so this really would only be a workaround for manual exports. with that in mind, do we think it would still be worthwhile to pursue, or just require people to use a reasonable host?

godaddy-wordpress / wc-plugin-framework

Harden background job handling #195

Related issues

Related threads