vendi-advertising / vendi-cache-2


How to visualize why URLs weren't cached #7

Open cjhaas opened 7 years ago

cjhaas commented 7 years ago

(We've been going back and forth on email with the La Crosse geek group but I decided to move this here to make it more public.)

Background

The previous version of Vendi Cache traps all errors and exceptions using set_error_handler and set_exception_handler; however, we aren't specifying which error levels to trap. Because of this, anything that raises any error at all, even just an E_DEPRECATED or E_USER_WARNING, will automatically disable the cache for that request.
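
For reference, the guard works roughly like this. This is only a minimal sketch, not the plugin's actual code, and vendi_cache_disable_for_this_request() is a made-up stand-in for whatever flag we actually flip:

// Rough sketch only -- not the real plugin code.
// Because no error-level mask is passed as the second argument,
// PHP routes every error level through this handler.
set_error_handler( function ( $errno, $errstr, $errfile, $errline ) {
    // Even an E_DEPRECATED or E_USER_WARNING from someone else's code
    // lands here and turns caching off for this one request.
    vendi_cache_disable_for_this_request(); // hypothetical helper

    return false; // fall through to PHP's normal error handling
} );
// Passing a mask, e.g. set_error_handler( $handler, E_ERROR | E_WARNING ),
// would narrow what we trap.

set_exception_handler( function ( $e ) {
    vendi_cache_disable_for_this_request(); // hypothetical helper
} );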

That last part is very important. The cache gets disabled only for the specific request that triggered the error; this is not site-wide. However, if whatever triggered that error is global-level and gets called on every request, then every request will trip the handler, which effectively means that caching is disabled. I don't think I can stress this enough: Vendi Cache isn't disabled, we're just being too sensitive and someone else's :poop: code is blocking us.

99% of the time the problem is blind array access, which throws an E_NOTICE if the index doesn't exist. Something like:

$id = $_GET[ 'id' ];

Depending on your error reporting level, these errors might actually be output to the HTML stream, which is why the guard was added in the first place.
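
(Not our code to fix, but for completeness, the usual guard the offending plugin or theme should be using is something like this:)

// Guarded versions of the same lookup -- no E_NOTICE when 'id' is missing.
$id = isset( $_GET[ 'id' ] ) ? $_GET[ 'id' ] : null;

// Or, on PHP 7+, the null coalescing operator:
$id = $_GET[ 'id' ] ?? null;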

The Problem

Because of the above, other plugins or even your theme could effectively disable caching for every page, and there's really no way for you to know about it. If you have the hidden debugging setting turned on (off by default, but I strongly recommend always leaving it on, there's no harm in it) then at least you can occasionally "View Source" and look for it, but you'd really need to do that per page, which isn't practical.

What we really want to add is some sort of logging, possibly even an alert, to let admins know that URLs aren't caching. We could write to wp_options, but that could grow really fast and would require database hits. We could create our own database table, but then we're locked out of certain hosts that don't support non-core tables, and also, as a host myself, I find that very annoying. We could also write to a log on disk, but that file could grow very fast on large sites. It'd be really nice to be able to mark actual pieces of content as cached or not cached, but unfortunately we deal with URLs and not posts, and the way that we store the files on disk for compatibility with Apache/Nginx makes this pretty impractical.
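
Just to make the wp_options concern concrete, here's a sketch of what that approach would look like. The option name and function are hypothetical, purely for illustration:

// Hypothetical wp_options-based log. Every non-cached request costs at least
// one database read and one write, and the serialized option row only grows.
function vendi_cache_log_uncached_url( $url, $reason )
{
    $log = get_option( 'vendi_cache_uncached_log', array() );

    $log[] = array( 'url' => $url, 'reason' => $reason, 'time' => time() );

    update_option( 'vendi_cache_uncached_log', $log, false ); // don't autoload
}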

The Plan

So here's my plan.

  1. Logging will be disabled by default. This seems like a sane default because it doesn't compromise resources (database, disk, memory, CPU) for other people's poor code.

  2. HTML comment debug message on by default. The HTML comment that we inject at the bottom of the page is innocuous, and we've never seen or heard of a problem with it. Since logging is off by default, this will at least help people troubleshoot from the start.

  3. Logging can be turned on in two ways, database and/or file. For database logging we'll only create our table at that point. We'll also allow something like a constant or environment variable to disable this option for some hosts (see the sketch after this list). We'll probably have some options like "log the last n non-cached requests" and include some pruning logic.
    For file logging we'll (maybe) allow specifying the log file. (We might need to generate its name non-deterministically for security reasons; I still need to think on that.) Doing circular logging efficiently isn't easy (AFAIK), so logging only the last n requests might not be an option. Capping at a MB limit isn't really practical either, because that might only tell you why caching wasn't working last week, not today.

  4. Logging can be turned on per CacheBypass rule. We've broken the reasons to disable caching for a given request out into their own separate rules. As you can see, there are many valid reasons for a request not to be cached, some expected but some not too obvious. We might allow some sort of checkbox UI with sane defaults that lets you opt into logging for each rule.
    Remember, we wake up for every request that your WordPress site sees, not just pages and posts. If your site is working 100% fine but is in maintenance mode, we obviously don't want to cache that. If you are logged in and viewing a page, we also don't want to cache that, because the cached copy would show your admin bar to non-admin visitors. They wouldn't have the proper cookies so there's no security problem, but it would look weird. Also, for idempotency reasons we only want to cache GET requests, nothing else.
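
For the host-level kill switch mentioned in item 3, I'm imagining something along these lines. The constant and environment variable names are made up, nothing is final:

// Hypothetical host-level kill switch for database logging (item 3 above).
function vendi_cache_db_logging_allowed()
{
    // A host can drop this constant into wp-config.php to remove the option entirely.
    if ( defined( 'VENDI_CACHE_DISABLE_DB_LOGGING' ) && VENDI_CACHE_DISABLE_DB_LOGGING ) {
        return false;
    }

    // Environment variable variant of the same idea.
    if ( filter_var( getenv( 'VENDI_CACHE_DISABLE_DB_LOGGING' ), FILTER_VALIDATE_BOOLEAN ) ) {
        return false;
    }

    return true;
}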

Next steps

I'd love people to weigh in on "The Plan". Will it work? Will it support both hosting requirements and site administrators? Besides being the plugin author, I'm in both of those roles too, so I think this is the best possible route.

Does anyone know an efficient way to do circular logging? I can do it by prepending to a file and truncating it hard at a byte count, but that seems ugly. We're using Monolog, and they have a RotatingFileHandler, but that's not exactly what I want. Mostly, I don't want to use up someone's disk because a rogue spider is hitting their calendar and visiting every day up to the year 9999. Yes, I've seen this happen.
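
For what it's worth, the prepend-and-truncate approach I mentioned looks roughly like this; it works, but rewriting the whole file on every entry is the ugly part:

// "Prepend and hard-truncate" circular-ish log. Functional but ugly: the whole
// file is rewritten on every entry, and the cut at the byte limit may split a line.
function vendi_cache_prepend_log_line( $file, $line, $max_bytes = 1048576 )
{
    $existing = is_readable( $file ) ? file_get_contents( $file ) : '';

    $contents = rtrim( $line ) . PHP_EOL . $existing;

    if ( strlen( $contents ) > $max_bytes ) {
        $contents = substr( $contents, 0, $max_bytes );
    }

    file_put_contents( $file, $contents, LOCK_EX );
}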

Alerts?

This is TBD, and if it ever happens it won't be worked on until logging is handled.