processwire / processwire-issues

ProcessWire issue reports.
45 stars 2 forks source link

More than one period in the URL leads to 404 #1807

Open flipzoom opened 1 year ago

flipzoom commented 1 year ago

Short description of the issue

If you use more than one dot (period '.') in the URL of a page, the page cannot be displayed and ends in a 404. No matter at which position the period occurs.

Steps to reproduce the issue

  1. Add more than one dot somewhere in the URL of a page, e.g. "/my-wonderful.test.page".
  2. Try to visit this URL > 404

Setup/Environment

flipzoom commented 1 year ago

The bug also exists in older versions, currently tested on:

flipzoom commented 1 year ago

Okay, found the problem, it's not not a bug in PW core but in ProCache (Version 4.0.4). When using method C in Buster the following HTACCESS rule is added.

RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.+)\.([a-zA-Z0-9]+)\.([^.]+)$ $1.$3 [L]

This causes the 404 on multiple periods.

flipzoom commented 1 year ago

@ryancramerdesign, I think there is only one workable solution here, the "([^.]+)" group is the problem because it simply checks to see if there is no period. Even if you search only for alphanumeric characters of a file extension there are false positives.

screenshot-regex101 com-2023 09 06-11_58_35

The only solution I see is to include all possible file extensions in the Rule, but that's a bottomless pit. I could also just link a file e.g. ".XLS" as download in the editor and the TextformatterBuster changes the URL. But since XLS would not be configured in the HTACCESS at that time, it would not work.

^(.+)\.([a-zA-Z0-9]+)\.(jpe?g|gif|png|ico|webp|svg|avif|pdf|AND-SO-ON)$

Do you have any other idea?

flipzoom commented 1 year ago

Perhaps a unique identifier of versioning. E.g. "pcb-[RANDOM-VERSION]". PCB for ProCacheBuster. Similar to the other methods where "v=" is appended, the rule could look like this:

^(.+)\.pcb-([a-zA-Z0-9]+)\.([^.]+)$

screenshot-regex101 com-2023 09 06-12_07_55

ryancramerdesign commented 1 year ago

@FlipZoomMedia Thanks, I wasn't aware of that issue with the Buster option C. From what I understand, the benefits of the C option (better client or proxy caching) are no longer the case, so maybe the option C isn't really even needed anymore, though subjectively I still like how it presents. I don't often see people use periods in the page name, and this may be the first time I've seen multiple periods in the page name. But since it's valid for page names, and there is no straightforward way to get around the issue short of making option C have some common token, or predefining the file extensions (and hoping that neither will match a page name), maybe putting an asterisk next to it with an explanation of this limitation is the best route. What do you think?

flipzoom commented 1 year ago

Hi @ryancramerdesign,

I've come across this currently, as two projects I maintain have German date format in the URLs. E.g.: /01.01.1970-test-album/.

Option C with versioning in the filename is actually industry standard. Even large CDNs like KeyCDN advise against the query string method. I personally find method C the nicer option as well.

Basically, I would store an info about the problems in the module, but still provide variant C with a "signature token", such as "pcb-[VERSION-NUMBER]". This does not completely avoid the false positive, but it minimizes it many times over. Here you could also adjust the regex for backward compatibility if there are still old paths somewhere in text fields and make the "pcb-" optional.

^(.+)\.(pcb-)?([a-zA-Z0-9]+)\.([^.]+)$

Test: https://regex101.com/r/1HJQ1X/1

^(.+)\.pcb-([a-zA-Z0-9]+)\.([^.]+)$

Test: https://regex101.com/r/NnOABs/1

The preconfiguration of allowed extensions I would not consider as a solution, that is a barrel without Bode, too much administrative effort.

Another alternative would be an option D that does a path versioning, e.g.: with a unique identifier at the beginning of the asset URL. /pcb-[VERSION-NUMBER]/sites/assets/...

^\/pcb-([a-zA-Z0-9]+)\/(.+)\.([^.]+)$

Test: https://regex101.com/r/bWUfZw/1

Wouldn't it be possible here then to use a hook to check the page name and consider /pcb- as reserved if this is at the beginning of the page name?

Last alternative: if Mehtode C is active, the periods are forbidden in the configuration of InputfieldPageName, .=-. But this does not change anything for already existing paths but only for new ones.

So I personally would prefer the variation to extend method C by a unique identifier, even if a HTACCESS adjustment is necessary here afterwards. But this variation minimizes the scenario the most in my opinion.