processwire / processwire-issues

ProcessWire issue reports.
45 stars 2 forks source link

Request URL contains invalid/unsupported characters with UTF-8 page names #1718

Closed gebeer closed 1 year ago

gebeer commented 1 year ago

Short description of the issue

On a multilang install with $config->pageNameCharset = 'UTF8'; I get a 404 when navigating to an URI like /zh/asia/家/

Expected behavior

The page at that URI should be rendered

Actual behavior

The URI results in a 404

Optional: Suggestion for a possible fix

I've been able to track this down to https://github.com/processwire/processwire/blob/6ff498f503db118d5b6c190b35bd937b38b80a77/wire/core/PagesRequest.php#L534 the REQUEST_URI is not decoded. An URL encoded string like %E5%AE%B6 is coming from the browser. In L553 $it then results in E5AEB6 In L562 this is passed to $sanitizer->pagePathNameUTF8() and results in E5AEB6 This results in response code 400 in L566

The page gets displayed when I change L534 to read

            $shit = trim(urldecode($_SERVER['REQUEST_URI']));

See PR https://github.com/processwire/processwire/pull/268 for the suggested solution

Steps to reproduce the issue

  1. Step install PW 3.0.214 with multilang profile
  2. in site/config.php set $config->pageNameCharset = 'UTF8'; and $config->pageNameWhitelist = '';
  3. create language Chinese with name "zh"
  4. create a page under Home with name "asia"
  5. create a child page under Asia with English name 'home', Chinese name '家'
  6. view the page at /zh/asia/家/

Setup/Environment

ryancramerdesign commented 1 year ago

@gebeer I can't duplicate this one. ProcessWire shouldn't be using the REQUEST_URI unless there's something potentially incorrect in your .htaccess file. Can you check your .htaccess file, sections 16A and 16B. Chances are you'll want 16B (and update it as needed) or you may want to comment out both 16A and 16B if you are allowing any characters in your URLs (since you are setting your pageNameWhitelist to blank).

gebeer commented 1 year ago

@ryancramerdesign I can confirm that commenting out 16A and 16B in .htaccess solves the issue on a clean PW install with the multilang profile.

But on the install where I discovered the issue, it still persists, even after commenting out 16A and 16B. I will have to investigate what other code in that install might be causing this.

Will close for now.

gebeer commented 1 year ago

Another question. You said:

ProcessWire shouldn't be using the REQUEST_URI unless there's something potentially incorrect in your .htaccess file.

Do I understand correctly, that REQUEST_URI is not used by PW at all?

We have other working code in that install that relies on REQUEST_URI which means it is available in $_SERVER.

ryancramerdesign commented 1 year ago

@gebeer ProcessWire will use $_SERVER[REQUEST_URI] as a backup, if the .htaccess didn't pass through the URL, like if something is broken or missing in the .htaccess file. What PW uses instead is the REQUEST_URI from Apache (passed through as $_GET[it]), which stays UTF-8 and doesn't get percent/url-encoded in instances I've observed, whereas $_SERVER[REQUEST_URI] does. For the case you are seeing things not work, I would probably check that your .htaccess still has the AddDefaultCharset UTF-8 and that the .htaccess file itself is UTF-8 encoded. If neither of those are it, it might be interesting to open the browser dev tools network tab, click on the main request, view the headers, and see if it has a content-type: text/html; charset=utf-8 header or if it is something else. I'd be curious what you find. Thanks.

gebeer commented 1 year ago

Thanks @ryancramerdesign for the detailed explanations. After investigating this further, I noticed that none of the .htaccess rules got applied in my local ddev setup. Then I noticed that ddev was setup with nginx-fpm instead of apache-fpm. That explains everything. It is embarrasing and I apologize for the noise. I wonder if you or the community could provide all the .htaccess rules as nginx directives. Would be quite useful since a lot of people seem to move from apache to nginx. I will raise this in the forum.

teppokoivula commented 1 year ago

I wonder if you or the community could provide all the .htaccess rules as nginx directives. Would be quite useful since a lot of people seem to move from apache to nginx. I will raise this in the forum.

There are various versions floating around the forum, but it would be nice to have an official version as well.

That being said: even though I use nginx pretty much everywhere nowadays — it's generally speaking easier to work with, though there are caveats, such as lack of separate per-directory conf files — I ran some tests while migrating weekly.pw to a new server a while back and found that Apache performed noticeably better in that context.

Not what I had expected, and it might've had something to do with my specific config. I used one of the community rulesets with a few modifications. Anyway, ended up ditching nginx and going back to Apache for the time being :)

gebeer commented 1 year ago

I wonder if you or the community could provide all the .htaccess rules as nginx directives. Would be quite useful since a lot of people seem to move from apache to nginx. I will raise this in the forum.

There are various versions floating around the forum, but it would be nice to have an official version as well.

That being said: even though I use nginx pretty much everywhere nowadays — it's generally speaking easier to work with, though there are caveats, such as lack of separate per-directory conf files — I ran some tests while migrating weekly.pw to a new server a while back and found that Apache performed noticeably better in that context.

Not what I had expected, and it might've had something to do with my specific config. I used one of the community rulesets with a few modifications. Anyway, ended up ditching nginx and going back to Apache for the time being :)

Thanks for the input. Sometimes we as devs might not be able to decide that. So official config would be great. EDIT: that would be a canditate for a feature issue or wishlist/roadmap.

@LostKobrakai 's gist seems to be the most up to date community version: https://gist.github.com/LostKobrakai/b895e2e0e8a2c14b4da88cc7e16cf954

netcarver commented 1 year ago

Perhaps the official .htaccess file could set an environment variable, say PW_USING_HTACCESS, and PW's code could then check if it's set using getenv() and at least warn about possible .htaccess misconfigurations if it is not set? Maybe have a way to skip the check by a setting in config.php in case systems are intentionally not using .htaccess.

.htaccess

<IfModule mod_env.c>
    SetEnv PW_USING_HTACCESS 1
</IfModule>

site/config.php

// skipHtaccessCheck defaults to false. Uncomment the following if you are not using Apache .htaccess files.
//$config->skipHtaccessCheck = true;

PW bootstrap...

if (!$config->skipHtaccessCheck && !getenv('PW_USING_HTACCESS') {
    // Issue warning to admin user
}

This would, of course, require mod_env to be installed for apache.

ryancramerdesign commented 1 year ago

@teppokoivula @gebeer An official nginx equivalent of the .htaccess file would be nice to have, no doubt. But I don't know enough about nginx in order to code or maintain it. The details of the .htaccess file are relatively important in maintaining the flow of what gets sent to ProcessWire, and also what is blocked, so it's one of those things that would require someone really up-to-speed with both nginx and apache to code and maintain properly. Though we change the .htaccess so rarely that it's probably easy to maintain long term, once there. If there's someone that fits that criteria that's willing to do that, I'd be happy to have our docs point people to it, or whatever makes it "official".

@netcarver We are already setting an environment variable in section 8B of our .htaccess file. Maybe this could also be used for the purpose you mentioned? The reason we have it currently is so that PW can detect if mod_rewrite is working properly during installation. The reason I think we'd want to use something that's already there in the .htaccess is that it wouldn't require one have the latest version of the .htaccess file in order to perform this check, avoiding lots of false positives.

  <IfModule mod_env.c>
    SetEnv HTTP_MOD_REWRITE On
  </IfModule>
netcarver commented 1 year ago

@ryancramerdesign Yes, that makes sense.