PagePaths module breaks being able to implement `path()` on a Page Class

Short description of the issue

If a Page class implements the public function path() :string method to return a different url for the page, the PagePaths module will, in certain situations, cause the page not to be loadable by its URL.

Expected behavior

Be able to load a page by visiting $page->url

Actual behavior

Get 404

Specifics

The PagePaths module has two ways to generate the URL.

The correct way, using the pages API ($path = $page->path()). This is used in updatePagePaths()
A shortcut way that imposes an assumption that a page's URL is it's parent page's URL + /child-page-name. This is used in updatePagePathsChildren()

This causes a really confusing situation: edit + save a page (that implements path()) and its URL is likely reachable. Edit + save a parent page and all its children go 404.

A possible fix

ReplaceupdatePagePathsChildren() with

protected function updatePagePathsChildren($pageId, array $paths) {
  $numUpdated = 0;
  foreach ($this->pages->findMany("parent=$pageId,include=all") as $childPage) {
    $numUpdated += $this->updatePagePaths($childPage);
  }
  return $numUpdated;
}

Steps to reproduce the issue

Create a template called Blah. Configure family so this page can be used as a child to some other tpls.

Create a Page Class file for Blah and implement

public function path() {
   return '/some-path-does-not-match-parent/' . $this->id;
}

Create a template file for the page, can just echo "hi".
Create a child page of some existing page using this template.
Save +View the page: it should show the /some-path-does-not-match-parent/123 path and work ok
Make a change that would cause PagePaths to regenerate paths for child pages. Alternatively just do modules('PagePaths')->rebuild();
Now try to visit the page at its URL: 404

Setup/Environment

Server Details

Software	Version
ProcessWire	3.0.229
PHP	8.1.23
Webserver	nginx/1.22.1
MySQL Server	10.11.6-MariaDB-0+deb12u1
MySQL Client	mysqlnd 8.1.23

Server Settings

|Parameter|Value| |------:|:-------| |allow_url_fopen|1| |max_execution_time|5 (changeable)| |max_input_nesting_level|64| |max_input_time|60| |max_input_vars|1000| |memory_limit|128M| |post_max_size|50M| |upload_max_filesize|20M| |xdebug|| |xdebug.max_nesting_level|256| |mod_rewrite|| |mod_security|| |EXIF Support|1| |FreeType|1|

GD Settings

|Parameter|Value| |------:|:-------| |Version|2.3.3| |GIF|1| |JPG|1| |PNG|1| |WebP|1|

iMagick Settings

|Parameter|Value| |------:|:-------| |Version|6.9.11| |GIF|1| |JPG|1| |PNG|1| |SVG|1| |PDF|1| |WebP|1|

Module Details

|Module ClassName|Version| |------:|:-------| |AddImageUrls|0.3.0| |EditablePublishedDate|0.1.0| |FieldtypeDynamicOptions|0.1.7| |FieldtypePageTableNext|2.0.3| |FileValidatorSvgSanitizer|0.0.5| |InputfieldPageTableNext|2.0.3| |MarkupPnpNav|0.1.0| |MarkupSocialLinks|0.0.1| |PageListImageLabel|2.0.1| |PageTableNext|2.0.3| |ProcessMenuCreator|0.1.0| |ProcessPnp|0.0.1| |ProcessSwitchUser|0.0.6| |ProcessTagHerder|0.0.1| |ProcessWirePublishDate|1.2.0| |ProcessWireUpgrade|0.1.1| |ProcessWireUpgradeCheck|0.0.9| |Sifter|0.1.0| |TextformatterInlay|1.0.0| |TextformatterTweetEmbed|0.1.0| |TextformatterVideoEmbed|2.0.2| |TextformatterVideoEmbedOptions|0.4.0| |TracyDebugger|4.25.11|

@artfulrobot I think what's missing here is a handler for your custom URLs. You can override the path method (and you should override __path() not path()) but that is just overriding the URL at runtime, so ProcessWire isn't going to know that /hello/world/ should load /some/page/ unless you have something to detect /hello/world/ and then tell ProcessWire to serve /some/page/. So technically /hello/world/ shouldn't work at all in this situation. My best guess is that maybe it got cached in PagePaths and appeared to work temporarily. But PagePaths is essentially a cache, and not the source of URLs. So in addition to your custom ___path() method, you would need this in your /site/init.php file:

$wire->addHook('/hello/world/', function($event) {
  return $event->pages->get('/some/page/'); 
});

That's a really simple example. See the dedicated URL/path hooks post for more details.

An alternative to URL/path hooks is to enable URL segments on your 'home' template. Then you can capture those URL segments like this by editing your /site/templates/home.php template and adding a handler at the top:

if($input->urlSegmentStr === 'hello/world') {
  // url segment string that we recognize 
  $page = $pages->get('/some/page/');
  echo $page->render();
  return $this->halt();

} else if($input->urlSegmentStr) {
  // some other URL segments that we do not recognize
  wire404();
}

// serve regular homepage content here

@ryancramerdesign thanks for your detailed response.

I might have missed something - I've been round the houses on this one, it's been a long journey

The problem with the URL hook (which is what I tried first), is that the URL picker doesn't (can't) understand it. So the wrong (i.e. internal) links get generated while the URLs we want cannot be found. Now, the internal generated links could work with a 302 redirect, but that generates loads of 302s, which is not great for anyone.

The problem with url segments is that the link QA doesn't work - it sees that the tpl handles segments and it gives up checking, assuming it's fine.

It seems there's some places where pagepaths is used and some where it's not.

I've found the above fix to PagePaths to work ok: the cache is populated with the correct (desired) URLs; the pages give the desired URLs from $page->url; the URLs work to load the page; the URLs can be searched; the link QA works.

I'll try swapping paths() for ___paths() though.

@artfulrobot What do you mean by the URL picker that doesn't work? I'm not sure I know specifically what you are talking about.

This ProcessWire powered site uses custom URLs like described above: https://www.biketours.com ... pretty much none of the URLs on the site match their page tree in the admin. This was so that the URLs could be maintained from the former site, which wasn't running ProcessWire. But we wanted better organization in the admin side, even if not reflected in the URLs. I just checked and looks like we are using MarkupQA on some fields and not others. Though we have PagePathHistory installed so I'm not sure it matters.

Regarding the updatePagePathsChildren() method, my mind isn't in that code enough to grasp it at the moment, but if I recall correctly, this was a case where it needed something that went straight to the database to be practical with the quantities it could potentially deal with. There are some installations out there with hundreds of thousands or millions of pages, where even findMany() might not be practical. But your version sure is nicer to look at. Maybe I could have it count ahead of time and decide what to use. Or could make the method hookable, so that you could easily replace the logic of the method with your own, and have it continue to work through version upgrades.

With PagePaths not intended to be the authoritative source of URLs, I'm not sure exactly how it's all working in your case, or how fragile it might be across different situations. But if it all seems to work without issue, then that's good to hear, perhaps it's a safer solution than I originally guessed.

What do you mean by the URL picker ...

I am referring to this widget that lets you search pages by URL

With my set up:

rendering a page object's url property gives the desired URL. (This is important for SEO and having human-readable URLs.)
requesting the desired URL in a browser loads the page, via PagesLoader using PagePaths table
the URL picker (screenshot ↑) is able to search and suggest pages by the desired URL
the MarkupQA does not report a broken link when the link is valid
the MarkupQA does report a broken link when the link is invalid
the page is found with pages('the-desired-url') (via PagesPathFinder)
the page is not found with pages('the/system-url')
I can change the URL and the (very impressive) MarkupQA updates the URL of saved pages to the new one; and all the above still works with the new URL.

All this is good!

Implementing a URL/path hook would mean that it would need another source of desired-urls-to-pages lookup; Perhaps on a hidden field on all pages, or in a special db table. And I don't think URL hooks are consulted on tools like the MarkupQA checker (e.g. you code calls halt() - wouldn't want that while checking a URL!): so it (a) won't offer the desired URLs and (b) will complain that the URL doesn't exist when it does.

But I hear your concern (I'm hearing "you're making it work a different way to how it was designed to work!").

Re your bikers website, it's hard to tell from the front end what's supposed to be what. e.g. I note that when on
https://www.biketours.com/how-we-help/our-team/
I might expect the liv-gray page to be:
https://www.biketours.com/how-we-help/our-team/liv-gray/
but it's actually at
https://www.biketours.com/info/how-we-help/our-team/liv-gray/

I realise this may be deliberate.

processwire / processwire-issues