Closed brendanheywood closed 12 years ago
As far as I can see there are two main benefits to this: SEO, and other websites being more likely to link to it with a human readable url (again SEO).
For this reason I would favor getting to the
http://www.thecrag.com/arapiles
style url for major world crags.
Using the scheme above, arapiles just becomes a 'level 1' canonised url, whereas gara gorge might only ever become a 'level 2' canonised url and remain as /australia/gara-gorge/. Either way the scheme handles it and this is just a data entry thing.
google webmaster docs http://static.googleusercontent.com/external_content/untrusted_dlcp/www.google.com/en//webmasters/docs/search-engine-optimization-starter-guide.pdf
On pages 8-10 they touch on this briefly but either of the schemes would be fine. For me the question is whether the /australia/ path adds value and whether it takes anything away.
http://www.stephanspencer.com/matt-cutts-interview/
Here Matt talks about urls with fewer than 5 words being favoured by both search engines and users. So both options look good under this question.
Across the board this seems like a clear victory for '-' as the preferred delimiter. No brainer
/arapiles
Pros:
Cons:
/australia/arapiles
Pros:
Cons:
So both are good options, I'd prefer the latter as the default but still accept the first if it was typed in. I'm happy either way. Both are a major step up from where we are. Campbell?
Adam has reiterated how important this is for SEO. From his experience this is now one of the biggest things in getting your search results up. I will try and prepare the back end for next release.
Some implementation notes.
If I understand things correctly we need to add a URLSlug table to the db with fields something like:
Arapiles may have 4 records:
11740915,australia/vic/north-west/arapiles,0,0
11740915,australia/vic/arapiles,0,0
11740915,australia/arapiles,1,0
11740915,arapiles,0,0
So each of these when present in the url will map to arapiles, but internally we will know which is the canonised version by the flag. If we want to keep the slug but return not found then we can set the not found flag.
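A minimal sketch of the table and lookup described above, using Python with an in-memory SQLite database. The table and column names (url_slug, node_id, slug, canonised, not_found) are assumptions inferred from the four values in each example record, not an agreed schema.

```python
import sqlite3


def build_db():
    """Create a hypothetical url_slug table holding the four Arapiles records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE url_slug ("
        " node_id INTEGER NOT NULL,"
        " slug TEXT PRIMARY KEY,"
        " canonised INTEGER NOT NULL DEFAULT 0,"
        " not_found INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO url_slug VALUES (?, ?, ?, ?)",
        [
            (11740915, "australia/vic/north-west/arapiles", 0, 0),
            (11740915, "australia/vic/arapiles", 0, 0),
            (11740915, "australia/arapiles", 1, 0),
            (11740915, "arapiles", 0, 0),
        ],
    )
    return conn


def resolve(conn, slug):
    """Return (node_id, canonised_slug), or None if unknown or flagged not found."""
    row = conn.execute(
        "SELECT node_id, not_found FROM url_slug WHERE slug = ?", (slug,)
    ).fetchone()
    if row is None or row[1]:
        return None
    canonical = conn.execute(
        "SELECT slug FROM url_slug WHERE node_id = ? AND canonised = 1", (row[0],)
    ).fetchone()
    # Fall back to the requested slug if no record carries the canonised flag.
    return row[0], (canonical[0] if canonical else slug)
```

Any of the four slugs resolves to the same node id, and the caller can compare the requested slug against the canonised one to decide whether to redirect.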
Theoretically this could work to any level in the index, but stopping at top level crag seems as good as any. Actually maybe we should stop at bottom level crag to capture say 'summerday valley' in grampians (ie 'australia/vic/north-west/grampians/north-grampians/summerday-valley' plus other variants).
An administrator should have additional permissions to edit these.
We can write a process to initialise slugs based on number of routes in top level crags. Thereafter, manual management.
Merging will combine tags, with the canonised flag being reset for the merged entry.
Reparenting will add an additional slug, keeping the old ones.
Adding a new crag will automatically create a slug if it is top level crag/ bottom level crag. Multiple url slugs need to be managed manually.
Changing the area type does nothing, you will have to manage things manually.
Along with the node id I will return the canonised url slug.
Some template changes to get internal linkages working.
We will have to make a list of reserved words that cannot be used in url slug segments (eg route or json).
What about using the word 'climbing' in the first part of the url as this is a big google search term and will also be a trigger for the reverse slug lookup. For example:
www.thecrag.com/climbing/arapiles
or even
www.thecrag.com/rock-climbing/arapiles
Use the text between the reserved words for the lookup
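As a rough sketch of "use the text between the reserved words", assuming a small reserved word set (only 'route', 'json' and the 'climbing' variants are named in this discussion; 'area' is added here as an assumption) and hypothetical function names:

```python
# Assumed reserved words; the real list would need to be agreed and maintained.
RESERVED = {"climbing", "rock-climbing", "route", "area", "json"}


def lookup_slug(path):
    """Return the slug text that sits between reserved words in a url path."""
    segments = [s for s in path.strip("/").split("/") if s]
    # Drop a leading reserved prefix such as 'climbing'.
    if segments and segments[0] in RESERVED:
        segments = segments[1:]
    slug = []
    for segment in segments:
        # Stop at the next reserved word, e.g. 'route' followed by an id.
        if segment in RESERVED:
            break
        slug.append(segment)
    return "/".join(slug)


def is_allowed_segment(segment):
    """A slug segment may not be one of the reserved words."""
    return segment not in RESERVED
```

With this shape, /climbing/australia/arapiles and /australia/arapiles both yield the same lookup text, which is what lets the old and new forms coexist during a transition.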
What happens if you use The Bard's route id for a url slug referencing the blue mountains?
Need to transition slowly so the old urls should continue to work. Another reason why I would like to put the word 'climbing' in.
A lot of this is tied up with your url routing code which I don't know anything about so can't comment. I would err on the side of making it simpler; we are only ever going to link to the canonical url so I don't think tracking all the others adds much value. I also don't think tracking ones that are disabled adds much value either.
I'm in two minds about adding 'climbing' into the url. It would only be significant for google if we made it the canonical url and this is what everyone saw for every page. When I first saw it I kinda hated the idea but I'm warming to it. One thing that doesn't quite sit well with me is the different climbing styles. eg this doesn't make sense:
/rock-climbing/canada/banff
For a google search these might make more sense and be more relevant:
/bouldering/australia/gara-gorge
/ice-climbing/canada/banff
I think 'climbing' covers all of these but 'rock-climbing' doesn't. If I looked at these urls I'd expect them to be filters like the facet urls. So basically I'd stick to just 'climbing'.
For reserved words, absolutely. There was a great github article about this (I can't find it) which showed their list of reserved project names that they'd kept up their sleeves for the future, like 'donate', 'book', 'blog'. This probably doesn't matter as much as we are in control of all the top level anyway.
For reverse lookup, don't really understand what you mean.
For inconsistent route ids, I don't think it really matters. I'd be happy to throw a 404 and stop processing. #121. This should only happen if someone goes out of their way to craft a url, or if a route gets moved between crags.
Keep old id's working - yes absolutely. The way I see it progressing is:
I can see this happening over 1, 2 or even 3 releases. 3 releases is probably the safest.
Possibly way down the track try and wean people off the id version by serving permanent redirects instead of internal forwarding.
Just my 2cents, i think that the /climbing/ in the URL is a great idea.
when i search i am typing in "climbing switzerland zug" or "climbing guide switzerland..."
so the climbing will strengthen the search IMO
Yeah lets do it:
/climbing/australia/arapiles
I have done an automated mapping script with the following logic:
Regions above country (ie continent) => /continent
Country => /country
Regions deeper than country (ie state) => /country/state
TLC => /country/tlc
Crags deeper than TLC => /country/tlc/crag
I have avoided clashes by concatenating parent name so for clashes we get one of the following:
/country/parent-tlc
/country/tlc/parent-crag
Which pretty much avoids all the clashes (except for about 3, which I have done some manual changes to the index because they were dups).
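The clash handling can be sketched roughly as below. This is not the actual mapping script: the function name, the input tuple shape, and the choice to rename only the later of two clashing nodes are all assumptions, and the node names in the usage are made up.

```python
def assign_slugs(nodes):
    """Assign a slug per node, falling back to parent-name on a clash.

    nodes: list of (node_id, prefix, name, parent), all already slug-safe.
    Returns ({node_id: slug}, [node_ids needing manual review]).
    """
    slugs = {}
    taken = set()
    unresolved = []
    for node_id, prefix, name, parent in nodes:
        # First choice is prefix/name; on a clash try prefix/parent-name.
        candidates = [f"{prefix}/{name}", f"{prefix}/{parent}-{name}"]
        slug = next((c for c in candidates if c not in taken), None)
        if slug is None:
            # Still clashing after concatenating the parent: leave for manual fixing.
            unresolved.append(node_id)
            continue
        taken.add(slug)
        slugs[node_id] = slug
    return slugs, unresolved
```

Anything left in the unresolved list corresponds to the handful of cases that needed manual changes.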
This gives us about 11k url slugs. Once this is done I think we should set up robots.txt so that google will only index these 11k urls. We currently have a load of over 200k urls that google is indexing. Making this smaller will reduce the background load on the site. So google will only index the following clean urls from the index: www.thecrag.com/climbing/australia/arapiles ... (ie anything with /route/ or /area/ are not indexed).
I also propose that where we have the following url which references a climb/area id like www.thecrag.com/climbing/australia/arapiles/route/1234 www.thecrag.com/climbing/australia/arapiles/area/6789 we save some cpu cycles and ignore the stub and just process the 1234 id. The page can return the canonical version of the url if the stub is wrong. This avoids the problem of how to deal with errors referencing a blue mountains route in arapiles and also avoids the problem of routes being reparented from one crag to another with third party sites referencing the url from the earlier crag.
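A small sketch of that proposal: parse only the trailing id, ignore whatever stub precedes it, and hand back the canonical url so the page can emit it. The lookup dict stands in for a database query and the 1234 mapping is purely illustrative.

```python
import re

# Hypothetical lookup of the nearest url stub by node id; a real version would query the db.
CANONICAL_STUB = {1234: "australia/arapiles"}

# Matches /climbing/<any stub>/route/<id> or /climbing/<any stub>/area/<id>.
ID_URL = re.compile(r"^/climbing/(?P<stub>.*)/(?P<kind>route|area)/(?P<id>\d+)$")


def resolve_id_url(path):
    """Return (node_id, canonical_path) using only the id; the stub is not validated."""
    match = ID_URL.match(path)
    if match is None:
        return None
    node_id = int(match.group("id"))
    stub = CANONICAL_STUB.get(node_id)
    if stub is None:
        return None
    return node_id, f"/climbing/{stub}/{match.group('kind')}/{node_id}"
```

A url that names the wrong crag still resolves, and the returned canonical path differs from the requested one, which is the signal to correct it.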
I think the maximum URL stub limit should be 80 characters. I did an experiment with 50 and that was too short for a lot of cases in the automated mapping script.
Cool.
I think we should err on the side of caution with google. If google indexes us completely roughly once a month then we should allow plenty of time for it to index us properly with the new urls and remap all of the old ones in its internal db.
The worst case scenario is if we tell it to not index /area and /route, and google doesn't yet know about all the new stuff and as a result all our pages just don't show up in searches any more.
That's why I think we should release with the new urls in place, but with their canonical urls pointing to the current ones. Then when we're happy, change the canonical url in the html head tags to the new urls and rewrite all the internal links to the new urls, then release and give google a good month or so to index and remap all that. Then, and only then, once we know google is correctly using the new urls in search results, can we mess about with the robots and remove old urls.
There are also more robust ways of dealing with googlebot than robots.txt. Once we are fully confident with the new scheme we'd serve permanent redirects from the old /area pages to the new slugs; google will pick them up and after that never visit the old urls again.
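The staged rollout could be modelled roughly like this. The phase names and the respond function are hypothetical; the point is only that each stage is a switch, not a code change.

```python
from enum import Enum


class Phase(Enum):
    NEW_URLS_ACCEPTED = 1    # new urls work, canonical link still points at the old url
    NEW_URLS_CANONICAL = 2   # canonical link and internal links use the new url
    OLD_URLS_REDIRECTED = 3  # old urls answer with a permanent redirect


def respond(phase, requested, old_url, new_url):
    """Return (status, url): a 301 target, or the canonical url to put in the html head."""
    if phase is Phase.OLD_URLS_REDIRECTED and requested == old_url:
        return 301, new_url
    canonical = old_url if phase is Phase.NEW_URLS_ACCEPTED else new_url
    return 200, canonical
```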
One thing that would go an awfully long way is fixing the way the cids cgi script works so that after it has finished a process it returns to the correct url instead of a gobbledygook one with lots of params. It is these urls that people cut and paste into forums, which break because they have session / user specific stuff in them; it messes up google analytics because it doesn't know what page it really is, and they just look messy.
cheers Brendan
I agree with that 3 phase process. BTW, have we created an issue for the confusing return url so it does not get lost? So many high priority issues.
I have mucked around with redirects to the new canonical urls so
http://dev.thecrag.com/area/11740915
will redirect to
http://dev.thecrag.com/australia/arapiles
It's currently turned off via a flag. But when we are ready we can just turn it on.
Firstly I want to confirm that the reason why we want to redirect rather than simply internally route the url is only for google SEO? If this is the case we can turn the flag on when we are ready, then after Google has captured all the old url redirects turn the flag off, as we don't need unnecessary redirects.
Secondly I want to confirm that we only want to do this redirect for list view. In other words now that we have a loose SEO google strategy we should be focusing all our SEO efforts onto the 11k list view urls we have just created. I don't even think we want guide view to be indexed in google.
Once we are happy with the new urls AND all the template links have been fixed then the only people who will ever get to an old style url is from bookmarks, links in existing forum posts, etc. So redirecting these will only be a small amount of traffic. I see redirecting them permanently as just good housekeeping.
So the question is do we redirect:
http://dev.thecrag.com/area/25938312/guide to http://dev.thecrag.com/climbing/australia/gara-gorge/area/25938312/guide
I think yes, again it's just good house keeping. The redirects aren't just for google, they are just as much for the humans as well. If we get all the template links right then people should never see the old url's and so never see the redirects. But if someone does get there from an old forum posting I think it add's value to show them the new better url. If they cut and paste it again then they are passing on the good one instead of perpetuating the old one. This will end up being a really small fringe case anyway.
cheers Brendan
So you like a tidy house, well so does my wife, always nagging cleaning the bathrooms :)
Taking this a bit further if we get an old style route url:
http://dev.thecrag.com/route/11967355 # bard
we will have to look up the ancestor hierarchy to find the nearest url stub and redirect to:
http://dev.thecrag.com/climbing/australia/arapiles/route/11967355
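A sketch of the ancestor walk, assuming a simple parent map. The node ids are the ones mentioned in this thread, but the parent chain shown (route under Bard Buttress under Arapiles) and the dict names are illustrative assumptions.

```python
# Hypothetical data: node id -> parent id, and node id -> url stub where one exists.
PARENT = {11967355: 11764783, 11764783: 11740915, 11740915: None}
STUB = {11740915: "australia/arapiles"}


def nearest_stub(node_id, parent=PARENT, stub=STUB):
    """Walk up the ancestors until a node with a url stub is found."""
    current = node_id
    while current is not None:
        if current in stub:
            return stub[current]
        current = parent.get(current)
    return None


def route_redirect(route_id):
    """Build the redirect target for an old style /route/<id> url."""
    found = nearest_stub(route_id)
    return None if found is None else f"/climbing/{found}/route/{route_id}"
```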
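Two more questions about template data:
- the 'url' in the template should be the canonical url. Do we want to commit to this now before the release or do you want to do some testing first?
- the template data variables now return one of two variables 'urlStub' and, if that does not exist, 'urlAncestorStub' - being the nearest ancestor with a url stub. This should be enough to build area and route urls in the template. Does this seem reasonable?
If so should I start the template recoding for this release? I think yes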
On Fri, Aug 17, 2012 at 4:09 PM, Simon Dale wrote:
So you like a tidy house, well so does my wife, always nagging cleaning the bathrooms :)
Taking this a bit further if we get an old style route url:
http://dev.thecrag.com/route/11967355 # bard
we will have to look up the ancestor hierarchy to find the nearest url stub and redirect to:
http://dev.thecrag.com/climbing/australia/arapiles/route/11967355
yes
Two more questions about template data:
- the 'url' in the template should be the canonical url. Do we want to commit to this now before the release or do you want to do some testing first?
We have over a month for the next release so plenty of time to test it before then.
- the template data variables now return one of two variables 'urlStub' and, if that does not exist, 'urlAncestorStub' - being the nearest ancestor with a url stub. This should be enough to build area and route urls in the template. Does this seem reasonable?
If so should I start the template recoding for this release? I think yes
yes, we should just have a linkNode template which is the only place with the logic or even tuck it away into a pm which that template uses. The less manually crafted urls in the templates the better
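To illustrate the single place for url logic idea, here is a hedged sketch in Python of what such a helper might do with 'urlStub' and 'urlAncestorStub'. The function name link_node and the 'id' and 'type' keys are assumptions; the real thing would live in a template or Perl module.

```python
def link_node(node):
    """Build the index url for a node from its template data.

    node: dict with 'id', 'type' ('area' or 'route'), and either
    'urlStub' or 'urlAncestorStub' (nearest ancestor with a stub).
    """
    if node.get("urlStub"):
        return f"/climbing/{node['urlStub']}"
    ancestor = node.get("urlAncestorStub")
    if ancestor:
        return f"/climbing/{ancestor}/{node['type']}/{node['id']}"
    # No stub anywhere up the tree: fall back to the old id style url.
    return f"/{node['type']}/{node['id']}"
```

Keeping this in one helper means a later change to the url scheme touches one function instead of every template.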
cheers Brendan
In repo you have some redirect stuff to play with and test. I have not touched the templates yet. It's actually pretty cool to see it in action.
This brings up another thing with the hierarchy. Down to Little Thor level it looks like this:
http://dev.thecrag.com/climbing/world
http://dev.thecrag.com/climbing/australia
http://dev.thecrag.com/climbing/australia/victoria
http://dev.thecrag.com/climbing/australia/victoria/area/140637204 # North West
http://dev.thecrag.com/climbing/australia/arapiles
http://dev.thecrag.com/climbing/australia/arapiles/declaration-crag-area
http://dev.thecrag.com/climbing/australia/arapiles/declaration-crag
http://dev.thecrag.com/climbing/australia/arapiles/declaration-crag/route/11960515 # Little Thor
This is exactly how we defined it according to our discussed business rules, but it helps validate our discussion by actually seeing it in action. Is the above hierarchy something that you guys expected? Are there any issues with it not being strictly hierarchical now that you see it?
Also if we want to tell Google to index from world to lowest level crag and no lower then we are going to have to change the canonical url of "/australia/victoria/area/140637204" to something like "/australia/victoria/region/140637204" so google can follow the links to find all our 11k url stubs. In other words urls with 'area' and 'route' will not be followed, but urls with 'region' and 'crag' will. Does this make sense?
I'd probably aim for all regions to have slugs. I think this:
http://dev.thecrag.com/climbing/australia/victoria/north-west
clearly wins against: (shorter, clearer)
http://dev.thecrag.com/climbing/australia/victoria/area/140637204
Other than that it's all good. The declaration area threw me a little until I realised this is two nodes.
I guess the only odd one out is the route node id and I'm happy with that as a pragmatic compromise.
Also if we want to tell Google to index from world to lowest level crag and no lower then we are going to have to change the canonical url of "/australia/victoria/area/140637204" to something like "/australia/victoria/region/140637204" so google can follow the links to find all our 11k url stubs. In other words urls with 'area' and 'route' will not be followed, but urls with 'region' and 'crag' will. Does this make sense?
No? Why would we tell google not to index an area or route under a TLC? That is shooting ourselves in the foot. We shouldn't have a blanket rule to not index all urls that contain /area/ - it should just be the old style urls that start with /area/ and maybe not even that. We shouldn't be telling google not to index things, we should be telling google that they have moved permanently, which is different. In the former a search result, and its relevance, will just disappear, and in the latter nothing will change except its location; it retains its relevance in google's eyes.
I've also entertained the idea of putting the node type into the url, but this means the url will change when the type changes which doesn't sit well with me. If we changed the type we should automatically put in an extra alias slug. ie as urls get moved around we should track the history and honour them all.
Also just read this:
http://www.seomoz.org/blog/uncrawled-301s-a-quick-fix-for-when-relaunches-go-too-well
Makes me even surer that we need to transition slowly (or as slow as google) between the two url structures, and watch google analytics each step of the way.
cheers Brendan
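Couple of business decisions:
- Do we want guide view indexed in Google?
- Do we want routes indexed in Google? The main reason why not for both of these is that indexing them creates a bigger background crawler load. I could go either way.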
Have we got a site map, it sounds as though we need one after reading that article?
We should go cautiously, but I also want to make sure all the coding is done so we can just switch flags at each stage.
BTW, there is no reason why we cannot honor all area types in the url. If necessary we can do a redirect to the right one so
http://dev.thecrag.com/climbing/australia/victoria/area/140637204
and
http://dev.thecrag.com/climbing/australia/victoria/region/140637204
would find the same url. If we are going to do the redirect to the right one then there is an extra db lookup.
I also agree that
http://dev.thecrag.com/climbing/australia/victoria/north-west
looks better and can be configured manually for each region node. The argument is that we should probably try harder to automate all these in the set script?
However we are not going to catch them all so we need to agree that the fallback structure is what we want. The same issue applies to cliffs below crag level. Eg Bard Buttress:
http://dev.thecrag.com/climbing/australia/arapiles/area/11764783
I'd say that at the very least we'd want google to index all grade icons and all 3 star routes and maybe all starred routes. I think it's pretty likely that people will search for 'Kachoong', 'el cap' etc.
Campbell
On Sat, Aug 18, 2012 at 8:29 AM, Simon Dale wrote:
Couple of business decisions:
- Do we want guide view indexed in Google?
no
- Do we want routes indexed in Google? The main reason why not for both of these is that indexing them creates a bigger background crawler load. I could go either way.
double yes.
Have we got a site map, it sounds as though we need one after reading that article?
No, it would be a very big site map. If we do this it's probably worth making this a template in its own right
We should go cautiously, but I also want to make sure all the coding is done so we can just switch flags at each stage.
BTW, there is no reason why we cannot honor all area types in the url. If necessary we can do a redirect to the right one so
http://dev.thecrag.com/climbing/australia/victoria/area/140637204
and
http://dev.thecrag.com/climbing/australia/victoria/region/140637204
would find the same url. If we are going to do the redirect to the right one then there is an extra db lookup.
I also agree that
http://dev.thecrag.com/climbing/australia/victoria/north-west
looks better and can be configured manually for each region node. The argument is that we should probably try harder to automate all these in the set script?
yup
However we are not going to catch them all so we need to agree that the fallback structure is what we want. The same issue applies to cliffs below crag level. Eg Bard Buttress:
http://dev.thecrag.com/climbing/australia/arapiles/area/11764783
Should we open the can of worms and try and figure out a simple url structure that works for all areas and routes inside a tlc as well? I'm very happy with progress so far and happy to leave this for a few releases.
cheers Brendan
OK we are going to keep all index view nodes (inc. crags, areas, cliffs and routes) in the google index. And it looks like everybody is pretty happy with going out with the framework we have devised.
I have just done a massive task auditing all the templates and making sure all instances of index urls are built through either a template function or a web helper perl function. There were many changes to many files. I tested as I went through, however there is a possibility that I missed something.
I'm going to keep the redirect off for a little bit while we test the templates are producing the correct urls. When the redirects are in place they will hide errors in the templates.
Brendan, I think you are away so when you get back please pull from repo so that we can all see the templates pushing out the right urls. Actually if I get a chance I will probably just update dev.
I'm not going to do anymore on this for the next release, unless you find testing issues. I will turn on redirects a couple of days before the release so we can test that too. I will need to turn off redirects during the release then turn it back on after all the release post processing is complete.
Is simple but quite a bit of work. We've discussed in various places but there doesn't seem to be a dedicated issue so here it is.
End game is a human readable, seo friendly url like:
thecrag.com/australia/arapiles
Design principles:
We've talked a lot about TLC namespace issues and the possibility of a 'graduation' process where a TLC name is allocated. Seeing as we already use the terminology of a canonical link I'm going to coin the term canonisation for when a node gets a reserved 'slug'.
My regurgitated thoughts and memories of what we've previously discussed + a lot more ideas fleshed out:
World
World node should go to DONE
/world
Continent
All continent regions should go to their slug - very safe and will almost never change / clash
/asia
/europe
Continent regions
Not sure if we still need these but regions under a continent but not yet a country would be top level:
/asia/south-east-asia
Country
All countries should go to their slug without the continent slug. Countries are unique between each other so can be safely used as the top level paths, removing the continent level slugs.
/australia
/new-zealand - all spaces and funky chars replaced with -
/usa
Regions
If a region has a 'short' alternate name then use it instead. Short names for regions should probably only be admin editable as a result (what are the current rules for this?). All regions within a country would be nested under country
/australia/nsw
/australia/qld
/usa/ca
/usa/ny
If a region has a 'short' alternate name then use it again. (ditto check the perms for editing short name)
If regions are nested then keep stacking them up:
/australia/nsw/northern-tablelands
/australia/nsw/sydney
Top level crags
All crags can always be uniquely identified by a qualified url like
/australia/nsw/northern-tablelands/gara-gorge
I'll call this a 'TLC base path'
It is worth pointing out that at this level every url so far is guaranteed to be unique, mostly because the regions are controlled by us and the clashes are avoided by nesting.
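The rules so far (drop the continent once there is a country, prefer short names, stack nested regions, replace spaces and funky chars with -) could be sketched as below. The function names and the ancestor dict shape are assumptions, and the slugify here is ASCII only, so it sidesteps the unicode question entirely.

```python
import re


def slugify(name):
    """Lowercase and replace runs of non-alphanumeric chars with '-' (ASCII only)."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def base_path(ancestors):
    """Build the qualified url path for a node.

    ancestors: list of dicts from continent down to the node itself,
    each with 'type', 'name' and an optional 'short' alternate name.
    """
    has_country = any(node["type"] == "country" for node in ancestors)
    # Without a country (continents and continent regions) keep every level.
    keep = not has_country
    parts = []
    for node in ancestors:
        if node["type"] == "country":
            keep = True  # everything above the country is dropped
        if keep:
            parts.append(slugify(node.get("short") or node["name"]))
    return "/" + "/".join(parts)
```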
Areas under the TLC
Nodes inside the TLC. If we continue as above the urls for a cliff or boulder will get pretty large. Once we get below the TLC level, additional nodes will not be added to the url, as their importance drops off quickly. A simple compromise solution is just the TLC base path and the current node id:
Example mappings inside a TLC:
Gara Gorge > Upper Gara Gorge (crag) http://www.thecrag.com/australia/nsw/northern-tablelands/gara-gorge/8040888
Gara Gorge > Upper Gara Gorge > Central Boulders (area) http://www.thecrag.com/australia/nsw/northern-tablelands/gara-gorge/25938312
Gara Gorge > Upper Gara Gorge > Central Boulders > T-crack Boulder (boulder) http://www.thecrag.com/australia/nsw/northern-tablelands/gara-gorge/27283101
Gara Gorge > Upper Gara Gorge > Central Boulders > T-crack Boulder > Nose (boulder problem) http://www.thecrag.com/australia/nsw/northern-tablelands/gara-gorge/27283821
This way urls are limited and never get too long but the key location slugs are still visible.
Routes
There is some value in knowing from the url whether you are looking at a route or area but I don't think it is critical. So a url like the 'nose' above should be fine. If we still want the 'route' url path then so be it:
http://www.thecrag.com/australia/nsw/northern-tablelands/gara-gorge/route/27283821
other url suffixes
All pages that currently sit under the area's will continue to do so ie:
http://www.thecrag.com/australia/nsw/northern-tablelands/gara-gorge/25938312/topos
Pages like the facets will not change at all and keep the id. They have long enough urls as is. If we felt nice we could add a dummy chunk to the url similar to what we do with the forum slugs:
So instead of: http://www.thecrag.com/ascents/at/11740915/
We could have http://www.thecrag.com/ascents/at/11740915-arapiles/
The chunk after the - would just be ignored, and could be anything, but the templates would generate nice links to make it clear from the url what is going on.
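A tiny sketch of the dummy chunk idea, with hypothetical function names: the id is everything before the first '-', and the rest is decoration.

```python
def parse_facet_segment(segment):
    """Return the node id from '11740915' or '11740915-arapiles'; the rest is ignored."""
    node_id, _, _ignored = segment.partition("-")
    if not (node_id.isascii() and node_id.isdigit()):
        raise ValueError(f"no node id in {segment!r}")
    return int(node_id)


def facet_segment(node_id, name_slug):
    """Build the nice form of the segment for template links."""
    return f"{node_id}-{name_slug}"
```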
Do we stop here?
To be honest as a first cut, and possibly even ongoing I'd be happy with just this. Between cutting out the continent slug, and all the slugs under the TLC, as well as replacing long names with short names, the urls are readable, valuable, short and pretty clean.
It short circuits the whole discussion around canonised urls but I'll still discuss it below anyway. The only real issue with this scheme is if there happens to be a heap of nodes with long names between the country and TLC level for some particular TLCs. I remember Simon came up with a number for the deepest node level which was like 13 or something but I don't think the depth of the deepest TLC would be that deep inside a country. Generally Country > State > Province > Crag would cover 95%? of the TLCs in the world. We need the numbers to back this up.
Canonisation
The question is how and when the 'TLC base urls' can be shortened to something else. There is also the question of whether this should simply be an allowed url or whether it should be the canonical url and what we use internally to link between pages. Ideally we should accept more
Let's assume that it's desirable to throw away sub country regions where we can, eg
Australia > Victoria > North West > Arapiles /australia/vic/north-west/arapiles
would become:
/australia/arapiles
or possibly just
/arapiles
In this case the original url wasn't so long that it was unwieldy, so to me this isn't a compelling argument to get rid of them. I think the country in particular adds value. The other sub country regions don't add as much value but they aren't causing any pain either.
Regardless, if we assume that canonisation IS something we still want then I'd suggest it only removes those sub country region slugs and leaves the country slug intact. To me this suggests two levels of canonisation: level 1, which is reserved for the world node's children (ie continents only in theory) and countries - but not nodes between continent and country; and level 2, which is for the TLCs that deserve promotion up under their country.
As for a process to bootstrap this, I'd pick the TLCs across the world and set a minimum threshold like 200 routes to filter out the minor crags, then work from the biggest crags down, setting their flag and assigning their slug until you run out. If you get a clash then unset the first one and review it manually. I would guess there would be very few of these left, between the min routes filter and the country level uniqueness.
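The bootstrap process might look roughly like this. The function name and tuple shape are assumptions, the route counts in any usage are made up, and this sketch sends both sides of a clash to manual review.

```python
def bootstrap_canonised(tlcs, min_routes=200):
    """Assign country/name slugs to the biggest TLCs first.

    tlcs: list of (node_id, country_slug, name_slug, route_count).
    Returns ({slug: node_id}, [node_ids to review manually]).
    """
    assigned = {}
    clashed = set()
    review = []
    # Filter out minor crags, then work from the biggest crags down.
    ranked = sorted(
        (t for t in tlcs if t[3] >= min_routes), key=lambda t: t[3], reverse=True
    )
    for node_id, country, name, _routes in ranked:
        slug = f"{country}/{name}"
        if slug in clashed:
            review.append(node_id)
        elif slug in assigned:
            # Clash: unset the first holder and review both by hand.
            review.extend([assigned.pop(slug), node_id])
            clashed.add(slug)
        else:
            assigned[slug] = node_id
    return assigned, review
```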
Routing rules
This scheme I feel should be pretty quick. Compared to all the other heavy DB work we're doing, the lookups to do this are nothing and should be lightning quick. It should only be searching the list of < 5000 canonised names or the list of child names of a previously parsed node. Both will be very quick.
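One way to read the two lookups mentioned here, as a hedged sketch: the first segment is looked up in the canonised names, and each later segment among the children of the node parsed so far. The function name and both dict shapes are assumptions, and the id used for the country node in any usage is made up.

```python
def route_path(path, canonised, children):
    """Resolve a url path to a node id, or None if any segment is unknown.

    canonised: {slug: node_id} for the top level canonised names.
    children: {node_id: {child_slug: child_id}}.
    """
    segments = [s for s in path.strip("/").split("/") if s]
    if not segments:
        return None
    # First segment: search the list of canonised names.
    node = canonised.get(segments[0])
    for segment in segments[1:]:
        if node is None:
            return None
        # Later segments: search only the children of the previously parsed node.
        node = children.get(node, {}).get(segment)
    return node
```

Each step is a single dictionary (or indexed table) lookup, which is why the cost stays small however many nodes exist overall.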
Issues
Merging - these shouldn't be an issue after the 301 permanent redirect code gets into prod
Renaming - this could be an issue, but mitigated because regions are under our control. We could easily implement a similar 301 redirect scheme which remembers old region names but I don't think the risk is worth the hassle.
Unicode - URLs in almost all modern browsers accept unicode so we can have funky exotic names in the urls. That said there needs to be some mapping and sanity applied to urls.
Translations - at some point if we want to take over the world we're gonna have to localize the code and get translations done. The easy route is that the url's stay in english. The issue is that right now we have node names that aren't in english, or worse have two languages mixed together like in china. This probably isn't actually a real issue but just needs some testing.
Todos
If we canonise: