theCrag / website

theCrag.com: Add your voice and help guide the development of the world's largest collaborative rock climbing & bouldering platform
https://www.thecrag.com/

Completely automate simple clean urls for nodes #1306

Open brendanheywood opened 10 years ago

brendanheywood commented 10 years ago

Been doing a few url cleanups and additions in places and the current system has a few rough bits I reckon we can smooth out. Let's pick a url like:

http://www.thecrag.com/climbing/australia/mount-wellington/area/11868313

It's fairly clean, and its ancestors back down the chain are fairly good too:

http://www.thecrag.com/climbing/australia/mount-wellington/organ-pipes
http://www.thecrag.com/climbing/australia/mount-wellington
http://www.thecrag.com/climbing/australia/tasmania/hobart
http://www.thecrag.com/climbing/australia/tasmania/area/381838677
http://www.thecrag.com/climbing/australia/tasmania
http://www.thecrag.com/climbing/australia

The things I find a bit of a pain:

What I think we don't want is urls like this:

http://www.thecrag.com/climbing/australia/tasmania/south-east/hobart/mount-wellington/organ-pipes/area/11868313

If we weren't world wide and such a deep structure then this would be the easy solution, other sites do this eg http://climbnz.org.nz/nz/si/canterbury/port-hills/britten-crag/the-alcove/the-zimmerframe-owner-strikes-back

So what I'm thinking is keeping the same system but with a few tweaks:

Or another way to put it, we grab all priority crumb parts, plus the lowest non priority one if it exists, then concat them to form the urlStub.

So lets say that this is the hierarchy, the nodes with an X are 'priority'

Then we would get this set of url's automatically maintenance free:

http://www.thecrag.com/climbing/australia/tasmania/mount-wellington/organ-pipes/area/11868313
http://www.thecrag.com/climbing/australia/tasmania/mount-wellington/organ-pipes
http://www.thecrag.com/climbing/australia/tasmania/mount-wellington
http://www.thecrag.com/climbing/australia/tasmania/hobart
http://www.thecrag.com/climbing/australia/tasmania/area/381838677 (south east)
http://www.thecrag.com/climbing/australia/tasmania
http://www.thecrag.com/climbing/australia
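A rough sketch of the "priority crumbs plus the lowest non-priority one" rule, in Python for illustration only. It assumes one reading of the rule that matches the url list above: every priority ancestor is kept, and the node itself is appended when it isn't flagged. The `Crumb` class, the `url_stub` function, the small node ids and the `area/<id>` fallback for nodes without a slug are all made up for the example, not existing code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Crumb:
    node_id: int
    slug: Optional[str]      # None when the node has no clean slug
    priority: bool = False   # the hypothetical 'priority' flag


def crumb_part(crumb: Crumb) -> str:
    # Assumed fallback: nodes without a slug keep the numeric form.
    return crumb.slug if crumb.slug else f"area/{crumb.node_id}"


def url_stub(trail: List[Crumb]) -> str:
    """Build a stub from a crumb trail ordered root -> node."""
    parts = [crumb_part(c) for c in trail if c.priority]
    leaf = trail[-1]
    if not leaf.priority:
        # The lowest non-priority crumb is read here as the node itself.
        parts.append(crumb_part(leaf))
    return "/climbing/" + "/".join(parts)


# Ids below 10 are placeholders; the two long ids come from the urls above.
trail = [
    Crumb(1, "australia", priority=True),
    Crumb(2, "tasmania", priority=True),
    Crumb(381838677, None),              # south east, no slug
    Crumb(3, "hobart"),
    Crumb(4, "mount-wellington", priority=True),
    Crumb(5, "organ-pipes", priority=True),
    Crumb(11868313, None),
]

for depth in range(len(trail), 0, -1):
    print(url_stub(trail[:depth]))
```

Slicing the trail at each depth reproduces the paths in the list above, which is the only check this sketch has had.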

If we did this in conjunction with #1025 it would be pretty sweet.

This also ties in with a similar concept, at the moment I am declaring to google that some crumb trail items are more important than others for how they appear in search results:

image

I just made up some rough rules but it would be better to ditch them and hang off this priority flag instead.

And lastly to be thorough, we could automate which ones get the priority flag. eg Countries, yes, Crags, yes, but it gets a bit vague with the regions. I can't see an obvious rule which comes up with which regions I think should be given priority that feels right.

scd commented 10 years ago

I imagine something like this some time in the future. If I understand correctly fixing the legacy slugs could involve a lot of nodes, for example if we changed /australia/ to /aus/ then every descendant slug would have to be changed. Same issue with reparenting and merging.

The admin of url slugs is pretty crap at the moment so eventually something will have to change.

cgome commented 10 years ago

Agree. Anything that automates this process would be a big improvement. On Dec 20, 2013 11:45 PM, "Simon Dale" notifications@github.com wrote:

I imagine something like this some time in the future. If I understand correctly fixing the legacy slugs could involve a lot of nodes, for example if we changed /australia/ to /aus/ then every descendant slug would have to be changed. Same issue with reparenting and merging.

The admin of url slugs is pretty crap at the moment so eventually something will have to change.


brendanheywood commented 10 years ago

Another little example, though more about why the current google crumb trail sucks than the url stubs:

image

It says 'South East' but south east what? It looks like South East Australia (coincidentally correct) but is actually South East Tasmania.

brendanheywood commented 8 years ago

Been thinking more about this one lately, in particular about not only completely automating it (just re-read everything above and it all still seems to have stood the test of time) - but I'd add one extra small feature, which is to have both a long form slug and potentially a short form slug for any node, but in particular for region nodes. ie I would have every country autogenerated from its actual name, but also its 2 or 3 letter country code. The same for states or other political sub regions. The idea being that once you get a level or two below them we'd swap to the shorter version if one exists.

So given this data:

(as an aside I think that if any node has a short slug option, then it should always have the crumbtrail 'priority' flag)

We could completely autogenerate this set of urls:

/climbing/australia
/climbing/australia/tasmania
/climbing/au/tasmania/south-east
/climbing/au/tas/south-east/hobart
/climbing/au/tas/hobart/mount-wellington
/climbing/au/tas/mount-wellington/the-organ-pipes
/climbing/au/tas/mount-wellington/johnstones-knob
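To make the long/short swap concrete, here is a small Python sketch. It only covers the abbreviation step: it takes the url parts that have already been chosen and swaps in the short slug for any part sitting two or more levels above the node. That threshold of two is inferred from the list above rather than stated anywhere, and `abbreviate` and `swap_distance` are hypothetical names.

```python
from typing import List, Optional, Tuple

# Each part is (long_slug, short_slug); short_slug is None when there is none.
Part = Tuple[str, Optional[str]]


def abbreviate(parts: List[Part], swap_distance: int = 2) -> str:
    """Join url parts, using short slugs far enough above the leaf."""
    out = []
    last = len(parts) - 1
    for i, (long_slug, short_slug) in enumerate(parts):
        far_enough = (last - i) >= swap_distance
        out.append(short_slug if (short_slug and far_enough) else long_slug)
    return "/climbing/" + "/".join(out)


parts = [
    ("australia", "au"),
    ("tasmania", "tas"),
    ("south-east", None),
    ("hobart", None),
]

for depth in range(1, len(parts) + 1):
    print(abbreviate(parts[:depth]))
```

Which ancestors get included in the first place (south-east drops out once you are below hobart in the list above) is a separate decision and isn't modelled here.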

I don't think correcting the urls after a largish reparent is that big a deal, it's almost exactly the same conceptually as updating all the hierarchy node ids and can be done as part of the same process in the background. And we are only ever adding a new url and moving the old one down a notch so there will be no intermediate broken state to worry about.

Lots of continual reparents or renames will leave behind a potentially long trail of url history for each node, but again I'm not too worried about this. We could just have a simple even manual process once a year to remove urls from more than say a year ago if it ever becomes a problem.

I'll also edit the top set of rules to take this into account

birgander2 commented 8 years ago

Just a comment: This would help a lot to improve Google page rank! In fact, theCrag pages sometimes rank astonishingly low on Google...

brendanheywood commented 8 years ago

Realistically I don't expect this to have any impact on ranking at all. The URLs are already fairly clean, this is mostly around removing the administration burden

scd commented 8 years ago

@birgander2 why do you think it would help in google page ranking? I agree that thecrag is lower than I would expect. I get the feeling that there is something we are missing, but I am not sure what.

birgander2 commented 8 years ago

Google page rank is a science on its own. But keywords (early) in the page title as well as in the url are one of the most important "tricks" to get a better rank - urls as "area/3714293417" or "route/42938740" are not a good idea. No idea if changing them helps significantly, there might be more optimisation necessary...

scd commented 8 years ago

Ta. We currently have manual url assignments. The auto generation will mean that we are capturing this more, so you could be right about improvements to seo.

@brendanheywood would you see it going down to route level?

How would we deal with external resources pointing to old urls after a rename, reparent or merge?

brendanheywood commented 8 years ago

Yeah we've looked into this and put a fair bit of thought into it. Url stuffing is a factor but a minor one, and this is more about human readability and usability. Many websites simply don't give a shit and google reflects that reality. Most modern browsers are moving to hiding the urls for this reason because in most cases they are simply not worth showing.

Re route level, yes I'd really like to get routes too (as well as forums, and everything else) - however technically I think we can solve that very easily using the idea in #1025. It isn't as elegant but means very little internal change for us.

brendanheywood commented 8 years ago

How would we deal with external resources pointing to old urls after a rename, reparent or merge?

We already deal with that quite well and internally it would not change. After a reparent we'd auto generate the new url and then add this to the 'Url stub' field and move whatever was in that field to the 'Rewrite URL Stubs'. After several renames you'd have multiple 'Rewrite URL Stubs' which is fine and what this comment was about:

Lots of continual reparents or renames will leave behind a potentially long trail of url history for each node, but again I'm not too worried about this. We could just have a simple even manual process once a year to remove urls from more than say a year ago if it ever becomes a problem.

If it did ever become a problem we could also record for each url when it was last accessed and remove any which haven't seen an impression in more than a year or whatever.
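As a sketch of the bookkeeping described above (Python, illustration only): the current stub lives in one field, older stubs move into a rewrite list, and stale ones can be pruned by last access. The `NodeUrls` class and its method names are invented for the example and only loosely mirror the 'Url stub' and 'Rewrite URL Stubs' fields.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class NodeUrls:
    url_stub: str
    # old stub -> time it was last accessed (or retired, if never accessed)
    rewrite_stubs: Dict[str, datetime] = field(default_factory=dict)

    def set_stub(self, new_stub: str, now: datetime) -> None:
        """Promote new_stub and move the old stub down a notch."""
        if new_stub == self.url_stub:
            return
        self.rewrite_stubs.pop(new_stub, None)   # reusing an old stub
        self.rewrite_stubs[self.url_stub] = now
        self.url_stub = new_stub

    def touch(self, stub: str, now: datetime) -> None:
        """Record an impression on an old stub."""
        if stub in self.rewrite_stubs:
            self.rewrite_stubs[stub] = now

    def prune(self, now: datetime, max_age: timedelta = timedelta(days=365)) -> None:
        """Drop old stubs that have not been seen within max_age."""
        self.rewrite_stubs = {
            stub: seen
            for stub, seen in self.rewrite_stubs.items()
            if now - seen <= max_age
        }
```

Because the old stub is only ever added to the rewrite list in the same step that the new one is promoted, there is no point at which the node has no working url.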

scd commented 8 years ago

Our current implementation of url stubs relies on absolute stubs (ie 'australia/arapiles' maps to arapiles node) even though it is written as relative stubs. If we allow real relative stubs and define absolute stubs with a leading '/' then I think we can implement this without too many changes.

An absolute stub makes the lookup really quick. Eg lookup '/australia/arapiles' in stubs and if there is a match return node. Relative stubs will be a bit slower because we now have to traverse: look up australia first, then victoria, then arapiles. This only happens once per page view, so I don't think it will have a significant effect on overall performance. Besides, we are getting good at optimising MySQL and it could be a fun bit of work.
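The two lookup styles could be sketched like this, in Python with plain dicts standing in for the stub tables. The node ids, the `ABSOLUTE` and `RELATIVE` names and the `resolve` function are all hypothetical; the real thing would be MySQL queries.

```python
from typing import Dict, Optional, Tuple

ROOT = 0  # hypothetical id of the top-level node

# Absolute stubs: one lookup on the whole path.
ABSOLUTE: Dict[str, int] = {"/australia/arapiles": 4}

# Relative stubs: keyed by (parent node id, stub).
RELATIVE: Dict[Tuple[int, str], int] = {
    (ROOT, "australia"): 1,
    (1, "victoria"): 2,
    (2, "northwest"): 3,
    (3, "arapiles"): 4,
}


def resolve(path: str) -> Optional[int]:
    """Return the node id for a url path, or None if nothing matches."""
    segments = [s for s in path.split("/") if s]
    if not segments:
        return ROOT
    absolute_key = "/" + "/".join(segments)
    if absolute_key in ABSOLUTE:
        return ABSOLUTE[absolute_key]      # fast path: single lookup
    node = ROOT
    for segment in segments:               # slow path: one lookup per level
        child = RELATIVE.get((node, segment))
        if child is None:
            return None
        node = child
    return node
```

The relative walk costs one lookup per path segment, which is the "bit slower" trade-off mentioned above.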

Relative url stubs

If a URL stub associated with a node is relative then to find the full url path we have to go to the parent node to get the full url path. Eg arapiles, go to parent, get northwest, go to parent again, get victoria, go to parent again, get australia, go to parent, get '/'. The end result is '/australia/victoria/northwest/arapiles'. (Note that abbreviations may still come in).

The system should be able to handle both absolute and relative stubs, so at any point if an absolute path is encountered it does not have to go to the parent.
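Going the other way (node to full path), the parent walk might look like the following Python sketch. A leading '/' marks an absolute stub, as proposed above, and the walk stops as soon as it meets one. The `Node` class and `full_path` function are illustrative names only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    stub: str                       # leading '/' means absolute
    parent: Optional["Node"] = None


def full_path(node: Node) -> str:
    """Walk up the parents until an absolute stub (or the root) is hit."""
    parts: List[str] = []
    current: Optional[Node] = node
    while current is not None:
        if current.stub.startswith("/"):
            parts.append(current.stub.strip("/"))
            break                   # absolute: no need to go to the parent
        parts.append(current.stub)
        current = current.parent
    return "/" + "/".join(p for p in reversed(parts) if p)


root = Node("/")
australia = Node("australia", root)
victoria = Node("victoria", australia)
northwest = Node("northwest", victoria)
arapiles = Node("arapiles", northwest)

print(full_path(arapiles))
```

Since each node only stores its own stub, moving a subtree under a new parent changes every descendant's path without touching any descendant rows.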

Relative url stubs mean that re parenting comes for free with one exception (mentioned below).

Automation

Every node automatically maps itself to a canonical relative stub. For example:

External resources

The automation procedure would not delete old relative stubs, so in the example above 'declaration-crag-area' would still be associated with Dec Crag node. This means external resources linking to the old stub would still work (eg Google indexing).

Also if we change paths from '/au/victoria...' to '/aus/victoria...' then the old 'au' will remain in the system so external sites can still reference the old resources.

We would have to prioritise so that if a node is created with the same name as a non-canonical historical stub then the historical version is removed.

Reparenting will cause issues with external resources. For example if we were to reparent 'Arapiles' from 'North West' to 'Victoria' then '/au/vic/nw/arapiles' would no longer find arapiles, because 'arapiles' is no longer in the North West relative path.

I think the benefits outweigh this deficiency. If there are issues we can create a ghost node for the sole purpose of url stub matching on the old parent area. I just don't think it is worthwhile doing this.

Abbreviations

If we added an 'au' relative stub to the australia node then '/au/victoria/northwest/arapiles' would find Arapiles.

However we want to go one step further and tell the system that '/au' should be used as the canonical base for all descendant nodes.

This means we need to add a flag to our canonical url.

The latter can be added manually at any stage. If not added then it will fall back to the leaf node version.

This means we could release and slowly update the abbreviations at our leisure.

Routes

In this implementation there is no difference between routes and areas. Therefore routes are for free.

We probably need to put in place a name length for routes and area urls so maximum url lengths are not exceeded.

Language

If we add a lang field to the url stub then language works exactly the same way.

Will Google need a single canonical url?

If an area has default language set to 'German' then the canonical url will be the german version including the 'klettern' part.

I think we need to add a Language field to node names. This will mean we can fulfil the automation rules for language.

TASK LIST

brendanheywood commented 8 years ago

Moving predominantly to 'relative' stubs will solve heaps of problems. Absolute ones would only be just for legacy and we'd probably be able to drop them in a year or so anyway.

One big thing that we need to decide is whether every ancestor will be in the url or not. If we do then the logic becomes a lot simpler and avoids a few classes of issues. Otherwise at every stub we'd need to do a lookup inside every node's descendants, not just its immediate children. I'm leaning towards simpler and just using all nodes in the url. If we add abbreviations for all region level nodes that will make a big difference to url length. But one big valid exception to this would be continents, I definitely don't want them in the urls. But as soon as we have some exceptions the algorithm still gets a bit muddy, maybe there is some small compromise in the algorithm we can make to keep it fast but not require a full scan of every descendant.

Renames will work well with the relative stub system, but re parents may fail if we are only using relative stubs. So perhaps the solution here is that all legacy stubs are absolute, and the current stub and alternates, are relative. We need to double check whether this logic is enough.

I think the lang stuff needs a bit more fleshing out. In particular the idea of a single canonical url doesn't quite work, we will really have multiple canonical urls one for each language.

So applying the above logic, and assuming all nodes are in the url (except continents) using the example from #1748 assuming we have 1 crag translated into three languages we'd have three canonical urls:

https://www.thecrag.com/en/climbing/esp/valencia-cuenca/alto-mijares - english
https://www.thecrag.com/sp/escalada/esp/valencia-cuenca/alto-mijares - spanish
https://www.thecrag.com/de/klettern/esp/valencia-cuenca/alto-mijares - german

But there might be a large number of alternate names, or leftovers from previous renames. But each of those would point at exactly one of the 3 language variants above as its canonical url.

Each page would also refer to the canonical url of the other languages in a hreflang link attribute.
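For reference, the cross-linking is done with `<link rel="alternate" hreflang="..." href="...">` elements in the page head, one per language version. A minimal Python sketch of generating them follows; the `hreflang_links` helper is made up for the example. One thing to watch: hreflang values are expected to be ISO 639-1 codes, so Spanish would be `es` there even if the url token stays `sp`.

```python
from html import escape
from typing import Dict, List


def hreflang_links(canonicals: Dict[str, str]) -> List[str]:
    """Build one alternate link element per language version.

    canonicals maps an hreflang value (eg 'de') to that language's
    canonical url for the same node.
    """
    return [
        f'<link rel="alternate" hreflang="{escape(lang)}" href="{escape(url)}">'
        for lang, url in sorted(canonicals.items())
    ]


links = hreflang_links({
    "en": "https://www.thecrag.com/en/climbing/esp/valencia-cuenca/alto-mijares",
    "es": "https://www.thecrag.com/sp/escalada/esp/valencia-cuenca/alto-mijares",
    "de": "https://www.thecrag.com/de/klettern/esp/valencia-cuenca/alto-mijares",
})
print("\n".join(links))
```

The same full set, including the page's own language, would go on every language version of the page, so the cost is one line per language per page.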

I think it's probably easiest and also best if the only thing that determined which language the page returns was the initial 'en' 'sp' 'de' token, and any of the names of the url stubs would match. For almost all nodes we would probably just have the 'en' version because it could be the same in most other languages, but we'd still want the other pages to find them.

scd commented 8 years ago

Moving predominantly to 'relative' stubs will solve heaps of problems. Absolute ones would only be just for legacy and we'd probably be able to drop them in a year or so anyway.

Yes at our leisure. There may be reasons why we want to keep it in (eg short gym urls). We don't have to worry right now as long as the system supports both.

One big thing that we need to decide with whether every ancestor will be in the url or not.

My assumption is that we are doing this.

If we do then the logic becomes a lot simpler and avoids a few classes of issues. Otherwise at every stub we'd need to do a lookup inside every node's descendants, not just its immediate children.

While we have a table which would allow us to do this, the query would be slower and sometimes be ambiguous.

I'm leaning towards simpler and just using all nodes in the url. If we add abbreviations for all region level nodes that will make a big difference to url length.

Yes this is what I really liked about your proposal and has motivated me to want to finish this issue off.

But one big valid exception to this would be continents, I definitely don't want them in the urls. But as soon as we have some exceptions the algorithm still gets a bit muddy, maybe there is some small compromise in the algorithm we can make to keep it fast but not require a full scan of every descendant.

Maybe you could expand your thoughts with the continent. Is your view that continents are a necessary evil?

I see no problems with '/europe' taking you to the Europe node. There are a couple of options to implement '/germany' skipping europe and taking you straight to germany.

  1. Implement the abbreviation as an empty string, so the '/germany' string would take you straight to germany. Actually the empty string would implement '//germany' which is a bit strange. Maybe we could do it as a flag.
  2. Use the absolute '/germany' stub at each country. This would mean the continents are just skipped.

If it is just a problem for continents then the absolute stub is the way to go. If we want to do the same for regions then the flag is the way to go.

Renames will work well with the relative stub system, but re parents may fail if we are only using relative stubs. So perhaps the solution here is that all legacy stubs are absolute, and the current stub and alternates, are relative. We need to double check whether this logic is enough.

If we reparent we do not want to have to create legacy absolute stubs for all descendants. Turning the current node into a legacy absolute stub would provide a mechanism to map all descendants via the single absolute legacy stub, but if we then subsequently reparent one of the descendants then we would have to create multiple legacy absolute stubs in order to completely preserve all legacy.

I think the lang stuff needs a bit more fleshing out. In particular the idea of a single canonical url doesn't quite work, we will really have multiple canonical urls one for each language.

What about Google? Don't we have to tell Google one canonical URL?

I think that we should be telling Google that the

https://www.thecrag.com/de/klettern/deutschland

url is the canonical url because Germany has a default language of 'de'.

However if an australian user was navigating the index then they would see the

https://www.thecrag.com/en/climbing/germany

as the canonical url.

So applying the above logic, and assuming all nodes are in the url (except continents) using the example from #1748 assuming we have 1 crag translated into three languages we'd have three canonical urls:

https://www.thecrag.com/en/climbing/esp/valencia-cuenca/alto-mijares - english
https://www.thecrag.com/sp/escalada/esp/valencia-cuenca/alto-mijares - spanish
https://www.thecrag.com/de/klettern/esp/valencia-cuenca/alto-mijares - german

But there might be a large number of alternate names, or leftovers from previous renames. But each of those would point at exactly one of the 3 language variants above as its canonical url.

+1 - this is exactly how I would think it would work.

Each page would also refer to the canonical url of the other languages in a hreflang link attribute.

Is that how it works? So when we have 10 languages, we will have 10 extra lines of hreflang languages for every page.

I think it's probably easiest and also best if the only thing that determined which language the page returns was the initial 'en' 'sp' 'de' token, and any of the names of the url stubs would match. For almost all nodes we would probably just have the 'en' version because it could be the same in most other languages, but we'd still want the other pages to find them.

Our legacy urls all have no 'en'. Can we do it without 'en', 'sp', etc? If we see '/klettern' can we just know it is de lang?
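A sketch of how both forms could coexist, in Python for illustration: an explicit language token wins when present, otherwise the language is implied by the activity word, so legacy '/climbing/...' urls keep working. The `ACTIVITY_LANG` table and `split_language` function are hypothetical, and whether an implied language is good enough for canonical urls is a separate question.

```python
from typing import Dict, List, Tuple

# Activity word -> language it implies (tokens as used in this thread).
ACTIVITY_LANG: Dict[str, str] = {
    "climbing": "en",
    "escalada": "sp",
    "klettern": "de",
}
LANG_TOKENS = set(ACTIVITY_LANG.values())


def split_language(path: str) -> Tuple[str, List[str]]:
    """Return (language, remaining node segments) for a url path."""
    segments = [s for s in path.split("/") if s]
    explicit = None
    if segments and segments[0] in LANG_TOKENS:
        explicit = segments.pop(0)           # eg the 'de' in /de/klettern/...
    if not segments or segments[0] not in ACTIVITY_LANG:
        raise ValueError(f"not an index url: {path!r}")
    implied = ACTIVITY_LANG[segments.pop(0)]
    return (explicit or implied), segments
```

With this, '/climbing/turkey/datca' comes out as 'en' and '/klettern/deutschland' as 'de', while '/en/klettern/deutschland' is treated as an english page reached via a german activity word.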

brendanheywood commented 8 years ago

If it is just a problem for continents then the absolute stub is the way to go. If we want to do the same for regions then the flag is the way to go.

I think it's probably going to be safer and more future proof if we do it with flags rather than assume we will only ever want it to remove continents. There are some very low value region names which are more about making things easy to navigate.

What about Google? Don't we have to tell Google one canonical URL.

This is just the semantics of 'canonical'. You could argue that an automated translation of an english page into german should have a canonical url of the original english page. Or you could also argue that they are separate pages and both canonical. Google's chosen semantics are the latter. I think their reasoning is that in translating anything properly the content naturally diverges from a pure translation, ie two wikipedia pages on the same subject in different languages could have quite different internal document structure. Or because of the nature of the language two pages could be merged in one language but not in another. Thankfully we can safely assume that all languages will have exactly the same node index structure, but wikipedia doesn't have that luxury.

An analogy could be published books, each language version is considered a different product and so gets a new ISBN.

To flesh it out with an example let's assume we have a node which has an 'en' name and a 'de' name and its descriptions are also available in both languages as well. Assuming we go with the more flexible option of matching any language name alternate in the relative stub, then we get 4 permutations which map to 2 canonical urls:

/de/klettern/deutschland -> https://www.thecrag.com/de/klettern/deutschland
/de/klettern/germany     -> https://www.thecrag.com/de/klettern/deutschland
/en/climbing/germany     -> https://www.thecrag.com/en/climbing/germany
/en/climbing/deutschland -> https://www.thecrag.com/en/climbing/germany
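The mapping above can be sketched in a few lines of Python: the node is found by matching the name in any language, and the prefix alone picks which canonical url comes back. The table names, the node id and the fixed three-segment path are simplifications for the example.

```python
from typing import Dict, Tuple

GERMANY = 1  # hypothetical node id

# A stub in any language finds the node.
NAME_TO_NODE: Dict[str, int] = {"germany": GERMANY, "deutschland": GERMANY}

# Canonical name per (node, language) and activity word per language.
CANONICAL_NAME: Dict[Tuple[int, str], str] = {
    (GERMANY, "en"): "germany",
    (GERMANY, "de"): "deutschland",
}
ACTIVITY: Dict[str, str] = {"en": "climbing", "de": "klettern"}


def canonical_url(path: str) -> str:
    """Map /<lang>/<activity>/<name> to that language's canonical url."""
    lang, _activity, name = path.strip("/").split("/")
    node = NAME_TO_NODE[name]
    return (
        f"https://www.thecrag.com/{lang}/{ACTIVITY[lang]}/"
        f"{CANONICAL_NAME[(node, lang)]}"
    )


for p in ("/de/klettern/deutschland", "/de/klettern/germany",
          "/en/climbing/germany", "/en/climbing/deutschland"):
    print(p, "->", canonical_url(p))
```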

On top of this we additionally tag all of them with the 'other' languages canonical page using the hreflang link so google knows they are related but non canonical. We also need to have some clear rules around when a page is deemed to 'exist' in a new language. I don't think having an alternate name in another language is enough, it needs to have real content either markdown for the area fields or in at least one direct child.

The thing I want to avoid is if we turn this new system on and we end up with all the empty mirror sites. Ie let's say we have the codes for 200 languages in the DB, and say Nicky adds some german descriptions for the Germany node but no actual crags just yet.

So if you visit this page you'll get a nice german page, and only show you info it has in german and filter out all the english or other languages:

/de/klettern/deutschland

but if you click on the Berlin child which isn't translated, which url should it take you to?

/en/climbing/germany/berlin

or to

/de/klettern/deutschland/berlin

The first will mean you keep jumping around different language versions as you navigate. Which might be annoying but is really the most useful behaviour. The second I see as quite dangerous as it means google will start scraping it and treating it like a real page. Once a single page is seeded in a new language google would then scrape the whole index in that language which would be very very empty.

We also have to be careful that for somebody who prefers german, ie their browser has the headers for german set, we prioritise sending german content if they have it. But at the same time we don't stop them from viewing the english version if they want. Because they are different urls this is easy, but if they are browsing in german and get booted to english on one page because it's not translated, and then they navigate back, should it go 'back' to the english page or the german page, and importantly, how do we tell these two scenarios apart? If we render links that change depending on what language a person prefers then we need to make sure that the http Vary header includes Accept-Language. So I think we need a session variable / user preference which overrides the browser's Accept-Language.
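The override order could look something like this Python sketch: a session variable or user preference wins, then the browser's Accept-Language header in q-value order, then a default. The function names and the 'en' default are assumptions, and the header parsing is deliberately simplified (it only keeps the primary language subtag).

```python
from typing import List, Optional, Set


def parse_accept_language(header: str) -> List[str]:
    """Return primary language tags from the header, best q-value first."""
    prefs = []
    for item in header.split(","):
        tag, _, params = item.strip().partition(";")
        if not tag:
            continue
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        prefs.append((tag.strip().lower().split("-")[0], q))
    prefs.sort(key=lambda p: p[1], reverse=True)   # stable sort keeps ties in order
    return [tag for tag, q in prefs if q > 0]


def choose_language(available: Set[str], accept_language: str = "",
                    override: Optional[str] = None, default: str = "en") -> str:
    """Session/user override first, then the browser header, then default."""
    if override in available:
        return override
    for lang in parse_accept_language(accept_language):
        if lang in available:
            return lang
    return default
```

Any response whose links or content were picked this way from the header would need `Vary: Accept-Language` so caches don't hand a german-flavoured page to an english reader.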

On the flip side, much of the region level index doesn't actually have much content but probably should be scraped and indexed. And some crags right now are just a list of routes and topos with no descriptions so are equally valid as a german page as an english page. So why shouldn't they appear as so? So we need some good heuristics for when to decide that a node 'exists', and probably a good rule of thumb is if any of its descendants are translated.
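That rule of thumb is easy to express as a recursive check. A Python sketch, with a made-up `Node` shape where each node records which languages it has real content in:

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    name: str
    content_langs: Set[str] = field(default_factory=set)  # langs with real content
    children: List["Node"] = field(default_factory=list)


def exists_in(node: Node, lang: str) -> bool:
    """True if the node or any of its descendants has content in lang."""
    if lang in node.content_langs:
        return True
    return any(exists_in(child, lang) for child in node.children)


berlin = Node("Berlin", {"en"})
germany = Node("Germany", {"en", "de"}, [berlin])
europe = Node("Europe", {"en"}, [germany])
```

Here Europe 'exists' in german because Germany does, while Berlin does not. In practice the recursive scan would presumably be replaced by a flag maintained when content is saved; limiting it to direct children, as floated earlier, would be a one-line change.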

Getting back to the name matching, there are a few reasons I think we need to be flexible with node name matching across any language:

scd commented 8 years ago

If I have selected my language to be German and go to the page /en/climbing/germany/berlin, which has no German content translations, do the template headers etc still show in german?

Or if I was a German visiting Arapiles I would want to see as much as possible in German. So the page en/climbing/australia/arapiles would have the headers and menus translated into german but not the area descriptions.

This does not quite seem right to me? Unless the en is a weak language specifier applying only to user content.

This would be resolved if the German was visiting Arapiles as de/klettern/australia/arapiles.

Can we just do without the leading en so they visit climbing/australia/arapiles and the user lang setting/browser setting does the rest?

brendanheywood commented 8 years ago

There are two very separate concerns here, one is what a human user sees and the logic around combining their account prefs, their browser accepts header and the /en/ or /de/ prefix and deciding what language to show. The second is around sending the right canonical url to google for indexing.

We need /en/ and /de/ to distinguish different language versions of the same page as they must have different canonical urls, and this should apply to any page across the entire site which can be translated, not just index pages. I think as long as we are careful to cross link between the canonical versions of the pages google won't mess up our pagerank. So as soon as we flip the switch on a new language google will crawl the whole index twice. I was previously thinking we'd selectively only turn on canonical urls for content where we know it is translated, and still on the fence, but moving towards just the simpler model of it all going live at once, even if it serves mixed language content (ie nav in lang X, content in lang Y).

I agree with your logic above, if I as an english speaker went to germany I'd expect to see much of the routes in german but the nav in english.

The other thing we should consider is are the content languages tightly coupled to the translated nav languages? ie right now we have pages that are in chinese or russian or whatever, should we make an attempt to let people tell us what language the content is in, without requiring that language to also be translated in the nav?

scd commented 8 years ago

So as soon as we flip the switch on a new language google will crawl the whole index twice. I was previously thinking we'd selectively only turn on canonical urls for content where we know it is translated, and still on the fence, but moving towards just the simpler model of it all going live at once, even if it serves mixed language content (ie nav in lang X, content in lang Y)

So en/climbing/australia will serve German content to a German where possible, and the main motivation is Google indexing. I am on board with that. Do you know of any good articles on Google page rank for multiple languages, or is your expertise just from experience? I guess we should make sure our expertise is up to scratch before we commit to major changes.

The other thing we should consider is are the content languages tightly coupled to the translated nav languages? ie right now we have pages that are in chinese or russian or whatever, should we make an attempt to let people tell us what language the content is in, without requiring that language to also be translated in the nav?

I think this is fairly simple to resolve in a similar way to default grading context. So the world would have a default language of english and germany would have a default language of german, but a particular crag in germany may default back to english. We can just store this as an inherited variable like we do grade context.
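A sketch of the inherited variable idea in Python: each node may set a default language, and a node without one takes the nearest ancestor's. The `Node` class, the field name and the example crag are invented here; the grade context code itself isn't shown in this thread, so this is only the general pattern.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    default_lang: Optional[str] = None   # None means inherit


def effective_lang(node: Node, fallback: str = "en") -> str:
    """Nearest explicitly set default language, walking up the parents."""
    current: Optional[Node] = node
    while current is not None:
        if current.default_lang:
            return current.default_lang
        current = current.parent
    return fallback


world = Node("World", default_lang="en")
germany = Node("Germany", world, default_lang="de")
berlin = Node("Berlin", germany)                           # inherits 'de'
some_crag = Node("Some crag", germany, default_lang="en")  # overrides back to 'en'
```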

brendanheywood commented 8 years ago

This is all from reading the google help docs and matt cutts blog, eg

https://support.google.com/webmasters/answer/182192

https://support.google.com/webmasters/answer/189077?hl=en&ref_topic=2370587

I'm still in two minds about the mixed serving, if I was say a german, but I can also read french and italian, and I'm looking at a crag which doesn't have a german translation but does have french and italian I'd want to be able to easily flip between the two available languages. So regardless of the UX, under the hood would we be swapping the users prefs (I'd say no), or setting a temporary session, or we could simply just link between the /fr/ and /it/ versions and that's it.

I guess the good thing is that we don't need to get it right first time, and we can evolve it. The only thing we really need to nail is what we serve to anonymous people and google, and that's much simpler as in those cases we only have the prefix (google which doesn't set accepts headers by default, see this https://support.google.com/webmasters/answer/6144055?hl=en), or humans who would have both a prefix and the accepts headers.

I think this is fairly simple to resolve in a similar way to default grading context.

We are talking about two different things. Yes we do need default language(s) like the grade contexts concept. But I meant we are shortly going to have EN, KO, IT, and DE as our nav languages. Hypothetically we may not get to translating the navigation to russian for a couple years. But at some point we are also going to roll out the ability for editors to specify which language content is in and add multiple versions. We should not stop a russian editor from adding a russian translation of content just because we haven't translated the nav. So in this case what url would we use, and what would google see? If they provide 3 lang versions then we really should have 3 canonical urls, which means that the particular russian lang url version, by virtue of having translated content, should be served under /ru/, but maybe would default to show the english nav?

brendanheywood commented 7 years ago

Just another data point, Datça, but we currently need to reduce the url stub to English only lowercase eg:

https://www.thecrag.com/climbing/turkey/datca

So need to find a good solid perl lib to reduce essentially any word in any language back to a hopefully useful human readable stub. It may be that in some languages, eg character based asian ones, that we don't do this at all and stick with a number as it is, or the english version.
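To show the kind of reduction meant here, a minimal sketch using only Python's standard library (so not the Perl lib asked for above, just an illustration of the technique): decompose accented characters with Unicode NFKD, drop whatever isn't ASCII, then hyphenate. The `slugify` name and the choice to return None when nothing survives are assumptions.

```python
import re
import unicodedata
from typing import Optional


def slugify(name: str) -> Optional[str]:
    """Reduce a node name to a lowercase ASCII stub, or None if empty."""
    # NFKD splits 'ç' into 'c' plus a combining cedilla, and so on.
    decomposed = unicodedata.normalize("NFKD", name)
    # Dropping non-ASCII removes the combining marks (and any script
    # that has no ASCII decomposition at all).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")
    return slug or None


print(slugify("Datça"))
print(slugify("Mount Wellington"))
```

Names in character based scripts come back as None, which is where falling back to the number or the english version would kick in. Letters with no decomposition (the Turkish dotless ı, for one) are silently dropped by this approach, so a proper transliteration library would still do better.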

brendanheywood commented 5 years ago

+1 via @quaestor from https://github.com/theCrag/website/issues/2873

rouletout commented 3 years ago

Language specific URLs for areas were introduced. Let's observe what the impact is.