pkp / pkp-lib

The library used by PKP's applications OJS, OMP and OPS, open source software for scholarly publishing.
https://pkp.sfu.ca
GNU General Public License v3.0

Ensure that all languages are indexable by crawlers #699

Closed asmecher closed 4 months ago

asmecher commented 9 years ago

Describe the problem you would like to solve
Users can switch languages while reading a multilingual journal. However, the currently active language is stored in a cookie on the user's device and not in the URL. As a result, a search engine crawler cannot index information in languages other than the journal's primary language.

Describe the solution you'd like
No consensus has been reached on a proposed solution.

Who is asking for this feature?
Multilingual journals that want to be indexed by Google (not Google Scholar).

Additional information
See http://forum.pkp.sfu.ca/t/keep-ui-archivable-by-heritrix-web-crawler/3207/6 for details.

NateWr commented 9 years ago

I've switched to a link in my ui branch of OMP. I'll update it for OJS too, but as you discussed in email, we still need to propagate via URL.

NateWr commented 8 years ago

OJS uses links for the language switcher (can't remember when this was implemented). But I think the issue of propagating the language within the URL is more in your wheelhouse. Assigning back to you unless you'd like me to look further.

asmecher commented 8 years ago

Sure, I'll take a look.

asmecher commented 8 years ago

Hmm. I don't like constructs that only show up for search engines, e.g. via user agents. So we're left with adding the language to system URLs in the general case, which I'm hesitant to impose on single-language journals; adding this as an optional mode could provide flexibility for both types of users, but switching between them might be catastrophic as all URLs would change. We could potentially have the URL generation code add a URL parameter for language, which would allow interoperability between the two modes -- but this would need to behave well with e.g. POST forms and Javascript, which might not be expecting URL parameters to suddenly get included. Deferring pending more consideration.

Vitaliy-1 commented 7 years ago

Greetings @asmecher How can I add language in the url on article detail page? For example my aim is to add additional parameter only for non-primary locale (Ukrainian in our case). The problem is that there is no other way for Google to index it...

I have already got some experience in PHP and Java EE, so hope if you guide me I could manage this problem. From where to start?

asmecher commented 7 years ago

Hi @Vitaliy-1 -- the code for this is pretty much constrained to pages/article/ArticleHandler.inc.php. PATH_INFO URL components come in via the $args parameter to each function. Have a look there and see if it makes sense -- let me know if you get stuck somewhere specific.

Vitaliy-1 commented 7 years ago

Thanks for reply @asmecher ,

Hmm, $args is an array that, from my point of view, contains only the article id. $request is a Request object from which I can, for example, retrieve the URL or redirect the request, but not change it. PATH_INFO can be seen in the $_SERVER array. I do not see a way to modify the URL here. I am missing something...

Can you show me an example of URL mapping?

I know that the view function (a method of this Handler class) is crucial for displaying the article landing page. It is responsible for the view part of the URL. How is it possible to change it from view/ to view/uk/? Or maybe it is better to work with the last part of the URL, the article id? Where does the latter actually come from? I thought from the articleid variable, but changing it has no effect...

So, I am thinking about something like:

$currentLocale = AppLocale::getLocale();
$defaultLocale = AppLocale::getPrimaryLocale();
if ($currentLocale != $defaultLocale) {
  // Take the two-letter language part, e.g. "uk" from "uk_UA", to get view/uk/
  $addToUrl = substr($currentLocale, 0, 2);
  //add $addToUrl to Url
}

Maybe just create new page with this url pattern and redirect like this to it. But in Java it is possible to map one servlet to several url patterns. I am confused.

Vitaliy-1 commented 7 years ago

Hi again, @asmecher

It's not easy without much experience in programming to read and understand others' code. But I know that you haven't got much time for helping others to write the code.

After browsing classes I found the PKPPageRouter class and its method route https://github.com/pkp/pkp-lib/blob/master/classes/core/PKPPageRouter.inc.php#L146 I suppose it takes the URL entered by the user and maps it to a specific OJS file. There is a hook inside called LoadHandler, which carries 3 variables. $page and $op seem to represent parameters from the URL, and $sourceFile represents the path to a Smarty template (I hope).

I have created a mockup of a plugin here to manage this hook: https://github.com/Vitaliy-1/localeRedirect/blob/master/LocaleRedirectPlugin.inc.php

Can you confirm that I am on the right path? Or would you not use this hook for the task described earlier?

Vitaliy-1 commented 7 years ago

Another approach that I found is to modify the initialize function inside the ArticleHandler class. As a quick example of what I plan to work with:

function initialize($request, $args) {
    // If the first path segment is the locale, the article and galley IDs shift by one position.
    if (isset($args[0]) && $args[0] == "uk_UA") {
        $articleId = isset($args[1]) ? $args[1] : 0;
        $galleyId = isset($args[2]) ? $args[2] : 0;
        $request->getSession()->setSessionVar("currentLocale", "uk_UA");
    } else {
        $articleId = isset($args[0]) ? $args[0] : 0;
        $galleyId = isset($args[1]) ? $args[1] : 0;
    }
    // original code here

    return $request;
}

So the question remains: which approach is better in your opinion? Or none of them? And will Google actually see the page for the selected locale?

asmecher commented 7 years ago

@Vitaliy-1, my worry is about ambiguity in URLs. If I'm reading correctly, this would result in "equivalent" URLs like...

However, that last one could be read two ways: a galley view with article ID "smecher17", galley ID "pdf", or an article view with locale "smecher17" and article ID "pdf". We can code around it here but there will be lots of knock-on complication, e.g. in parsing URLs for statistics calculations in the log files.

I think it's definitely necessary to...

What about using an optional URL parameter, e.g.: .../article/view/smecher17/pdf?locale=uk_UA? It's not as pretty as your proposal, but isn't ambiguous, and it should be clear to readers how it'll behave. To facilitate indexing, I would think the only additional thing that's needed is better linking to different-language versions, in the front end and probably also in meta content.
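A minimal sketch of how such an optional parameter could be resolved on the server side. This is not OJS code: resolveLocale() is a hypothetical helper, and the parameter name "locale" and the locale lists are taken from the example above purely for illustration.

```php
<?php
// Hypothetical helper, not part of OJS: pick the locale for a request.
// $query is the parsed query string (e.g. $_GET), $supported lists the locales
// the journal offers, and $primary is used when no valid parameter is present.
function resolveLocale(array $query, array $supported, string $primary): string
{
    $candidate = $query['locale'] ?? null;
    // Accept only locales the journal actually supports, so a mistyped or
    // unexpected value falls back to the primary locale.
    if (is_string($candidate) && in_array($candidate, $supported, true)) {
        return $candidate;
    }
    return $primary;
}

// Usage with the URL shape from the comment above.
$queryString = (string) parse_url('/article/view/smecher17/pdf?locale=uk_UA', PHP_URL_QUERY);
parse_str($queryString, $query);
echo resolveLocale($query, ['en_US', 'uk_UA'], 'en_US'), "\n";
```

Because the path segments are untouched, the article ID and galley ID keep their positions, which is what removes the ambiguity described above.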

Vitaliy-1 commented 7 years ago

Greetings @asmecher

While writing the code I have encountered a problem with language toggle. As an example of changing locale:

$_SESSION["currentLocale"] = "en_US"; or $request->getSession()->setSessionVar("currentLocale", "en_US");

The lines above change the displayed locale text only on the following request (although the session locale changes immediately). The only way around this that I found is:

$request->redirectUrl(...);

Is there a cleaner way?

Vitaliy-1 commented 7 years ago

Ahh, the problem can be managed by assigning the values inside the constructor of the SessionManager class. Obviously session values can't be changed once they are already assigned, can they?

asmecher commented 7 years ago

@Vitaliy-1, rather than working via session parameters, I'd suggest adding a facility to the AppLocale class that permits setting the locale, rather than just getting it. This would involve moving the $currentLocale variable there out into the class, and adding a new setLocale function.
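A rough sketch of the shape this suggestion could take, assuming the locale lives in a static class property rather than in a function-local static variable. LocaleRegistry is a made-up stand-in for illustration, not the real AppLocale class, and the fallback argument is an assumption.

```php
<?php
// Illustrative stand-in for the idea above; not the real AppLocale class.
class LocaleRegistry
{
    /** @var string|null Locale for the current request; null until it is set. */
    private static $currentLocale = null;

    /** Explicitly set the locale, e.g. from the URL, before anything reads it. */
    public static function setLocale(string $locale): void
    {
        self::$currentLocale = $locale;
    }

    /** Return the explicitly set locale, or the supplied fallback if none was set. */
    public static function getLocale(string $fallback): string
    {
        return self::$currentLocale ?? $fallback;
    }
}

// Usage: set the locale once, early in the request, then read it anywhere.
LocaleRegistry::setLocale('uk_UA');
echo LocaleRegistry::getLocale('en_US'), "\n";
```

Keeping the value in a class property means a setter and the getter share the same state within one request, without going through the session.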

Vitaliy-1 commented 7 years ago

Thanks for guidance @asmecher

There is another problem after applying the modifications as per your advice. The locale of the plugins does not change immediately after calling the setLocale method; they need the session to be refreshed. But the core locale updates accordingly.

My AppLocale class: https://github.com/Vitaliy-1/AppLocale/blob/master/AppLocale.inc.php

This how I call setLocale method from a plugin: https://github.com/Vitaliy-1/localeRedirect/blob/master/LocaleRedirectPlugin.inc.php#L41

Vitaliy-1 commented 7 years ago

Hi @asmecher

I have managed to make a separate URL for the non-primary locale. After looking over several options and reading Google's guidelines on multilingual sites, I picked the variant with a separate subdomain. It has no conflicts with the main code: OJS handles requests to subdomains without any need to declare them in the Apache configuration files. Only subdomain registration is needed. I have checked this on the production system, and it works fine with both existing and new user sessions. One problem was making a switcher on the admin dashboard side, because the standard tools for routing the current location didn't work in usernav.tpl (as it is not actually a page), but this was solved with HTTP_REFERER and a bit of regex.

But I wasn't able to code an appropriate setter for the AppLocale class, so I made the modification in the SessionManager class instead: it sets the currentLocale var for the user session depending on the presence of a subdomain in the URL.

Do you actually need this sort of a plugin for public use? If so, how can I manage a setter for changing languages?

Vitaliy-1 commented 6 years ago

Hi @asmecher So what about the idea to give separate subdomains for non-primary locales? We have successfully tested it for several months, and there weren't any disruptions in publication, indexing or XML exporting processes.

asmecher commented 6 years ago

@Vitaliy-1, sorry I haven't been following this as closely as I'd like. Subdomains would certainly solve the problem for some, though it's probably not a general-purpose enough solution for everyone (thinking e.g. of the many users who don't have their own domains or lack expertise in setting up subdomains). Can you summarize what was required to set this up (e.g. patches etc)?

ajnyga commented 6 years ago

Just dropping this here although it does include some obvious things: https://support.google.com/webmasters/answer/182192, most important part in the end.

Some suggestions for a journal with two locales. The default is English and the secondary is Deutsch. Basically, the default locale would also work if a locale existed in the URI, but would result in a redirect as suggested by Google in the above document. For clarity's sake I am not showing the index.php part, which many sites hide anyway.

Main site (or do we need these for the main site?)

Journal index:

Single article:

(edit: how come nobody has registered site.com?)

Vitaliy-1 commented 6 years ago

Hi @asmecher Nothing really special. Most of the modifications were done inside the SessionManager class. I really wanted to add a method to the AppLocale class, but ran into the problems described above. My new method:

private function subdomainLocaleRedirect(PKPRequest $request)
{
    // The first label of the host name selects the locale, e.g. "uk" in uk.example.com
    $domainLocalePointer = explode(".", $_SERVER['HTTP_HOST'])[0];
    $journal = $request->getJournal();
    $site = $request->getSite();

    // get supported locales and primary locale
    if ($journal != null) {
        $locales = $journal->getSupportedLocaleNames();
        $primaryLocale = $journal->getPrimaryLocale();
    } else {
        $locales = $site->getSupportedLocaleNames();
        $primaryLocale = $site->getPrimaryLocale();
    }

    // make an array where key is 2 first chars from supported locale and values - corresponding locale name
    $supportedLocalesforDomain = [];
    foreach ($locales as $key => $supportedLocale) {
        if ($key != $primaryLocale) {
            $supportedLocalesforDomain[substr($key, 0, 2)] = $key;
        }
    }

    if (empty($supportedLocalesforDomain)) return false;

    if ($this->userSession != null) {
        // A single lookup keeps the result independent of iteration order
        // when more than one non-primary locale is configured
        $targetLocale = isset($supportedLocalesforDomain[$domainLocalePointer])
            ? $supportedLocalesforDomain[$domainLocalePointer]
            : $primaryLocale;
        if ($this->userSession->getSessionVar("currentLocale") != $targetLocale) {
            $this->userSession->setSessionVar("currentLocale", $targetLocale);
        }
    }
}

At line 68, after $now = time();, I call: $this->subdomainLocaleRedirect($request);

Additional lines for the case where the user's cookies are not set. This needs to be rewritten to retrieve the locales that are actually installed. I put it after the creation of a new session, after this line: $this->userSession->setSecondsLastUsed($now);:

$domainLocalePointer = explode(".", $_SERVER['HTTP_HOST'])[0];
if ($domainLocalePointer == "uk") {
    $this->userSession->setSessionVar("currentLocale", "uk_UA");
} else {
    $this->userSession->setSessionVar("currentLocale", "en_US");
}

There is no need to add subdomain on a server level. OJS will serve these requests appropriately.

Vitaliy-1 commented 6 years ago

As I am not a programmer, think this code could be optimized :)

jmvezic commented 6 years ago

@Vitaliy-1 Was this issue ever pushed to OJS 2? Because I'm not getting all languages indexed in my installation either.

Vitaliy-1 commented 6 years ago

I think not. As I wrote here, it is a complex problem. Because crawlers index only one locale per URL, the URL needs to change for every non-primary locale. I suppose the best thing here is to add a query string parameter; the primary locale should always keep the default URL and be the one associated with indexing (DOI, PMID, OAI etc.)

jmvezic commented 6 years ago

You mean as a parameter or as an actual part of path? I tried your solution with SessionManager but it gives me a "too many redirects" error. That could be due to the fact I already have some redirects based on the country of visitor, though...

Vitaliy-1 commented 6 years ago

I suppose a parameter would be easier, and I saw that this was implemented in at least one OJS 2 journal. If you already have redirects, it's quite possible you need to rewrite that part of the code. Keep in mind that the subdomain approach will probably require registering a subdomain.

jmvezic commented 6 years ago

Yeah, subdomain would actually be more difficult in my scenario, not because of the registering but because we also have an OMP installation and we plan on having a Wordpress installation as a base directory, all on the same domain.

In any case, I'll try doing a parameter and will report back here with how it goes.

jmvezic commented 6 years ago

@asmecher

What about using an optional URL parameter, e.g.: .../article/view/smecher17/pdf?locale=uk_UA? It's not as pretty as your proposal, but isn't ambiguous, and it should be clear to readers how it'll behave. To facilitate indexing, I would think the only additional thing that's needed is better linking to different-language versions, in the front end and probably also in meta content.

How should I go about doing this? In which file would it best to make these changes? I tried doing it in the header.tpl of a theme, but I get all sorts of errors (probably because I'm using RESTful URLs?)

I'm using 2.4.8.3, for reference.

Vitaliy-1 commented 6 years ago

@jmvezic I suppose you need to intercept web request on a higher level in the class that is responsible for handling them. In OJS 3 I worked with SessionManager class. Although, maybe this can be done through a plugin with an appropriate hook (I'm not sure). I modified templates only for pointing the right links to locales' pages.

jmvezic commented 6 years ago

@Vitaliy-1 I suppose I could have a go at it, I haven't interfered with classes yet for fear of breaking something. You think Google would "catch" the parameter if it was added dynamically through SessionManager?

Vitaliy-1 commented 6 years ago

  1. If a user is using a specific locale, the URL string must include the corresponding locale parameter.
  2. If a user follows a link with a specific locale parameter, the language should change accordingly.
  3. I suppose the primary locale shouldn't carry the optional locale parameter, in order not to break the indexing that all your articles already have. Otherwise you should provide redirection. For example, if you have DOIs, after adding an additional parameter they will point to the wrong URL.

I'm seeing it like this. This all can be put in PHP. Here you need to work with AppLocale class (in OJS3) and URL string from the request.

Google certainly will cache any parameter that you add to URL.
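To make the three rules above concrete, here is a small sketch of the URL-building side. localizedUrl() is a hypothetical helper rather than an OJS function, and the parameter name "locale" is an assumption; fragments (#...) are not handled.

```php
<?php
// Hypothetical helper: add a locale parameter only for non-primary locales,
// so primary-locale URLs (and the DOIs pointing at them) stay unchanged.
function localizedUrl(string $url, string $locale, string $primary): string
{
    if ($locale === $primary) {
        return $url;
    }
    // Respect an existing query string when appending the parameter.
    $separator = (strpos($url, '?') === false) ? '?' : '&';
    return $url . $separator . 'locale=' . rawurlencode($locale);
}

// Usage: the primary locale keeps its URL, the secondary locale gets a parameter.
echo localizedUrl('/journal/article/view/1402', 'en_US', 'en_US'), "\n";
echo localizedUrl('/journal/article/view/1402', 'uk_UA', 'en_US'), "\n";
```

Rule 2 (changing the language when such a link is followed) would then be the reading side: validate the parameter against the supported locales and set the locale for the request.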

jmvezic commented 6 years ago

Okay, so I've made a bit of a hack which could work in my case, I hope. I've added the following code to the article/header.tpl file of my theme:

{php}
// Make sure the URL always carries the active locale as ?lang=...
$Locale = AppLocale::getLocale();
// Strip any existing query string so "?lang=" is never appended twice
$path = strtok($_SERVER["REQUEST_URI"], '?');
if (!isset($_GET["lang"]) || $_GET["lang"] != $Locale) {
    header('Location: ' . $path . '?lang=' . urlencode($Locale));
    die();
}
{/php}

What that does is redirect the article view page to the URL which ends with ?lang=en_US, for example http://www.site.com/journal/article/view/1402?lang=en_US. In case the user (or Googlebot for that matter) changes the language via the language picker, the parameter changes as well.

Here's hoping that Google will now see this as two separate URLs and index both.

The same thing could be done for the journal index page as well, I presume.

Obviously this is a pretty dirty solution, and I've yet to see if it's going to work. If it works, I'll update here in case anyone needs a quick solution until a prettier/more global one arrives.

jmvezic commented 6 years ago

So a little update: it seems that the above solution isn't working, or rather, Google doesn't follow the language switch redirect. It just throws a redirect error and says "excluded".

fgnievinski commented 2 years ago

while reading issue #7272 , it occurred to me the present issue (basically, offering language-specific article URLs) could perhaps be implemented in OJS3 as a theme. the information needed seems to have already been made available in the "smarty" template API:

(string) $currentLocale is the locale (language) the site is currently being viewed in. You’ll find an array of supported locales at $supportedLocales.

https://docs.pkp.sfu.ca/pkp-theming-guide/en/template-variables#site-journal-and-locale

the above could be used in conjunction with the "currentUrl" variable to extract an input URL GET parameter for the language code (e.g., hl=en):

https://docs.pkp.sfu.ca/pkp-theming-guide/en/advanced-custom-data

PS: maybe the present issue #699 should be renamed to something more specific, like "Offer language-specific article URLs", as web indexing is a broader issue, with other potential solutions, such as showing multilingual metadata on the same page #7272.

fgnievinski commented 2 years ago

leaving here some relevant guidelines for SEO:

Use different URLs for different language versions:

Google recommends using different URLs for each language version of a page rather than using cookies or browser settings to adjust the content language on the page.

If you use different URLs for different languages, use hreflang annotations to help Google search results link to the correct language version of a page.

Use the x-default tag for unmatched languages:

The reserved hreflang="x-default" value is used when no other language/region matches the user's browser setting. This value is optional, but recommended, as a way for you to control the page when no languages match. A good use is to target your site's homepage where there is a clickable map that enables the user to select their country.

Example:

Here is the HTML that would be in the <head> section of all the pages listed above. It would direct US, UK, generic English speakers, and German speakers to localized pages, and all others to a generic homepage. Google Search returns the appropriate result for the user, according to their browser setting.

<head>
  <title>Widgets, Inc</title>
  <link rel="alternate" hreflang="en-gb"
        href="http://en-gb.example.com/page.html" />
  <link rel="alternate" hreflang="en-us"
        href="http://en-us.example.com/page.html" />
  <link rel="alternate" hreflang="en"
        href="http://en.example.com/page.html" />
  <link rel="alternate" hreflang="de"
        href="http://de.example.com/page.html" />
  <link rel="alternate" hreflang="x-default"
        href="http://www.example.com/" />
</head>
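For a journal, tags like these could be generated from the list of supported locales. The following is only a sketch under stated assumptions: hreflangLinks() is a hypothetical helper, the example.com URLs are placeholders, and converting an OJS-style locale such as "uk_UA" to the hyphenated form "uk-UA" is an assumed mapping.

```php
<?php
// Hypothetical helper: render <link rel="alternate"> tags for each language version.
// $urlsByLocale maps a locale (e.g. "uk_UA") to the URL of that language version.
function hreflangLinks(array $urlsByLocale, string $defaultUrl): string
{
    $lines = [];
    foreach ($urlsByLocale as $locale => $url) {
        // hreflang values use a hyphen between language and region.
        $hreflang = str_replace('_', '-', $locale);
        $lines[] = sprintf(
            '<link rel="alternate" hreflang="%s" href="%s" />',
            htmlspecialchars($hreflang, ENT_QUOTES),
            htmlspecialchars($url, ENT_QUOTES)
        );
    }
    // x-default covers visitors whose language matches none of the versions.
    $lines[] = sprintf(
        '<link rel="alternate" hreflang="x-default" href="%s" />',
        htmlspecialchars($defaultUrl, ENT_QUOTES)
    );
    return implode("\n", $lines);
}

// Usage with placeholder URLs.
echo hreflangLinks(
    [
        'en_US' => 'https://example.com/journal/article/view/1',
        'uk_UA' => 'https://example.com/journal/article/view/1?locale=uk_UA',
    ],
    'https://example.com/journal/article/view/1'
), "\n";
```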

Mistakes to Avoid when Auto-redirecting: example

  • Use separate redirector pages solely for redirecting. Use 1 redirector page for each set of internationalized pages. In the example above, http://www.example.com/product is the redirector page for the set of 3 pages http://www.example.com/en/product.html, /fr/product.html and /es/product.html. (...)
  • Never automatically redirect a visitor (human or bot) that is trying to access a specific language version page that has content. In our example, that means never auto-redirecting when the page requested is one of http://www.example.com/en/product.html, /fr/product.html or /es/product.html.

jyhein commented 7 months ago

Issue description

The currently active language is stored in a cookie on the user's device and not in the URL. There are no URLs available for crawlers to follow while indexing the content. As a result, a search engine crawler can only index the article metadata in one language. The same applies to other content on the journal's homepage, for example the About the Journal section. The aim is to enable crawlers to find and index multilingual content on the journal homepage. This includes both a) multilingual article and issue metadata and b) other text content on the journal homepage.

Changes

Questions

Q: A:

The attached PRs include the following code changes:

PRs:
PKP: https://github.com/pkp/pkp-lib/pull/9628
OJS: https://github.com/pkp/ojs/pull/4146
OMP: https://github.com/pkp/omp/pull/1545
OPS: https://github.com/pkp/ops/pull/659
Crossref-ojs: https://github.com/pkp/crossref-ojs/pull/47
Crossref-ops: https://github.com/pkp/crossref-ops/pull/37
CitationStyleLanguage: https://github.com/pkp/citationStyleLanguage/pull/119
GoogleScholar: https://github.com/pkp/googleScholar/pull/19

Fix 1, see comment https://github.com/pkp/pkp-lib/issues/699#issuecomment-2083802046 below: PRs:

Fix 2 (in monolingual contexts, if one adds locale in the URL, then the URL changes to something like this context//something): PRs:

bozana commented 4 months ago

@jonasraoni, I would have a question regarding the WebFeed plugin: currently, with this implementation, the URLs in atom, rss, and rss2 would have the UI language in the URLs, e.g. in atom:

<id>http://ojs-dev.bb/index.php/publicknowledge/fr_CA/gateway/plugin/WebFeedGatewayPlugin/atom</id>
...
<link rel="alternate" href="http://ojs-dev.bb/index.php/publicknowledge/fr_CA" />
<link rel="self" type="application/atom+xml" href="http://ojs-dev.bb/index.php/publicknowledge/fr_CA/gateway/plugin/WebFeedGatewayPlugin/atom" />
...
<link rel="alternate" href="http://ojs-dev.bb/index.php/publicknowledge/fr_CA/article/view/17" />
<summary type="html" xml:base="http://ojs-dev.bb/index.php/publicknowledge/fr_CA/article/view/17">

This seems OK to me -- all data is presented localized, and a user could then choose the language in which to read the feeds. The one thing we may need to change in that case is the language element in rss and rss2 -- it always shows the journal's primary language. Do you think this could/should then also be changed to the UI language? -- I am not 100% sure what the language element needs to contain... Or do you think we should, for some reason, rather keep the old, normal URLs, without the UI language in them? :thinking: Thanks a lot!
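For readers following along, the URL shape in the feed example (context, then an optional locale, then the usual page/op/arguments) can be illustrated with a small parser. This is a sketch only: splitLocalizedPath() is hypothetical and is not how the merged PRs implement routing.

```php
<?php
// Hypothetical parser for path info shaped like /context[/locale]/page/op/args...
function splitLocalizedPath(string $pathInfo, array $supportedLocales): array
{
    // Drop empty segments produced by leading, trailing, or doubled slashes.
    $segments = array_values(array_filter(explode('/', $pathInfo), 'strlen'));
    $context = array_shift($segments);
    $locale = null;
    // Treat the second segment as a locale only if the context supports it,
    // so older URLs without a locale keep resolving the same way.
    if ($segments && in_array($segments[0], $supportedLocales, true)) {
        $locale = array_shift($segments);
    }
    return ['context' => $context, 'locale' => $locale, 'rest' => $segments];
}

// Usage with the path from the feed example above.
print_r(splitLocalizedPath('/publicknowledge/fr_CA/article/view/17', ['en', 'fr_CA']));
```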

jonasraoni commented 4 months ago

@bozana We can use the same format that we use on the <html> tag (source: https://www.rssboard.org/rss-language-codes), and the ATOM <feed> tag also supports the xml:lang="en" attribute.

jonasraoni commented 4 months ago

I didn't check the PRs, but it's good to ensure that old links are being properly redirected.

ajnyga commented 4 months ago

The old URLs work, and these are and should be used when article metadata is exported somewhere, for example to Crossref and DOAJ, or shown in OAI-PMH, because we of course cannot know how the journal will change its settings. RSS feeds are probably, as Bozana is thinking, different in this regard.

bozana commented 4 months ago

Hi @jonasraoni, as Antti-Jussi said, the old links will work. However, the new WebFeed URLs will contain the UI language, as in the example above. According to the https://www.rssboard.org/rss-language-codes:

The language employed in an RSS feed can be indicated in the language element,...

the language element should then also contain the UI language in the format ISO 639-1.
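A possible conversion from a UI locale to a feed language value, as a sketch. feedLanguage() is a hypothetical helper; it assumes the locale starts with a two-letter language code and simply lowercases and hyphenates it, in the style of the codes listed on the RSS Board page.

```php
<?php
// Hypothetical helper: turn a UI locale such as "fr_CA" into a feed language value.
// With $withRegion = false only the language part is returned (e.g. "fr").
function feedLanguage(string $locale, bool $withRegion = true): string
{
    // Drop a modifier such as "@latin" before splitting language and region.
    $base = explode('@', $locale)[0];
    $parts = explode('_', $base);
    $language = strtolower($parts[0]);
    if ($withRegion && isset($parts[1]) && $parts[1] !== '') {
        return $language . '-' . strtolower($parts[1]);
    }
    return $language;
}

// Usage with the locale from the feed example above.
echo feedLanguage('fr_CA'), "\n";
echo feedLanguage('fr_CA', false), "\n";
```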

EDIT: The issue that should address this: https://github.com/pkp/pkp-lib/issues/9910

bozana commented 4 months ago

Hi @jyhein, I took a look at the code once again and it looks good. Just that OMP and OPS are missing one change -- I left a comment in the PRs. Regarding ORCID: because it is currently being moved into the core, could you just provide the links to your changes in this issue, so that @ewhanson can consider them there: https://github.com/pkp/pkp-lib/issues/9771. Otherwise, you do not need to link to them in your PRs here. Then you can rebase everything (also the plugin submodules), create PRs for the plugin submodules/repositories (and link to the PRs here in this issue above), and consider all submodules (pkp, but also every plugin submodule) in the last commit. Then, when the tests pass, we can merge... :-) Thanks a lot!

bozana commented 4 months ago

Hi @jyhein (and maybe @ajnyga), what about the sitemap -- does it need to contain all languages? -- see https://developers.google.com/search/docs/specialty/international/localized-versions.

ajnyga commented 4 months ago

My thinking was that the sitemap would guide to the primary language (via the link without the language code) and each page would have further information for search engines in the page header.

But of course adding the links to that sitemap would be doable. Just leads to a massive sitemap of course in some cases.

bozana commented 4 months ago

Yes, let's leave it as it is for now... Also, as @jyhein said, it seems that only one of the 3 ways listed on that Google page needs to be supported... Thanks a lot!

bozana commented 4 months ago

All merged, thanks a lot!

asmecher commented 4 months ago

@bozana / @jyhein, I'm re-opening this because it breaks my installation (specifically https://github.com/pkp/pkp-lib/pull/9628). My local OJS is installed to http://localhost/git/ojs-main, and a typical URL into OJS is http://localhost/git/ojs-main/index.php/publicknowledge/article/view/mwandenga-signalling-theory.

With the PR applied, the path gets mixed into the path_info data. Going to http://localhost/git/ojs-main redirects me to http://localhost/git/ojs-main/index.php/git/ojs-main, and going to http://localhost/git/ojs-main/index.php/publicknowledge/article/view/mwandenga-signalling-theory redirects me to http://localhost/git/ojs-main/index.php/git/ojs-main/publicknowledge/article/view/mwandenga-signalling-theory.

The /git/ojs-main part after the index.php should not be there -- it's the installation directory and is already there before the index.php wrapper.
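The failure mode can be illustrated with a small sketch of deriving the path info from the request URI when the application lives in a subdirectory. This is not the fix that was merged: pathInfoFromUri() is a hypothetical helper, and it assumes the script name (e.g. $_SERVER['SCRIPT_NAME']) reports the installation directory plus index.php.

```php
<?php
// Hypothetical helper: return the part of the request path that follows the
// front controller, without the installation directory.
function pathInfoFromUri(string $requestUri, string $scriptName): string
{
    $path = (string) parse_url($requestUri, PHP_URL_PATH);
    // Installation directory, e.g. "/git/ojs-main"; empty for a root install.
    $baseDir = rtrim(dirname($scriptName), '/\\');
    if (strpos($path, $scriptName) === 0) {
        // The URL includes the index.php wrapper.
        return (string) substr($path, strlen($scriptName));
    }
    if ($baseDir !== '' && ($path === $baseDir || strpos($path, $baseDir . '/') === 0)) {
        // The URL omits index.php but still starts with the installation directory.
        return (string) substr($path, strlen($baseDir));
    }
    return $path;
}

// Usage with the installation layout described above.
echo pathInfoFromUri(
    '/git/ojs-main/index.php/publicknowledge/article/view/mwandenga-signalling-theory',
    '/git/ojs-main/index.php'
), "\n";
```

If the installation directory is not removed before the context and locale are read from the path, "git" and "ojs-main" end up being treated as path info, which matches the redirects described above.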

Can you test with the case where OJS is not installed in the server's root directory?

bozana commented 4 months ago

@asmecher, I have just merged the fix that @jyhein provided, so your installation should work correctly with the new code... :-)

asmecher commented 4 months ago

That works -- thanks, @jyhein and @bozana!