pkp / pkp-lib

The library used by PKP's applications OJS, OMP and OPS, open source software for scholarly publishing.
https://pkp.sfu.ca
GNU General Public License v3.0

OMP Chapter - Googlescholar support #7003

Open withanage opened 3 years ago

withanage commented 3 years ago

Introduction

Edited volumes require chapter landing pages to expose metadata and content to UI users, indexing systems and archiving systems, and for other purposes.

Requirements from Google Scholar

Background

Google Scholar (GS) drives end users to component-level content, e.g. chapters. GS needs a marker in the URL to determine the type of the standalone content.

For monographs, the standalone piece of content is the book itself, with its content and metadata. Chapters are not indexed.

For edited volumes, the standalone content is the individual chapters. The GS Indexing System (IS) matches full text for each chapter and drives end users to the chapters, thus increasing ranking.

Requirements

The GS IS needs an indication of the item type in the URL path (not in query parameters) and a top-level landing page.

Possible solution after discussions between @withanage, @ajnyga, @mwestin-googlescholar and @gemusehandler

We allow an alternative path for indicating the type. e.g.

catalog/book/19 will always lead to book 19 as in the current implementation

Additionally, URL-marked paths will indicate the type of the content to GS.

catalog/monograph/19 if monograph
catalog/edited-volume/19 if edited volume
catalog/edited-volume/19/chapter625
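A minimal sketch of how such type-marked paths could be resolved to the same book, assuming numeric IDs and the path markers proposed above. The function name `resolve_catalog_path` and the returned dictionary shape are hypothetical and are not part of OMP.

```python
import re

# Markers taken from the proposal above; "book" is the existing generic path
# and carries no type information.
TYPE_MARKERS = {
    "book": None,
    "monograph": "monograph",
    "edited-volume": "edited-volume",
}

# catalog/<marker>/<book id>, optionally followed by /chapter<chapter id>
PATH_RE = re.compile(
    r"^catalog/(?P<marker>[a-z-]+)/(?P<book_id>\d+)"
    r"(?:/chapter(?P<chapter_id>\d+))?/?$"
)


def resolve_catalog_path(path):
    """Return the book, declared type and chapter for a catalog path, or None."""
    match = PATH_RE.match(path)
    if match is None or match.group("marker") not in TYPE_MARKERS:
        return None
    chapter_id = match.group("chapter_id")
    return {
        "book_id": int(match.group("book_id")),
        "declared_type": TYPE_MARKERS[match.group("marker")],
        "chapter_id": int(chapter_id) if chapter_id else None,
    }
```

With this approach all three path styles lead to book 19; only the declared type differs, which is the information an indexer would read from the URL.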

robots.txt needs to reference a URL map (sitemap) listing those alternative paths.

User-agent: *
Disallow: /cache/
Sitemap: https://mypress.xyz/index.php/books/sitemap
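As a rough check, Python's standard `urllib.robotparser` can confirm that a robots.txt like the one above advertises the sitemap. This is only a sketch: the helper name `load_robots` is hypothetical, and the Disallow path is written with a leading slash because rules are matched against the URL path.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; the host is the placeholder used in this issue.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cache/
Sitemap: https://mypress.xyz/index.php/books/sitemap
"""


def load_robots(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


robots = load_robots(ROBOTS_TXT)

# site_maps() returns None when no Sitemap line is present (Python 3.8+).
for sitemap_url in robots.site_maps() or []:
    print("Sitemap advertised:", sitemap_url)
```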

Related ticket

Previous work by @ajnyga on chapters

Reference implementation by Ubiquity Press to support GS

Base URL: https://www.ubiquitypress.com/site/books/e/10.5334/bcj
Alternative URL: https://www.ubiquitypress.com/site/books/e/10.5334/bcg/
Here, e means Edited Volume.

withanage commented 3 years ago

Hi @asmecher and @NateWr, I have documented the status of the discussion with Google Scholar and a possible solution to resolve the path requirement defined by GS. Please let us know your thoughts on this.

@ajnyga, @mwestin-googlescholar and @gemusehandler, please add anything I forgot from the discussion.

ajnyga commented 3 years ago

I think we have three separate issues here:

  1. sitemap to robots.txt (I already made an issue about that here https://github.com/pkp/pkp-lib/issues/6965)
  2. chapter landing page and url structure
  3. changing the book landing url to include a reference to the book type

For 2, my suggestion would be to add an independent handler that would look like catalog/chapter/id. The main argument for this is that we also need to support versioning in the URL, and having both the path to the book and the chapter in the same URL will make this difficult. With an independent chapter URL structure, we would already have a solution to this: we would just use the same logic we use in the article and book handlers.

Concerning 3, Ubiquity seems to redirect URLs. For example, https://www.ubiquitypress.com/site/books/10.5334/bcj is used in the navigation and is registered with Crossref DOIs and probably also in other places. But clicking it will redirect to https://www.ubiquitypress.com/site/books/e/10.5334/bcj, thus giving Google the information they need.
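The redirect behaviour described here could be sketched roughly as follows. This is an illustration only, not Ubiquity's implementation: the function name `redirect_target` is hypothetical, and while the "e" marker appears in the URLs above, the "m" marker for monographs is an assumption.

```python
# Marker per book type. "e" is taken from the Ubiquity URLs quoted in this
# issue; "m" for monographs is an assumption made for illustration.
MARKERS = {"edited-volume": "e", "monograph": "m"}

BOOKS_PREFIX = "/site/books/"


def redirect_target(path, book_type):
    """Return the type-marked path to redirect to, or None if no redirect applies."""
    if not path.startswith(BOOKS_PREFIX):
        return None
    rest = path[len(BOOKS_PREFIX):]
    marker = MARKERS[book_type]
    if rest.startswith(marker + "/"):
        # The path already carries the marker, so serve the page directly.
        return None
    return f"{BOOKS_PREFIX}{marker}/{rest}"
```

The stable, unmarked URL stays the one registered with DOIs, while the indexer only ever sees the marked URL after the redirect.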

We have asked whether this data could be in the metadata of the HTML page, but Google informed us that they need the information before they start to read the page content.

ajnyga commented 3 years ago

Older issue on landing pages here: https://github.com/pkp/pkp-lib/issues/5280

NateWr commented 3 years ago

I'm sorry but I think we need to push back on the URL requirements to distinguish between monographs and edited volumes. GoogleScholar should not be dictating the URL structure of any site. They can ask for a HTTP address for every link they want to provide (eg - a URL for every chapter or galley), but setting requirements on the structure of the URL, beyond the constraints of the DNS system, violates the core principles of HTTP.

withanage commented 3 years ago

Hi @NateWr ,

Maybe the way I wrote the specs gave the impression that Google Scholar dictated the structure. @ajnyga and I thought it could serve as a proposal. GS actually wanted a marker in the URL to indicate whether the item is a chapter, an edited volume or a monograph; their only restriction is that they can't derive this information from the metadata. We are certainly free to decide the structure of the URL and then communicate it to them. I am very sorry for the misunderstanding.

ghost commented 3 years ago

Hi @NateWr ,

Many platforms and publishers, from Ubiquity to MIT Press and Taylor & Francis, have added URL markers distinguishing chapters from books and standalone monographs from edited volumes where chapters are standalone content. The string is entirely up to the publisher/platform-- I had suggested what other publishers have put in place that works well for indexing.

Adding various path markers to item-level URLs is totally unrelated to HTTP protocols. We see many different types of URL markers for publisher sites, both related and unrelated to indexing.

Cheers, Monica

gemusehandler commented 3 years ago

Hello @NateWr, @mwestin-googlescholar and @withanage,

Nate, I respect your position. If we stick to it (Google Scholar not telling us what to do) then I see two simple solutions.

In the administration I could use "monograph" as path. Using the same url I could add a second press where I would use "edited-volume". It is a bit clumsy to have two presses where it could be one but hey! It would work, right? See here the simple "monograph"-path solution: https://bookrxiv.ac/index.php/monograph/index

But I see another solution as well. I spotted that the PDF reader adds some additional information to the download URL. There is ?inline=1 at the end, like: https://bookrxiv.ac/index.php/monograph/catalog/download/6/16/16?inline=1

I played around a bit with the Google Scholar plugin and noticed that I can add _monograph to the download link as well. My browser doesn't warn of additional errors and the link that is produced works as it should.

If this would indeed work, then we need some code that tells the plugin: if workType=1 add _edited-volume, if workType=2 add _monograph

I get the feeling that this issue can be resolved within the plugin...

However, now I bumped into something strange. The first books I uploaded to OMP (lower numbers) are doing fine.

But the URLs that appear in the "meta name" tags from book 12 onwards produce a 404 page.

So I have two questions. Is the way that the download link is generated in the Scholar Plugin correct? If not it could explain some of our problems. Could the adding of _monograph or _edited-volume work as a solution?

NateWr commented 3 years ago

No apology needed, @withanage, your description was perfectly clear. My concern is not with the specific proposal (monograph vs edited-volume), but with the requirement itself.

URLs are an important part of UX, and something that we are actively trying to improve (#5932). URLs need to be clear and concise for humans, and Google Scholar's insistence on a fixed string to parse data from a URL is not compatible with what we need to do for our community.

At the moment, our URLs are hard-coded in English. But this will not always be the case. A Spanish journal or press needs to be able to run their site with URLs in Spanish. Eventually, we will support localised routes, such as /books/1 becoming /libros/1. We also want to allow journals and presses to use the URL paths that best describe their content. An English-language journal may want to have URLs for their published articles at /dispatches/1 or /reports/1 instead of /articles/1.

In the long-term, it is not viable for us to fix the URL structure to an English language word (/edited-volume/). The alternative, using an arbitrary marker (/book/e/1), may be a temporary solution, but it is a form of URL obfuscation that runs against the UX we are working towards (URL paths should be comprehensible and navigable).

The expectation that Google Scholar can identify a resource from a URL structure is premised on a publishing oligopoly in which there are a small number of publishing systems with fixed URLs. Unfortunately, this is not compatible with our vision of a distributed, multilingual publishing infrastructure.

@mwestin-googlescholar it's my understanding from talking with colleagues that Google Scholar is not willing to budge on this point. But from my perspective the clock is running out. If GS is unwilling to prioritize adaptations that are important to us, for example removing the requirement to include index.php, which makes OJS journals look like blogs from 2004, I'm not sure what else we can do (this is a serious reputational issue for us). We'd be willing to bend over backwards to build tools to support site registration like Google Webmaster Tools or depositing with Google Scholar using the kind of distributed API-driven architecture that has become standard in scholarly publishing. But we need to move on eventually.

ghost commented 3 years ago

Hi all-- totally understood if this isn't a feature you want to pursue for OMP. As I had mentioned, the string for URL markers for book indexing could be any string you like, i.e. no need for using any particular language.

If you'd ever like to explore this as a future feature for OMP, just let me know.

Cheers, Monica

asmecher commented 3 years ago

@mwestin-googlescholar, this is a feature we'd like to pursue -- but we're looking for alternatives to deriving the kind of information you need from URLs, when that's a big imposition on our community and is at odds with longer-term goals. We can't and don't want to exert control over our community; the index.php marker is the best example of this -- our users understandably want to remove that and we can't stop them. Is there some alternative to building this data into URLs, e.g. an API-based exchange, meta tags, XML site map, etc?

ghost commented 3 years ago

Hi @asmecher,

Happy to speak further about this, but the short version is that the only way to set this up well and consistently is within book/chapter-level URLs themselves. To handle different types of books, the indexing system needs to know how to treat the item before it analyzes the metadata.

If this ends up being a direction your community wants to go in, happy to pick this conversation back up.

marcbria commented 3 years ago

Hi @mwestin-googlescholar,

I'm following this thread with much attention.

I have been an active member of the PKP community for the last decade, which led to my joining the PKP technology committee as an "at-large" member. I provide support on local networks (Spain) as well as Latin American ones, and I offer support to the community in the forum.

I don't intend to present a CV here, but I thought it was necessary to explain where I am speaking from when I say that I know the needs of the community well.

That said, I want to note that there is no "if" about whether "this ends up being a direction your community wants to go in". It is simply something that is already happening, so I would ask that this conversation not be delayed any further and that you support the line of work that Alec and Nate are proposing.

ghost commented 3 years ago

Hi @marcbria,

That is great to hear! Just to be crystal clear, this direction would need to involve URL markers in the URL paths themselves. As I mentioned a few times, always happy to continue the conversation.

marcbria commented 3 years ago

Sorry @mwestin-googlescholar but my English is not good and I'm probably missing something.

Does it mean that Google is telling the PKP community how things need to be done?

Because "need to involve" does not sound like a good start for an open conversation.

ghost commented 3 years ago

I think we're talking in circles a bit here. Let me put it a different way: to systematically and accurately distinguish chapters from books, and different book types from each other, the indexing system can only work with URL markers. Unfortunately there isn't another way that works well. I wish there were, as I know adding URL markers can be difficult for a few different reasons.

To be clear: the indexing system will still index OMP publications without these URL markers. This is a potential future improvement project to further refine how we work together to index books and chapters of different kinds.

There is no pressure at all from my side to implement this refinement - I am sharing answers to questions I've been asked (so the comment above feels a little uncomfortable to me).

I hope that clears things up.

ajnyga commented 3 years ago

I have been looking at how Springer, Elsevier and Taylor & Francis work with their URLs. I will divide the solutions into a few categories and make a suggestion for how we could work in OMP.

URL markers

Springer and Elsevier do not seem to add URL markers for the book type, or at least I could not find them. Springer (Springer Link) and Elsevier (ScienceDirect) have independent handlers for showing single chapters:

Springer: /book/id and /chapter/id
Elsevier: /book/id and /science/article/id

Taylor & Francis uses /edit/ and /mono/, and also /oa-edit/, to distinguish between different book types. These are visible in the URLs leading to both the book landing pages and the chapter landing pages. Books have a URL like /books/edit/id and chapters /chapters/edit/id.

Metadata tags (Highwire Press)

Taylor & Francis will add metadata tags to landing pages:

Effectively this is saying, "for monographs just index the main book page and for edited volumes just index the chapter pages". I will use this idea in my solution below.

Elsevier and Springer seem never to add metadata tags to book main landing pages, for either monographs or edited volumes. They add them only to chapter pages, regardless of whether the book is a monograph or an edited volume.

Elsevier also adds <meta name="robots" content="INDEX,FOLLOW,NOARCHIVE,NOODP,NOYDIR" /> to book and chapter pages.

All publishers use these (among others) on chapter landing pages to determine that this is a chapter and the relation to a book:

<meta name="citation_firstpage" content="1" />
<meta name="citation_lastpage" content="2" />
<meta name="citation_inbook_title" content="Book title" />
<meta name="citation_title" content="Chapter title" />
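A minimal sketch of how a chapter landing page could render these four tags, with attribute values HTML-escaped. The function name `chapter_meta_tags` and its parameters are hypothetical; only the tag names come from the list above.

```python
from html import escape


def chapter_meta_tags(chapter_title, book_title, first_page, last_page):
    """Build the chapter-level citation meta tags as a list of HTML strings."""
    fields = [
        ("citation_firstpage", str(first_page)),
        ("citation_lastpage", str(last_page)),
        ("citation_inbook_title", book_title),
        ("citation_title", chapter_title),
    ]
    # quote=True also escapes double quotes so values are safe inside attributes.
    return [
        f'<meta name="{name}" content="{escape(value, quote=True)}" />'
        for name, value in fields
    ]


for tag in chapter_meta_tags("Chapter title", "Book title", 1, 2):
    print(tag)
```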

And Elsevier also uses these to define the type of the chapter (could not find anything similar from others):

<meta name="citation_type" content="CHP" />
<meta name="citation_chapter_type" content="edited-volume" />
<meta name="citation_article_type" content="Simple chapter" />

Sitemaps

Although Taylor & Francis does not add meta tags to monograph chapters, interestingly their sitemap seems to list only chapter URLs, including for monographs. I could not find any direct URLs leading to book landing pages at all. But there are several sitemaps, so it could be that I just did not find the right one.

The Springer sitemap has URLs leading to book main landing pages. I could not find links leading to single chapters. For Springer Link I could not find a sitemap, so this describes the main Springer site.

The Elsevier sitemap has URLs leading to both books and single articles/chapters.

Suggestion

So what I have heard is that the problem in OMP indexing lies in the fact that we do not distinguish between monographs and edited volumes, so Google does not know how to handle their chapters. As I understand it, indexing monographs already works, but for edited volumes Google would like to serve hits only to single articles/chapters. Altering the URL structure is the solution proposed by Google to solve this and is used by Taylor & Francis.

However, they also have other solutions in place, namely the way they show metadata in different cases. My suggestion is that OMP would work like this.

OMP core functionalities

OMP Google Scholar plugin functionalities

This should lead to a situation where for monographs only the book main page is being indexed and for edited volumes only the chapters are indexed.
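The intended outcome can be expressed as a small decision rule. This is a sketch only: the function name `should_emit_scholar_tags` and the string values for work type and page kind are hypothetical, not OMP constants.

```python
def should_emit_scholar_tags(work_type, page_kind):
    """Decide whether a page should carry Google Scholar meta tags.

    work_type: "monograph" or "edited-volume"
    page_kind: "book" (main landing page) or "chapter" (chapter landing page)
    """
    if page_kind not in ("book", "chapter"):
        raise ValueError(f"unknown page kind: {page_kind}")
    if work_type == "monograph":
        # Monographs are indexed as a whole, via the book landing page.
        return page_kind == "book"
    if work_type == "edited-volume":
        # Edited volumes are indexed chapter by chapter.
        return page_kind == "chapter"
    raise ValueError(f"unknown work type: {work_type}")
```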

edit: of course if I have misunderstood some of the requirements, let me know

NateWr commented 3 years ago

This is great, thanks @ajnyga! So if I understand correctly, the key distinction is where the Google Scholar meta tags are placed. For monographs, they're only placed on the book landing page. For edited volumes, they're only placed on the chapter pages. Is that right?

"For monograph chapter landing pages, add a robots rule that will prevent them being indexed."

I don't think we should do this because every public URL should be indexable by general search engines. What Google Scholar needs shouldn't override the general compatibility with search engines.

Maybe this has already been discussed, but chapter pages are going to be opt-in for monographs, right? So by default my monograph will only have the one page (/book/1), but I can enable a page for individual chapters if I want.

ajnyga commented 3 years ago

Yes, exactly. This is the way T&F is handling it, besides having the URL markers. Elsevier and Springer have tags only on chapter pages for both book types and never on the book landing page. I think this means that they are targeting Google Scholar indexing this way and just trying to make sure individual chapters end up there.

Yes, I am aware of the downsides of robots rules. We could of course limit them to Googlebot. Too bad there is no separate bot rule for Google Scholar. It would make this very easy and exact.

Making chapter landing pages optional for monographs is probably a good idea. Many monographs might just have the chapter metadata available, but no chapter specific full text.

nongenti commented 3 years ago

At UB Heidelberg we're working on a chapter landing page plugin for OMP 3.3. For now, all chapters with a 'deposited' or 'marked registered' DOI automatically get their own landing page, but we're discussing making this configurable.

The path is /book/book_id/c+chapter_id. Maybe we can make the part after the book_id configurable too.

Without these configurable parts we have a working solution on our development system. Only the template is not finished yet.

NateWr commented 3 years ago

That sounds interesting, @nongenti. I'd recommend using a full chapter path /book/{id}/chapter/{cid} or /book/{id}/c/{cid} instead of c+{cid}. That will match the existing structure for versions (/book/{id}/version/{vid}) and keep the path parts independent.
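To illustrate why independent path parts are easier to handle, here is a rough sketch of a route parser for the suggested structure. It assumes numeric IDs, and the function name `parse_book_route` is hypothetical; it is not the OMP router.

```python
import re

# /book/{id}, optionally followed by /chapter/{cid}, /c/{cid} or /version/{vid}
ROUTE_RE = re.compile(
    r"^/book/(?P<book_id>\d+)"
    r"(?:/(?P<kind>chapter|c|version)/(?P<sub_id>\d+))?/?$"
)


def parse_book_route(path):
    """Return the IDs encoded in a book path, or None if it does not match."""
    match = ROUTE_RE.match(path)
    if match is None:
        return None
    route = {"book_id": int(match.group("book_id"))}
    kind = match.group("kind")
    if kind in ("chapter", "c"):
        route["chapter_id"] = int(match.group("sub_id"))
    elif kind == "version":
        route["version_id"] = int(match.group("sub_id"))
    return route
```

Because each segment is either a keyword or an ID, chapters and versions share one pattern, whereas a fused segment such as c+{cid} would need its own parsing rule.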