sul-dlss / purl

URL resolver that translates a reference to a digital object in the form of a druid, into a full content representation of that object as available

Confirm how to get Purls indexed in Google Scholar #818

Closed lwrubel closed 11 months ago

lwrubel commented 11 months ago

Review Google Scholar's Inclusion Guidelines for Webmasters (see Crawl section in particular) and identify any further steps we need to take to get repository content indexed.

peetucket commented 11 months ago

Here are the criteria from the guidelines; I will analyze them in a follow-up comment:

Individual Authors:

If you're an individual author, it works best to simply upload your paper to your website, e.g., www.example.edu/~professor/jpdr2009.pdf; and add a link to it on your publications page, such as www.example.edu/~professor/publications.html. Make sure that:

    the full text of your paper is in a PDF file that ends with ".pdf",
    the title of the paper appears in a large font on top of the first page,
    the authors of the paper are listed right below the title on a separate line, and
    there's a bibliography section titled, e.g., "References" or "Bibliography" at the end.

That's it! Our search robots should normally find your paper and include it in Google Scholar within several weeks.

If it doesn't work, you could either (1) read more detailed technical guidelines in this documentation or (2) check if your local institutional repository is already configured for indexing in Google Scholar, and upload your papers there.

University Repositories

If you're a university repository, we recommend that you use the latest version of Eprints (eprints.org), Digital Commons (digitalcommons.bepress.com), or DSpace (dspace.org) software to host your papers.

If you use a less common hosting product or service, or an older version of these, please read this entire documentation and make sure that your website meets our technical guidelines.

peetucket commented 11 months ago

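Assuming all crawling is done on PURL, here is how we measure up against each of the individual-author criteria. Example PURL with a PDF: https://purl.stanford.edu/bb007hx5508

    Full text in a PDF file that ends with ".pdf": Yes, but there can be other PDFs on the PURL that are not the paper.
    Title in a large font at the top of the first page: PURL page: Yes, it is in an H1 tag. PDF: Yes, I expect it will be for a manuscript.
    Authors listed right below the title on a separate line: PURL: No, the authors are not right below the title. They are in the "Creators" section, which is under Access and Description. PDF: Probably (for manuscript PDFs, which is what we are concerned about).
    Bibliography section at the end: PURL: No. PDF: Probably (for manuscript PDFs, which is what we are concerned about).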

peetucket commented 11 months ago

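On the recommended repository software (Eprints, Digital Commons, DSpace): nope, we don't use any of these.

On reading the entire documentation and meeting the technical guidelines: OK, Google.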

peetucket commented 11 months ago

Summary:

lwrubel commented 11 months ago

I don't think it does, but could you confirm nothing changes with the new Purl design? I see it does not move the authors up under the title, and I think that's the only relevant layout aspect.

Are there PDFs that would be relevant to Google Scholar which use the PDF viewer instead of the file viewer, and if so, does that have any effect on the criteria? The PDF viewer gets used for content type of document. (See the 16,580 of them in Argo).

Here's an example that I think is something we'd want in Google Scholar: https://purl.stanford.edu/bf313fs1595. It's a Stanford law journal article and Google Scholar only shows the subscription HeinOnline version currently.

peetucket commented 11 months ago

> I don't think it does, but could you confirm nothing changes with the new Purl design? I see it does not move the authors up under the title, and I think that's the only relevant layout aspect.

It shouldn't. I think the "authors below the title" requirement is actually for the OCRed PDF itself and not the webpage.

> Are there PDFs that would be relevant to Google Scholar which use the PDF viewer instead of the file viewer, and if so, does that have any effect on the criteria? The PDF viewer gets used for content type of document. (See the 16,580 of them in Argo).

Since the PDF viewer still provides a link to download the PDF file itself (e.g. see https://purl.stanford.edu/bf313fs1595 or https://purl.stanford.edu/bb009sj6832) in the download panel, it should still be OK (assuming Google just scans the HTML page for links to the PDF).

But it's hard to know exactly how Google's crawlers will deal with it until we open the doors and let them into PURL.
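As a rough illustration of that assumption (a crawler that only scans the HTML page for links to PDF files), here is a minimal Python sketch. It is not how Google's crawler actually works; the sample HTML, the file name `article.pdf`, and the helper names `PdfLinkCollector` and `find_pdf_links` are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkCollector(HTMLParser):
    """Collects absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL before checking the path.
        absolute = urljoin(self.base_url, href)
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)


def find_pdf_links(html, base_url):
    """Return every PDF link found in the given HTML, in document order."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_links


# Hypothetical stand-in for a PURL page with a download panel.
SAMPLE_HTML = """
<html><body>
  <h1>Example article title</h1>
  <a href="/about">About</a>
  <a href="https://stacks.stanford.edu/file/druid:bf313fs1595/article.pdf">Download</a>
</body></html>
"""

links = find_pdf_links(SAMPLE_HTML, "https://purl.stanford.edu/bf313fs1595")
print(links)
```

If a check like this finds no `.pdf` link in the rendered page source (for example, because the link is only injected by JavaScript), that would be worth investigating before relying on the crawler.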

I will add one more bullet point to my summary results above: I am quite certain the PDF file needs to contain actual text (either because it was converted from Word or another format that already had the text, or because it was OCRed).

andrewjbtw commented 11 months ago

Do we have a robots exclusion on Stacks? Or anything that would block a crawler from coming in from the purl to get the PDF?

peetucket commented 11 months ago

> Do we have a robots exclusion on Stacks? Or anything that would block a crawler from coming in from the purl to get the PDF?

Stacks appears to be wide open for crawling (see below), but I believe that even if it were blocked, the spider should still be able to follow the PDF link, since that is not really a blind crawl but rather just following an inbound link.

https://stacks.stanford.edu/robots.txt

# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
# User-agent: *
# Disallow: /
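
One way to double-check the "wide open" reading is to feed the file to a standard robots.txt parser. The sketch below uses Python's `urllib.robotparser` on the text quoted above, and compares it with a variant where the two lines are uncommented. The PDF URL (in particular the file name `article.pdf`) and the helper `is_crawlable` are hypothetical, and the result only reflects how a crawler that honors robots.txt would be expected to behave.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt body as quoted above: every directive is commented out.
CURRENT_ROBOTS_TXT = """\
# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
# User-agent: *
# Disallow: /
"""

# Same file with the two directive lines uncommented, for comparison.
BLOCKING_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical PDF URL on Stacks, used only for illustration.
PDF_URL = "https://stacks.stanford.edu/file/druid:bf313fs1595/article.pdf"

current_allowed = is_crawlable(CURRENT_ROBOTS_TXT, PDF_URL)
blocked_allowed = is_crawlable(BLOCKING_ROBOTS_TXT, PDF_URL)
print(current_allowed)   # True: no active rules, so nothing is disallowed
print(blocked_allowed)   # False: "Disallow: /" applies to every user agent
```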