Confirm how to get Purls indexed in Google Scholar

lwrubel commented 1 year ago

Review Google Scholar's Inclusion Guidelines for Webmasters (see Crawl section in particular) and identify any further steps we need to take to get repository content indexed.

peetucket commented 1 year ago

Here are the relevant criteria from the guidelines; I will analyze them in the next comment:

Individual Authors:

If you're an individual author, it works best to simply upload your paper to your website, e.g., www.example.edu/~professor/jpdr2009.pdf; and add a link to it on your publications page, such as www.example.edu/~professor/publications.html. Make sure that:

    the full text of your paper is in a PDF file that ends with ".pdf",
    the title of the paper appears in a large font on top of the first page,
    the authors of the paper are listed right below the title on a separate line, and
    there's a bibliography section titled, e.g., "References" or "Bibliography" at the end.

That's it! Our search robots should normally find your paper and include it in Google Scholar within several weeks.

If it doesn't work, you could either (1) read more detailed technical guidelines in this documentation or (2) check if your local institutional repository is already configured for indexing in Google Scholar, and upload your papers there.

University Repositories

If you're a university repository, we recommend that you use the latest version of Eprints (eprints.org), Digital Commons (digitalcommons.bepress.com), or DSpace (dspace.org) software to host your papers.

If you use a less common hosting product or service, or an older version of these, please read this entire documentation and make sure that your website meets our technical guidelines.

peetucket commented 1 year ago

Assume all crawling is done on PURL. Example PURL with a PDF: https://purl.stanford.edu/bb007hx5508

the full text of your paper is in a PDF file that ends with ".pdf"

Yes, but there can be other PDFs that are not the paper.

the title of the paper appears in a large font on top of the first page

PURL page: Yes, it does. It's in an H1 tag. PDF: Yes, I expect it will for a manuscript.

the authors of the paper are listed right below the title on a separate line

PURL: No, the authors are not right below. They are in the "Creators" section, which is under Access and Description. PDF: Probably (for manuscript PDFs, which is what we are concerned about).

there's a bibliography section titled, e.g., "References" or "Bibliography" at the end

PURL: No. PDF: Probably (for manuscript PDFs, which is what we are concerned about).

peetucket commented 1 year ago

If you're a university repository, we recommend that you use the latest version of Eprints (eprints.org), Digital Commons (digitalcommons.bepress.com), or DSpace (dspace.org) software to host your papers.

Nope, we don't use any of these.

If you use a less common hosting product or service, or an older version of these, please read this entire documentation and make sure that your website meets our technical guidelines.

OK Google.

peetucket commented 1 year ago

Summary:

For PURLs with a single PDF that is a manuscript, indexing into Google Scholar should most likely work, assuming the PURL page is indexed (i.e. not blocked by robots.txt).
For PURLs with multiple PDFs, one of which is a manuscript, indexing into Google Scholar may work, assuming the PURL page is indexed (i.e. not blocked by robots.txt).
For PURLs with no PDFs, nothing will happen.
If the PDF of the manuscript on PURL is in a non-standard format (i.e. first page is not a title page with authors below and there are no references or bibliography), it will be ignored by Google Scholar.
Any PDF file likely needs to have text within it (either because it was converted from Word or another format that already had the text, or because it was OCRed).
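
As a rough illustration, the summary above can be written as a small decision function. This is only a sketch of the reasoning in this comment, not a description of how Google Scholar actually decides. The function name, parameter names, and outcome labels are all hypothetical.

```python
def expected_scholar_outcome(
    pdf_count,
    has_manuscript_pdf,
    page_crawlable=True,
    standard_layout=True,
    has_text_layer=True,
):
    """Return a rough guess ("likely", "possible", or "none") for a PURL.

    All inputs are assumptions about the PURL being evaluated:
    - pdf_count: number of PDF files attached to the PURL
    - has_manuscript_pdf: one of those PDFs is a manuscript
    - page_crawlable: the PURL page is not blocked by robots.txt
    - standard_layout: title page with authors below, references at the end
    - has_text_layer: the PDF contains text (born digital or OCRed)
    """
    # No PDFs, or no manuscript among them: nothing for Scholar to pick up.
    if pdf_count == 0 or not has_manuscript_pdf:
        return "none"
    # The crawler has to be able to reach the PURL page in the first place.
    if not page_crawlable:
        return "none"
    # A non-standard layout or an image-only PDF is expected to be ignored.
    if not standard_layout or not has_text_layer:
        return "none"
    # A single manuscript PDF is the best case; several PDFs add ambiguity.
    return "likely" if pdf_count == 1 else "possible"


if __name__ == "__main__":
    print(expected_scholar_outcome(1, True))
    print(expected_scholar_outcome(3, True))
    print(expected_scholar_outcome(0, False))
```

The three labels are deliberately coarse, since the actual behavior will not be known until the crawlers are allowed in.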

lwrubel commented 1 year ago

I don't think it does, but could you confirm nothing changes with the new Purl design? I see it does not move the authors up under the title, and I think that's the only relevant layout aspect.

Are there PDFs that would be relevant to Google Scholar which use the PDF viewer instead of the file viewer, and if so, does that have any effect on the criteria? The PDF viewer gets used for content type of document. (See the 16,580 of them in Argo).

Here's an example that I think is something we'd want in Google Scholar: https://purl.stanford.edu/bf313fs1595. It's a Stanford law journal article and Google Scholar only shows the subscription HeinOnline version currently.

peetucket commented 1 year ago

I don't think it does, but could you confirm nothing changes with the new Purl design? I see it does not move the authors up under the title, and I think that's the only relevant layout aspect.

It shouldn't. I think the "authors below the title" requirement is actually for the OCRed PDF itself and not the webpage.

Are there PDFs that would be relevant to Google Scholar which use the PDF viewer instead of the file viewer, and if so, does that have any effect on the criteria? The PDF viewer gets used for content type of document. (See the 16,580 of them in Argo).

Since the PDF viewer still provides a link to download the PDF file itself (e.g. see https://purl.stanford.edu/bf313fs1595 or https://purl.stanford.edu/bb009sj6832) in the download panel, it should still be OK (assuming Google just scans the HTML page for links to the PDF).

But it's hard to know exactly how Google's crawlers will deal with it until we open the doors and let them into PURL.
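
To make the "scans the HTML page for links to the PDF" assumption concrete, here is a minimal Python sketch that collects links ending in ".pdf" from a page. It uses only the standard library. The sample HTML and the example.edu URLs are made up for illustration and are not the real PURL markup, and this is not a claim about how Google's crawler works.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Look only at the path so query strings do not hide the extension.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)


def find_pdf_links(html, base_url):
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_links


if __name__ == "__main__":
    # Hypothetical page fragment standing in for a download panel.
    sample = (
        '<a href="/about">About</a>'
        '<a href="/file/abc123/article.pdf">Download</a>'
    )
    for link in find_pdf_links(sample, "https://purl.example.edu/abc123"):
        print(link)
```

If a download link is only added to the page by JavaScript, a simple scan like this would miss it, which is one reason the real crawler's behavior is hard to predict.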

I will add one more bullet point to my summary results above, which is that I am quite certain the PDF file needs to have text within it (either because it was converted from a Word or other format that had the text already, or it was OCRed).

andrewjbtw commented 1 year ago

Do we have a robots exclusion on Stacks? Or anything that would block a crawler from coming in from the purl to get the PDF?

peetucket commented 12 months ago

Do we have a robots exclusion on Stacks? Or anything that would block a crawler from coming in from the purl to get the PDF?

Stacks appears to be wide open for crawling (see below) but I believe that even if it was blocked, the spider should be able to follow the PDF link anyway, since that's not really a blind crawl, but rather just following an inbound link.

https://stacks.stanford.edu/robots.txt

# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
# User-agent: *
# Disallow: /
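
For anyone who wants to double-check the reading of that file, here is a minimal Python sketch using the standard library's `urllib.robotparser`. It parses the robots.txt text quoted above and, for comparison, the variant with the two lines uncommented. The PDF URL path and the "Googlebot" user agent string are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# The file as quoted above: every line is a comment, so no rules apply.
ROBOTS_AS_PUBLISHED = """\
# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
# User-agent: *
# Disallow: /
"""

# What the file would say if the last two lines were uncommented.
ROBOTS_IF_UNCOMMENTED = """\
User-agent: *
Disallow: /
"""

# Hypothetical file path, used only to have something to test against.
PDF_URL = "https://stacks.stanford.edu/file/example/paper.pdf"


def crawler_allowed(robots_txt, url, user_agent="Googlebot"):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(crawler_allowed(ROBOTS_AS_PUBLISHED, PDF_URL))
    print(crawler_allowed(ROBOTS_IF_UNCOMMENTED, PDF_URL))
```

This only checks what the robots.txt rules permit. It says nothing about whether a given crawler chooses to follow the link.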

sul-dlss / purl

Confirm how to get Purls indexed in Google Scholar #818