sfb1451 / metadata-catalog

The SFB 1451 data portal (metadata catalog)
https://data.sfb1451.de

Index datasets with Google's Dataset Search #78

Open jsheunis opened 9 months ago

jsheunis commented 9 months ago

See: https://github.com/datalad/datalad-catalog/issues/20

jsheunis commented 9 months ago

The PR added the components that were supposedly necessary for supporting rich results and page indexing, but this didn't seem to have the desired effect. After the PR was merged, the data.sfb1451.de site was added to Google's Search Console, the site was verified (by adding a custom meta tag to the index.html page), and the sitemap (sitemap.txt) was submitted via the Search Console to allow indexing of the list of pages available at the catalog site.

As of the time of this comment, the indexing process is still in progress. Also, using the rich results test, none of the tested URLs (from the list in the sitemap, i.e. live pages on the catalog site) return a successful check for rich results, even though the pages in question do show the necessary structured data in a script tag in the head of the HTML document when inspected via dev tools in the browser.

After this became evident, a lot of debugging followed. The issues experienced suggested that adding the sitemap.txt and robots.txt files to the catalog site might have caused the failed rich results test on specific pages. Supporting this point, another staging site (jsheunis.github.io/sfb145-projects-catalog) returned a successful result on the test without having those text files committed. But two points seem to speak against this assessment:

So, some more investigation followed. I looked into the possibility that the particular framework, VueJS with Vue-router, is causing these failures in the rich results tests. There are some online comments about latency in client-side rendering possibly causing Google's crawlers not to render and index the pages on the first try, forcing them to come back at a later stage to do so (if/when they can). The latency could be a result of JavaScript execution and front-end rendering, of asynchronous data fetching calls, or of both; both are applicable in the case of the VueJS app. Perhaps the rich results tests struggle to get all the necessary information on the first try in this scenario.

More investigation showed that there are known issues with search engine optimization processes (crawling / indexing) and single-page apps such as the VueJS application of the catalog site (e.g.: https://madewithvuejs.com/blog/how-to-make-vue-js-single-page-applications-seo-friendly-a-beginner-s-guide). Suggestions to mitigate this include using "history mode" for vue-router, using server-side rendering (https://vuejs.org/guide/scaling-up/ssr.html), or prebuilding the entire application into a set of static HTML files (e.g. with Nuxt.js).
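
For reference, a minimal sketch of what switching vue-router from the default hash mode to history mode could look like, assuming vue-router 4 (vue-router 3 uses the mode: 'history' option instead); the route and component names are made up and this is not the catalog's actual router config:

import { createRouter, createWebHistory } from 'vue-router'
import DatasetView from './components/DatasetView.vue' // hypothetical component

const router = createRouter({
  // Hash mode (createWebHashHistory) yields URLs like /#/dataset/abc, which crawlers may skip.
  // History mode yields clean URLs like /dataset/abc, but then the host has to serve index.html
  // for every route (i.e. a redirect/rewrite), which GitHub Pages does not support.
  history: createWebHistory(),
  routes: [
    { path: '/dataset/:dataset_id', component: DatasetView },
  ],
})

export default router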

Some investigation into history mode brought this issue to the fore: https://stackoverflow.com/questions/65501787/vue-router-github-pages-and-custom-domain-not-working-with-routed-links. In my own local tests with history mode, I encountered the same issue of getting a 404 when the URL contains parameters that should open a component view in the browser.

Some relevant links/issues:

To be continued

jsheunis commented 9 months ago

A new day, a new surprise. Here are several recent successful tests for pages with rich results on the https://data.sfb1451.de site:

From this, I will assume that the component that loads structured data into a dataset page works for now (although it could probably still be improved with respect to VueJS rendering latency, asynchronous calls, history mode, etc.). I am, however, not sure why it works. It could be a delayed result of removing the sitemap and robots files, though I doubt that. It could be that Google's tests are erratic; I don't know. It could be some interference between the request to start indexing the site and the live tests for rich results.
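
For context, the general idea behind such a component is to inject schema.org Dataset markup as JSON-LD into the document head, which is what Google Dataset Search looks for. A minimal sketch of that idea (the field names and the helper are hypothetical, not the catalog's actual code):

// Build a schema.org Dataset object from (hypothetical) catalog metadata
function addDatasetStructuredData(dataset) {
  const structuredData = {
    '@context': 'https://schema.org/',
    '@type': 'Dataset',
    name: dataset.name,
    description: dataset.description,
    url: window.location.href,
  };
  // Inject it as a JSON-LD script tag into the page head, where the rich results test expects it
  const script = document.createElement('script');
  script.setAttribute('type', 'application/ld+json');
  script.textContent = JSON.stringify(structuredData);
  document.head.appendChild(script);
}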

Anyway, IMO the next step is to wait for the indexing to finish before we can test Google Dataset Search.

What I will work on in the meantime is figuring out whether a sitemap is necessary (or just useful) for the crawlers to index the catalog's pages, or whether an alternative is possible that doesn't require maintaining a registry of all pages in a catalog, since that would run counter to the idea of a decentralized catalog that can be contributed to without an overseeing maintainer.

jsheunis commented 9 months ago

A report from Lighthouse:

Lighthouse Report Viewer.pdf

Some screenshots:

Screenshot 2024-01-25 at 16 28 02

jsheunis commented 9 months ago

Making links crawlable: https://developers.google.com/search/docs/crawling-indexing/links-crawlable

So basically, links need an href, and many/most of the links in a catalog do not have one, since they use JavaScript onclick or VueJS @click handlers instead.
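
To illustrate, a hedged sketch of how a catalog link could be made crawlable in a Vue template (the component data, handler name, and URL scheme are made up):

<!-- Not crawlable: no href, so the crawler has no URL to follow -->
<a @click="openDataset(dataset)">{{ dataset.name }}</a>

<!-- Crawlable: href gives the crawler a real URL, while @click.prevent (i.e. preventDefault)
     keeps the browser from navigating, so the JS handler still decides what happens on click -->
<a :href="'/dataset/' + dataset.id" @click.prevent="openDataset(dataset)">{{ dataset.name }}</a>

vue-router's router-link component would achieve something similar, since it renders an anchor tag with a proper href.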

This content about SEO and JavaScript looks very applicable:

https://developers.google.com/search/docs/crawling-indexing/javascript/javascript-seo-basics#use-history-api

For single-page applications with client-side routing, use the History API to implement routing between different views of your web app. To ensure that Googlebot can parse and extract your URLs, avoid using fragments to load different page content.

function goToPage(event) {
  event.preventDefault(); // stop the browser from navigating to the destination URL.
  const hrefUrl = event.target.getAttribute('href');
  const pageToLoad = hrefUrl.slice(1); // remove the leading slash
  // note: load() is a placeholder in Google's example for the app's own content-loading logic
  document.getElementById('placeholder').innerHTML = load(pageToLoad);
  window.history.pushState({}, window.title, hrefUrl); // Update URL as well as browser history.
}

// Enable client-side routing for all links on the page
document.querySelectorAll('a').forEach(link => link.addEventListener('click', goToPage));

What I derive from the above is:

jsheunis commented 9 months ago

Insightful thread about GitHub and single-page apps and redirects: https://github.com/isaacs/github/issues/408. It might be worth using Netlify for hosting, since it allows redirect configuration and GitHub Pages doesn't.
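
For the record, Netlify's documented single-page-app setup is a _redirects file in the publish directory that rewrites every path to index.html with a 200 status, so that history-mode URLs resolve without a 404 and vue-router can take over client-side; a sketch (not tested with this catalog yet):

# _redirects file for Netlify: serve index.html for all routes and let the client-side router handle them
/*    /index.html   200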

jsheunis commented 9 months ago

Further reading suggests that:

  1. History mode is a necessity, since Google's crawlers might not even index hash-mode pages (with a # in the URL)
    • this means that page redirects are a necessity (because of the issue noted above)
    • this means that we can't use GitHub Pages to serve the app/site, because GitHub does not support page redirects (some people on the internet found a workaround by using the GitHub Pages feature of a custom 404.html page, which then contains a redirect script taking the browser to the index.html page, but Google's crawlers stopped following 404 page redirects in recent years, i.e. no good for SEO)
  2. All links should have an href property even if the main result of a user clicking the link should be the JS code that gets executed; this way, Google's crawlers can follow the links without having to execute scripts:
    • this means anchor tags should always have an :href property, and the @click or onclick routines should use preventDefault to keep the browser from following the href address, so that the JS code determines what happens when a user clicks.
    • additional note: if I'm reading the internet correctly, the value returned by the onclick handler determines whether the href address is followed by the browser (yes if true, no if false); TODO: test this.
  3. vue-meta could provide a way to update route-specific metadata (think title or any other tags that go into the HTML page head) in order to improve SEO performance: https://vue-meta.nuxtjs.org/. Could investigate whether this is a better option for adding structured data in a script tag than the current solution (a rough sketch follows after this list).
  4. Pages could have a canonical link tag to help Google identify a page as the true source of information (also included in the sketch below).
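
A rough sketch of how points 3 and 4 could look with vue-meta, assuming vue-meta 2.x and its component-level metaInfo option (the component, data fields, and URL path are placeholders; this hasn't been tried in the catalog yet):

// Hypothetical dataset page component using vue-meta's metaInfo option
export default {
  name: 'DatasetView',
  data() {
    return { dataset: { id: '', name: '', description: '' } };
  },
  metaInfo() {
    return {
      // route-specific title in the page head (point 3)
      title: this.dataset.name,
      // canonical link tag pointing at the page's "true source" URL (point 4); the path is made up
      link: [
        { rel: 'canonical', href: 'https://data.sfb1451.de/dataset/' + this.dataset.id },
      ],
      // structured data as a JSON-LD script tag, as a possible alternative to the current solution
      script: [
        {
          type: 'application/ld+json',
          json: {
            '@context': 'https://schema.org/',
            '@type': 'Dataset',
            name: this.dataset.name,
            description: this.dataset.description,
          },
        },
      ],
    };
  },
};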