sfb1451 / metadata-catalog

The SFB 1451 data portal (metadata catalog)
https://data.sfb1451.de

Index datasets with Google's Dataset Search #78

Open jsheunis opened 9 months ago

jsheunis commented 9 months ago

See: https://github.com/datalad/datalad-catalog/issues/20

jsheunis commented 9 months ago

The PR added the components that were supposedly necessary for supporting rich results and page indexing, but this didn't seem to have the desired effect. After the PR was merged, the data.sfb1451.de site was added to Google's Search Console, the site was verified (by adding a custom meta tag to the index.html page), and the sitemap (sitemap.txt) was submitted via the Search Console to allow indexing of the list of pages available at the catalog site.

As of the time of this comment, the indexing process is still in progress. Also, using the rich results test, none of the tested URLs (from the list in the sitemap, i.e. live pages on the catalog site) return a successful check for rich results, even though the pages in question do show the necessary structured data in a script tag in the head of the HTML document when inspected via dev tools in the browser.

After this became evident, a lot of debugging followed. The issues experienced suggested that adding the sitemap.txt and robots.txt files to the catalog site might have caused the failed rich results test on specific pages. Supporting this point, another staging site (jsheunis.github.io/sfb145-projects-catalog) returned a successful result on the test without having those text files committed. But two points seem to speak against this assessment:

So, some more investigation followed. I looked into the possibility that the particular framework, VueJS with Vue-router, is causing these failures in the rich results tests. There are some online comments about latency in client-side rendering possibly causing Google's crawlers not to render and index the pages on the first try, forcing them to come back at a later stage to do so (if/when they can). The latency could be a result of JavaScript execution and front-end rendering, of asynchronous data fetching calls, or of both; both are applicable in the case of the VueJS app. Perhaps the rich results tests struggle to get all the necessary information on the first try in this scenario.

More investigation showed that there are known issues with search engine optimization processes (crawling / indexing) and single-page apps such as the VueJS application of the catalog site (e.g.: https://madewithvuejs.com/blog/how-to-make-vue-js-single-page-applications-seo-friendly-a-beginner-s-guide). Suggestions to mitigate this include using "history mode" for vue-router, using server-side rendering (https://vuejs.org/guide/scaling-up/ssr.html), or prebuilding the entire application into a set of static HTML files (e.g. with Nuxt.js).
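
For reference, a minimal sketch of what switching vue-router from the default hash mode to history mode could look like, assuming vue-router 4 (vue-router 3 uses the mode: 'history' option instead); the route and component names are made up and this is not the catalog's actual router config:

import { createRouter, createWebHistory } from 'vue-router'
import DatasetView from './components/DatasetView.vue' // hypothetical component

const router = createRouter({
  // Hash mode (createWebHashHistory) yields URLs like /#/dataset/abc, which crawlers may skip.
  // History mode yields clean URLs like /dataset/abc, but then the host has to serve index.html
  // for every route (i.e. a redirect/rewrite), which GitHub Pages does not support.
  history: createWebHistory(),
  routes: [
    { path: '/dataset/:dataset_id', component: DatasetView },
  ],
})

export default router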

Some investigation into history mode brought this issue to the fore: https://stackoverflow.com/questions/65501787/vue-router-github-pages-and-custom-domain-not-working-with-routed-links. In my own local tests with history mode, I encountered the same issue of getting a 404 when the URL contains parameters that should open a component view in the browser.

Some relevant links/issues:

To be continued

jsheunis commented 9 months ago

A new day, a new surprise. Here are several recent successful tests for pages with rich results on the https://data.sfb1451.de site:

From this, I will assume that the component that loads structured data into a dataset page works for now (although it could probably still be improved with respect to VueJS rendering latency, asynchronous calls, history mode, etc.). I am, however, not sure why it works. It could be a delayed result of removing the sitemap and robots files, though I doubt that. It could be that Google's tests are erratic; I don't know. It could be some interference between the request to start indexing the site and the live tests for rich results.
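
For context, the general idea behind such a component is to inject schema.org Dataset markup as JSON-LD into the document head, which is what Google Dataset Search looks for. A minimal sketch of that idea (the field names and the helper are hypothetical, not the catalog's actual code):

// Build a schema.org Dataset object from (hypothetical) catalog metadata
function addDatasetStructuredData(dataset) {
  const structuredData = {
    '@context': 'https://schema.org/',
    '@type': 'Dataset',
    name: dataset.name,
    description: dataset.description,
    url: window.location.href,
  };
  // Inject it as a JSON-LD script tag into the page head, where the rich results test expects it
  const script = document.createElement('script');
  script.setAttribute('type', 'application/ld+json');
  script.textContent = JSON.stringify(structuredData);
  document.head.appendChild(script);
}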

Anyway, IMO the next step is to wait for the indexing to finish before we can test Google Dataset Search.

What I will work on in the meantime is figuring out whether a sitemap is necessary (or just useful) for the crawlers to index the catalog's pages, or whether an alternative is possible that doesn't require maintaining a registry of all pages in a catalog, since that would run counter to the idea of a decentralized catalog that can be contributed to without an overseeing maintainer.

jsheunis commented 9 months ago

A report from Lighthouse:

Lighthouse Report Viewer.pdf

Some screenshots:

Screenshot 2024-01-25 at 16 28 02

jsheunis commented 9 months ago

Making links crawlable: https://developers.google.com/search/docs/crawling-indexing/links-crawlable

So basically, links need an href, and many/most of the links in a catalog do not have one, since they use JavaScript onclick or VueJS @click handlers instead.
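
To illustrate, a hedged sketch of how a catalog link could be made crawlable in a Vue template (the component data, handler name, and URL scheme are made up):

<!-- Not crawlable: no href, so the crawler has no URL to follow -->
<a @click="openDataset(dataset)">{{ dataset.name }}</a>

<!-- Crawlable: href gives the crawler a real URL, while @click.prevent (i.e. preventDefault)
     keeps the browser from navigating, so the JS handler still decides what happens on click -->
<a :href="'/dataset/' + dataset.id" @click.prevent="openDataset(dataset)">{{ dataset.name }}</a>

vue-router's router-link component would achieve something similar, since it renders an anchor tag with a proper href.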

This content about SEO and JavaScript looks very applicable:

https://developers.google.com/search/docs/crawling-indexing/javascript/javascript-seo-basics#use-history-api

For single-page applications with client-side routing, use the History API to implement routing between different views of your web app. To ensure that Googlebot can parse and extract your URLs, avoid using fragments to load different page content.

function goToPage(event) {
  event.preventDefault(); // stop the browser from navigating to the destination URL.
  const hrefUrl = event.target.getAttribute('href');
  const pageToLoad = hrefUrl.slice(1); // remove the leading slash
  // note: load() is a placeholder in Google's example for the app's own content-loading logic
  document.getElementById('placeholder').innerHTML = load(pageToLoad);
  window.history.pushState({}, window.title, hrefUrl); // Update URL as well as browser history.
}

// Enable client-side routing for all links on the page
document.querySelectorAll('a').forEach(link => link.addEventListener('click', goToPage));

What I derive from the above is:

jsheunis commented 9 months ago

Insightful thread about GitHub and single-page apps and redirects: https://github.com/isaacs/github/issues/408. It might be worth using Netlify for hosting, since it allows redirect configuration and GitHub Pages doesn't.
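
For the record, Netlify's documented single-page-app setup is a _redirects file in the publish directory that rewrites every path to index.html with a 200 status, so that history-mode URLs resolve without a 404 and vue-router can take over client-side; a sketch (not tested with this catalog yet):

# _redirects file for Netlify: serve index.html for all routes and let the client-side router handle them
/*    /index.html   200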

jsheunis commented 9 months ago

Further reading suggests that:

  1. History mode is a necessity, since Google's crawlers might not even index hash-mode pages (with a # in the URL)
    • this means that page redirects are a necessity (because of the issue noted above)
    • this means that we can't use GitHub Pages to serve the app/site, because GitHub does not support page redirects (some people on the internet found a workaround by using the GitHub Pages feature of a custom 404.html page, which then contains a redirect script taking the browser to the index.html page, but Google's crawlers stopped following 404 page redirects in recent years, i.e. no good for SEO)
  2. All links should have an href property even if the main result of a user clicking the link should be the JS code that gets executed; this way, Google's crawlers can follow the links without having to execute scripts:
    • this means anchor tags should always have an :href property, and the @click or onclick routines should use preventDefault to keep the browser from following the href address, so that the JS code determines what happens when a user clicks.
    • additional note: if I'm reading the internet correctly, the value returned by the onclick handler determines whether the href address is followed by the browser (yes if true, no if false); TODO: test this.
  3. vue-meta could provide a way to update route-specific metadata (think title or any other tags that go into the HTML page head) in order to improve SEO performance: https://vue-meta.nuxtjs.org/. Could investigate whether this is a better option for adding structured data in a script tag than the current solution (a rough sketch follows after this list).
  4. Pages could have a canonical link tag to help Google identify a page as the true source of information (also included in the sketch below).
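
A rough sketch of how points 3 and 4 could look with vue-meta, assuming vue-meta 2.x and its component-level metaInfo option (the component, data fields, and URL path are placeholders; this hasn't been tried in the catalog yet):

// Hypothetical dataset page component using vue-meta's metaInfo option
export default {
  name: 'DatasetView',
  data() {
    return { dataset: { id: '', name: '', description: '' } };
  },
  metaInfo() {
    return {
      // route-specific title in the page head (point 3)
      title: this.dataset.name,
      // canonical link tag pointing at the page's "true source" URL (point 4); the path is made up
      link: [
        { rel: 'canonical', href: 'https://data.sfb1451.de/dataset/' + this.dataset.id },
      ],
      // structured data as a JSON-LD script tag, as a possible alternative to the current solution
      script: [
        {
          type: 'application/ld+json',
          json: {
            '@context': 'https://schema.org/',
            '@type': 'Dataset',
            name: this.dataset.name,
            description: this.dataset.description,
          },
        },
      ],
    };
  },
};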