jsheunis opened 9 months ago
The PR added the components that were supposedly necessary for supporting rich results and page indexing, but this didn't seem to have the desired effect. After the merged PR, the data.sfb1451.de site was added to google's search console, the site was verified (by adding a custom meta tag to the `index.html` page), and the sitemap (`sitemap.txt`) was submitted via the search console to allow indexing of the list of pages available at the catalog site.
As of the time of this comment, the indexing process is still busy. Also, using the rich results test, none of the tested URLs (from the list in the sitemap, i.e. live pages on the catalog site) seem to return a successful check for rich results, even though the pages in question do show the necessary structured data in a `script` tag in the `head` of the html document, when inspecting via dev tools in the browser.
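For context, the structured data in question is a schema.org `Dataset` description embedded as JSON-LD, which is what google's rich results test and dataset search look for. A minimal sketch of what such a component effectively does is shown below; the function name and dataset fields are illustrative, not the catalog's actual code:

```javascript
// Illustrative sketch: inject a schema.org Dataset description as JSON-LD
// into the document head, where the rich results test expects to find it.
function addDatasetStructuredData(dataset) {
  const jsonLd = {
    '@context': 'https://schema.org/',
    '@type': 'Dataset',
    name: dataset.name,
    description: dataset.description,
    url: window.location.href,
  };
  const script = document.createElement('script');
  script.type = 'application/ld+json';
  script.text = JSON.stringify(jsonLd);
  document.head.appendChild(script);
}
```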
After this became evident, there followed a lot of debugging. The experienced issues suggested that adding the `sitemap.txt` file and the `robots.txt` file to the catalog site might have caused the failed rich results test on specific pages. In support of this point, another staging site (jsheunis.github.io/sfb145-projects-catalog) returned a successful result on the test without having the text files committed. But two points seem to speak against this assessment:
- the `sitemap.txt` file of the catalog site submitted via the search console was processed successfully without errors, and all URLs were recognised (94 of them), i.e. it doesn't seem to be an issue with the sitemap file itself (or with the `txt` format as opposed to the more ubiquitous `xml` format)
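For reference, the plain-text sitemap format is simply one fully qualified URL per line (the paths below are placeholders, not the catalog's actual URLs):

```text
https://data.sfb1451.de/
https://data.sfb1451.de/<path-to-dataset-page-1>
https://data.sfb1451.de/<path-to-dataset-page-2>
```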
So, some more investigation followed. I looked into the possibility of the particular framework, VueJS and vue-router, causing these failures in the rich results tests. There are some online comments about latency in client-side rendering possibly causing google's crawlers not to render and index the pages on first try, so that they have to come back at a later stage to do so (if/when they can). The latency could be a result of javascript execution and front-end rendering, or asynchronous data fetching calls, or both. Both are applicable in the case of the vuejs app. Perhaps the rich results tests struggle to get all necessary information on first try in this scenario.
More investigation showed that there are known issues with search-engine-optimization processes (crawlers / indexing) and single-page apps such as the VueJS application of the catalog site (e.g. https://madewithvuejs.com/blog/how-to-make-vue-js-single-page-applications-seo-friendly-a-beginner-s-guide). Suggestions to mitigate this include using "history mode" for vue-router, using server-side rendering (https://vuejs.org/guide/scaling-up/ssr.html), or prebuilding the entire application into a set of static html files (using e.g. nuxt.js).
Some investigation into history mode brought this issue to the fore: https://stackoverflow.com/questions/65501787/vue-router-github-pages-and-custom-domain-not-working-with-routed-links. In my own local tests with history mode, I encountered the same issue of getting a 404 when the url contains parameters that should open a component view in the browser.
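For reference, switching vue-router to history mode is itself a small change; the sketch below assumes the vue-router 3.x API and an invented route, so the catalog's actual setup may differ. The 404s happen because github pages has no server-side rewrite that serves `index.html` for such deep links:

```javascript
// Hedged sketch: enabling history mode in vue-router 3.x so routes are not behind a '#' fragment.
// Without a server-side rewrite to index.html, deep links then 404 on github pages.
import Vue from 'vue';
import VueRouter from 'vue-router';

Vue.use(VueRouter);

const router = new VueRouter({
  mode: 'history', // default is 'hash'
  routes: [
    // { path: '/dataset/:id/:version', component: DatasetView }, // illustrative route only
  ],
});

export default router;
```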
Some relevant links/issues:
To be continued
A new day, a new surprise. Here are several recent successful tests for pages with rich results on the https://data.sfb1451.de site:
From this, I will assume that the component that loads structured data into a dataset page works for now (although it could probably still be improved wrt vuejs rendering latency, asynchronous calls, history mode, etc.). I am, however, not sure why it works. It could be a delayed result of removing the sitemap and robots files, though I doubt that. It could be that google's tests are erratic; I don't know. It could be some interference between the request to start indexing the site and the live tests for rich results?
Anyway, IMO the next step is to wait for the indexing to finish before we can test the google dataset search.
What I will work on in the meantime is figuring out whether a sitemap is necessary (or just useful) for the crawlers to index the catalog's pages, or whether an alternative is possible that doesn't require a registry of all pages in a catalog to be maintained, since that would be counter to the idea of a decentralized catalog that can be contributed to without an overseeing maintainer.
Making links crawlable: https://developers.google.com/search/docs/crawling-indexing/links-crawlable

So basically they need an `href`, and many/most of the links in a catalog do not have one since they use javascript `onclick` or vuejs `@click`.
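As an illustration (the component method and dataset properties here are invented for the sketch), the difference is between a click-only element and an anchor that also carries a real `href`, with `@click.prevent` stopping the default navigation so the JS still decides what happens:

```html
<!-- Not crawlable: no href, navigation happens only via javascript -->
<a @click="openDataset(dataset)">{{ dataset.name }}</a>

<!-- Crawlable: a real href for the crawler to follow, while @click.prevent
     keeps the browser from navigating and lets the vue handler take over -->
<a :href="'/dataset/' + dataset.id" @click.prevent="openDataset(dataset)">
  {{ dataset.name }}
</a>
```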
This content about SEO and Javascript looks very applicable:
> For single-page applications with client-side routing, use the History API to implement routing between different views of your web app. To ensure that Googlebot can parse and extract your URLs, avoid using fragments to load different page content.
```javascript
function goToPage(event) {
  event.preventDefault(); // stop the browser from navigating to the destination URL.
  const hrefUrl = event.target.getAttribute('href');
  const pageToLoad = hrefUrl.slice(1); // remove the leading slash
  document.getElementById('placeholder').innerHTML = load(pageToLoad);
  window.history.pushState({}, window.title, hrefUrl); // Update URL as well as browser history.
}

// Enable client-side routing for all links on the page
document.querySelectorAll('a').forEach(link => link.addEventListener('click', goToPage));
```
What I derive from the above is:

- links need an `href` property, and the `onclick` event can use `preventDefault` and then do whatever it needs to do.

Insightful thread about github and single-page-apps and redirects: https://github.com/isaacs/github/issues/408. Might be worth using netlify for hosting since it allows redirect configuration, and github doesn't.
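For what it's worth, the netlify option would amount to a single rewrite rule in a `_redirects` file at the published root, so every path is served by `index.html` with a 200 status instead of relying on github pages' 404 workaround:

```text
# _redirects (netlify): serve index.html for all paths so the SPA router can take over
/*    /index.html   200
```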
Further reading suggests that:

- hash mode is not ideal for SEO (because of the `#` in the URL)
- the usual workaround for history mode on github pages is a custom `404.html` page, which then contains a redirect script taking the browser to the `index.html` page (but google's crawlers stopped following 404 page redirects in recent years, i.e. no good for SEO)
- links should have an `href` property even if the main result of a user clicking the link should be the JS code that gets executed; in this way, google's crawlers will be able to follow the links without having to execute scripts:
  - in vuejs this means the link gets an `:href` property, and the `@click` or `onclick` routines should use `preventDefault` to prevent the browser from following the `href` address, so that the JS code can be the determining factor for what happens when a user clicks.
  - the return value of the `onclick` event determines whether the `href` address is followed by the browser (yes if `true`, no if not); TODO: test this.
- `vue-meta` could provide a way to update route-specific metadata (think `title` or any other tags that go into the html page `head`) in order to improve SEO performance: https://vue-meta.nuxtjs.org/. Could investigate whether this is a better option for adding structured data in a `script` tag than the current solution.
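A rough sketch of what the `vue-meta` route-specific metadata could look like for a dataset page component (assuming vue-meta 2.x, whose documented entry point is the `metaInfo` option; the field names and the JSON-LD-via-`script` entry are assumptions to be verified against the docs):

```javascript
// Hedged sketch of a dataset page component using vue-meta 2.x for route-specific metadata.
// Component and field names are illustrative, not the catalog's actual code.
export default {
  name: 'DatasetView',
  data() {
    return {
      dataset: { name: '', description: '' }, // filled by an async fetch in practice
    };
  },
  metaInfo() {
    return {
      title: this.dataset.name,
      meta: [{ name: 'description', content: this.dataset.description }],
      // vue-meta can also render script tags; this would be one way to emit the
      // structured data per route (assumption: the 'json' option is supported).
      script: [
        {
          type: 'application/ld+json',
          json: {
            '@context': 'https://schema.org/',
            '@type': 'Dataset',
            name: this.dataset.name,
            description: this.dataset.description,
          },
        },
      ],
    };
  },
};
```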
See: https://github.com/datalad/datalad-catalog/issues/20