grammyjs / website

The grammY documentation website.
https://grammy.dev
MIT License

Algolia crawling fails #855

Closed by KnorpelSenf 11 months ago

KnorpelSenf commented 1 year ago

There was a blocking error and the crawler has been paused. Please resolve it before moving forward. Without intervention, the crawl will be discarded and retried on the next schedule (2 attempts remaining).

IMG_20230704_193726_301

Related: https://www.algolia.com/doc/tools/crawler/apis/configuration/

quadratz commented 1 year ago

Can you show me your crawler configuration? Don't forget to hide the API key.

*And for the sake of the meme:

AgADGgIAAkiWcEU.jpg

KnorpelSenf commented 1 year ago

We didn't configure it; Algolia hosts it for us. They used to store the configuration in https://github.com/algolia/docsearch-configs/blob/master/configs/grammy.json and https://github.com/algolia/docsearch-configs/blob/master/deployed-configs/g/grammy.js but then they migrated their infrastructure to … something very different (I never really tried to understand it) and now it's stored somewhere on their servers.

I can try to dig through the dashboards in some time and see if I find anything that's related to this issue. Either way, the config has never been changed.

quadratz commented 1 year ago

https://github.com/algolia/docsearch-configs/blob/master/deployed-configs/g/grammy.js

Ah, I think we found the culprit. They are looking for a VuePress class which doesn't exist anymore. My cm-grammy.netlify.dev just got approval for their crawling program. I will try to run some experiments tonight and give you the fixed config ASAP (hopefully).
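A quick way to confirm this kind of mismatch is to check whether the class names a crawler config selects actually occur in the rendered HTML. The sketch below is only an illustration: `collectClassNames`, `missingClasses`, and the sample HTML are made up, and `theme-default-content` is assumed to be the content wrapper class of the VuePress default theme. The other class names are taken from the VitePress selectors used in the config in this thread.

```js
// Hypothetical helper: collects every class name found in an HTML string.
// A regex is good enough for a quick check; it is not a full HTML parser.
function collectClassNames(html) {
  const names = new Set();
  const attrPattern = /class\s*=\s*"([^"]*)"/g;
  for (const match of html.matchAll(attrPattern)) {
    for (const name of match[1].split(/\s+/).filter(Boolean)) {
      names.add(name);
    }
  }
  return names;
}

// Returns the class names a config relies on that the page does not contain.
function missingClasses(html, requiredClasses) {
  const present = collectClassNames(html);
  return requiredClasses.filter((name) => !present.has(name));
}

// Made-up fragment in the style of a VitePress page.
const sampleHtml = `
  <aside><div class="VPSidebarItem is-active"><p class="text">Guide</p></div></aside>
  <main><div class="content"><h1>Title</h1><p>Body</p></div></main>
`;

// Assumed VuePress class: reported as missing on the VitePress-style page.
console.log(missingClasses(sampleHtml, ["theme-default-content"]));
// Classes from the VitePress selectors: nothing is missing.
console.log(missingClasses(sampleHtml, ["content", "VPSidebarItem", "text"]));
```

If the first call reports a missing class, the record extractor would have nothing to select on those pages.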

KnorpelSenf commented 1 year ago

Don't bother. I got time to look into this. I found https://docsearch.algolia.com/docs/templates/#vitepress-template and will fix it now.

quadratz commented 1 year ago

Nice. Pretty outdated though. For example, lvl0 should select the active nav link in the sidebar. Still better than nothing.

KnorpelSenf commented 1 year ago

Yep, it works pretty poorly. I'm still investigating.

KnorpelSenf commented 1 year ago

The old index for the VuePress site has this many records:

image

The new config and the VitePress site only have this many:

image

So for some reason it does not find all the content. I am not sure why.
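One way to put a number on a drop like this is the percentage of lost records, which is what the crawler's `maxLostRecordsPercentage` safety check is based on. A minimal sketch, assuming the check blocks publishing when the loss is strictly greater than the threshold; the function names and the record counts are made up, since the real counts are only in the screenshots.

```js
// Percentage of records lost between the previous index and the new one.
function lostRecordsPercentage(previousCount, newCount) {
  if (previousCount <= 0) return 0; // nothing can be lost on a first crawl
  const lost = Math.max(0, previousCount - newCount);
  return (lost / previousCount) * 100;
}

// True when the drop is larger than the allowed percentage.
function exceedsLossThreshold(previousCount, newCount, maxLostPercentage) {
  return lostRecordsPercentage(previousCount, newCount) > maxLostPercentage;
}

// Hypothetical counts, not the ones from the screenshots.
console.log(lostRecordsPercentage(2000, 1500)); // 25
console.log(exceedsLossThreshold(2000, 1500, 10)); // true
console.log(exceedsLossThreshold(2000, 1900, 10)); // false
```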

quadratz commented 1 year ago

Pretty much the same:

image

Either we failed to index some information or the new index is more optimized. However, when comparing the search results with the VuePress site, the outcome is the same or perhaps even better, with more results.

VuePress: https://github.com/grammyjs/website/pull/833#issuecomment-1609944061

VitePress: https://cm-grammy.netlify.app

Screenshot

![Vitepress](https://github.com/grammyjs/website/assets/74030149/f301f1fb-39d1-4f6d-8e52-7b20f3c38e5a)

![Vuepress](https://github.com/grammyjs/website/assets/74030149/04f0eee0-fc91-4850-937a-758a1aaefcc5)

Config

```js
new Crawler({
  appId: "1FFMAU2VMZ",
  apiKey: "xxxxxx",
  rateLimit: 8,
  maxDepth: 10,
  startUrls: ["https://cm-grammy.netlify.app"],
  renderJavaScript: false,
  sitemaps: ["https://cm-grammy.netlify.app/sitemap.xml"],
  ignoreCanonicalTo: false,
  discoveryPatterns: ["https://cm-grammy.netlify.app/**"],
  actions: [
    {
      indexName: "grammy",
      pathsToMatch: ["https://cm-grammy.netlify.app/**"],
      recordExtractor: ({ helpers }) => {
        return helpers.docsearch({
          recordProps: {
            content: ".content p, .content li",
            lvl0: {
              selectors: ".VPSidebarItem.is-active .text",
              defaultValue: "Documentation",
            },
            lvl1: ".content h1",
            lvl2: ".content h2",
            lvl3: ".content h3",
            lvl4: ".content h4",
            lvl5: ".content h5",
            lvl6: ".content h6",
          },
          indexHeadings: true,
          aggregateContent: true,
          recordVersion: "v3",
        });
      },
    },
  ],
  safetyChecks: { beforeIndexPublishing: { maxLostRecordsPercentage: 10 } },
  initialIndexSettings: {
    grammy: {
      attributesForFaceting: ["type", "lang"],
      attributesToRetrieve: [
        "hierarchy",
        "content",
        "anchor",
        "url",
        "url_without_anchor",
        "type",
      ],
      attributesToHighlight: ["hierarchy", "content"],
      attributesToSnippet: ["content:10"],
      camelCaseAttributes: ["hierarchy", "content"],
      searchableAttributes: [
        "unordered(hierarchy.lvl0)",
        "unordered(hierarchy.lvl1)",
        "unordered(hierarchy.lvl2)",
        "unordered(hierarchy.lvl3)",
        "unordered(hierarchy.lvl4)",
        "unordered(hierarchy.lvl5)",
        "unordered(hierarchy.lvl6)",
        "content",
      ],
      distinct: true,
      attributeForDistinct: "url",
      customRanking: [
        "desc(weight.pageRank)",
        "desc(weight.level)",
        "asc(weight.position)",
      ],
      ranking: [
        "words",
        "filters",
        "typo",
        "attribute",
        "proximity",
        "exact",
        "custom",
      ],
      highlightPreTag: '',
      highlightPostTag: "",
      minWordSizefor1Typo: 3,
      minWordSizefor2Typos: 7,
      allowTyposOnNumericTokens: false,
      minProximity: 1,
      ignorePlurals: true,
      advancedSyntax: true,
      attributeCriteriaComputedByMinProximity: true,
      removeWordsIfNoResults: "allOptional",
    },
  },
});
```

KnorpelSenf commented 1 year ago

Let's just go ahead then. If we find ways to improve the search in the future, we can still implement them.

This change isn't going to disrupt anything, so it shouldn't be blocking us. I will take care of updating the crawler config and index tomorrow.

KnorpelSenf commented 1 year ago

See https://github.com/vuejs/vitepress/issues/2592#issuecomment-1627642497

KnorpelSenf commented 11 months ago

Fixed.