typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Incomplete indexing of a large Docusaurus site #62

Closed KevinMArtio closed 3 months ago

KevinMArtio commented 3 months ago

Description

I'm having a problem indexing my Docusaurus site with the Typesense scraper. The site has around 900 pages split across 2 versions (450 pages per version). None of the pages at hierarchical levels 2 and 3 are being indexed; pages at levels 0 and 1 are indexed correctly.

I ran a test with the same site but with 90% of the pages removed, and all the pages were indexed correctly, including those at levels 2 and 3.

The scraper runs in Docker locally and connects to a Typesense Docker container on a private cloud (Jelastic).
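For reference, the scraper is started with something along these lines (a sketch: the image tag and the `.env` contents are assumed, following the scraper's documented setup):

```bash
# Typesense connection details live in .env
# (TYPESENSE_HOST, TYPESENSE_PORT, TYPESENSE_PROTOCOL, TYPESENSE_API_KEY).
# The crawl config is passed in through the CONFIG environment variable.
docker run -it \
  --env-file=.env \
  -e "CONFIG=$(cat config.json | jq -r tostring)" \
  typesense/docsearch-scraper:0.9.1
```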

The scraper's nb_hits for the complete version is ~3000. The scraper's nb_hits for the 'lite' version is ~6000 (??).

Is there a limit anywhere that could explain this behavior?

Steps to reproduce

It's quite complicated to reproduce the problem, as it requires a large number of pages, and unfortunately I can't share the contents of the documentation for legal reasons. What I did, as described above, was remove the majority of the pages while keeping some of the pages that weren't being indexed, and those pages were then indexed correctly.

Expected Behavior

The content of every page should come up when searching for a term present in that content.


Actual Behavior

The content of pages from level 2 downwards is not indexed.


Metadata

Typesense Version: 0.25.2
Scraper Version: 0.9.1

OS: Windows 11

jasonbosco commented 3 months ago

Could you share your docsearch-scraper config JSON?

KevinMArtio commented 3 months ago

Here it is:

```json
{
  "index_name": "docusaurus-example",
  "start_urls": [
    "https://docs.example.org/"
  ],
  "sitemap_urls": [
    "https://docs.example.org/sitemap.xml"
  ],
  "sitemap_alternate_links": true,
  "stop_urls": [
    "https://docs.example.org/changelog",
    "https://docs.example.org/blog"
  ],
  "allowed_domains": [
    "example.org"
  ],
  "js_render": true,
  "js_wait": 5,
  "use_anchors": true,
  "user_agent": "Typesense DocSearch Scraper",
  "selectors": {
    "default": {
      "lvl0": {
        "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
        "type": "xpath",
        "global": true,
        "default_value": "Documentation"
      },
      "lvl1": "article h1, header h1",
      "lvl2": "article h2",
      "lvl3": "article h3",
      "lvl4": "article h4",
      "lvl5": "article h5, article td:first-child",
      "lvl6": "article h6",
      "text": "article p, article li, article td:last-child"
    }
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "conversation_id": [
    "833762294"
  ],
  "nb_hits": 3053
}
```

jasonbosco commented 3 months ago

The scraper settings look fine.

Another thing to check is whether the Typesense node has sufficient RAM to index all the pages. You can check current resource usage by doing a GET on /stats.json on the Typesense node.
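For example (the host, port, and API key below are placeholders):

```bash
# /stats.json requires the admin API key; replace the host and port
# with those of your Typesense node.
curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  "https://your-typesense-host:8108/stats.json"
```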

Any OOM errors should show up in the scraper logs though...

If it's not an OOM issue on the Typesense side, another thing to check is whether there's enough memory / compute on the node that's running the scraper.

KevinMArtio commented 3 months ago

Thank you very much for pointing me in the right direction. There was no memory problem at the server level, but there was one in the Chrome instance used by Selenium. Here's an extract from the logs:

```
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=500 | data={"value":{"error":"unknown error","message":"unknown error: session deleted because of page crash\nfrom unknown error: cannot determine loading status\nfrom tab crashed\n  (Session info: headless chrome=113.0.5672.126)","stacktrace":"---removed for clarity---"}} | headers=HTTPHeaderDict({'Content-Length': '1067', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
```

I was able to solve the problem by adding the argument --shm-size="2g" to the docker run command.
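For anyone hitting the same crash, the full command then looks roughly like this (same sketch as above; Docker's default /dev/shm of 64 MB is too small for headless Chrome on large sites):

```bash
# --shm-size raises /dev/shm from Docker's 64 MB default so that
# headless Chrome has enough shared memory while rendering pages.
docker run -it \
  --shm-size=2g \
  --env-file=.env \
  -e "CONFIG=$(cat config.json | jq -r tostring)" \
  typesense/docsearch-scraper:0.9.1
```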

I now have another problem: there are duplicate or even triplicate results on the /search page. I think it's linked to the Docusaurus plugin and not to indexing. Should I open an issue on the docusaurus-theme-search-typesense repo? It's clearly visible that 3 queries are sent when typing in the search field.


jasonbosco commented 3 months ago

Thank you for documenting the solution.

re: the other issue, yes - could you open a separate issue in the docusaurus theme repo?