typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Allow passing custom collection options #33

Closed marcospassos closed 1 year ago

marcospassos commented 1 year ago

Description

Algolia allows you to pass custom settings through the custom_settings option in docsearch.config.json: https://github.com/algolia/docsearch-configs/blob/master/configs/docusaurus-2.json#L29-L30

Actual Behavior

To specify a custom configuration like token_separators or symbols_to_index, I'd have to fork both the scraper and the GH action to make it work.

Expected Behavior

I expected Typesense to provide the same flexibility as Algolia, allowing me to pass any collection configuration.

jasonbosco commented 1 year ago

@marcospassos I've published typesense/docsearch-scraper:0.5.0 which adds support for setting custom token_separators and symbols_to_index.

You should now be able to do something like this in the scraper config:

{
  "index_name": "typesense_docs",
  "start_urls": [
    {
      "url": "https://typesense.org/docs/(?P<version>.*?)/",
      "variables": {
        "version": [
          "0.21.0"
        ]
      }
    }
  ],
  "selectors": {
    "default": {
      "lvl0": ".content__default h1",
      "lvl1": ".content__default h2",
      "lvl2": ".content__default h3",
      "lvl3": ".content__default h4",
      "lvl4": ".content__default h5",
      "text": ".content__default p, .content__default ul li, .content__default table tbody tr"
    }
  },
  "custom_settings": {
    "token_separators": ["_"],
    "symbols_to_index": ["*"]
  }
}

I've also changed the default token separators to ['_', '-'].
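As a rough illustration of how these two settings and the new defaults could combine, here is a minimal Python sketch. The helper name collection_options and the fallback logic are assumptions made for illustration only, not the scraper's actual code; the empty default for symbols_to_index is likewise assumed.

```python
import json

# Default token separators mentioned in this thread for version 0.5.0.
DEFAULT_TOKEN_SEPARATORS = ["_", "-"]


def collection_options(config: dict) -> dict:
    """Hypothetical helper: derive collection-level options from a scraper config.

    Values under "custom_settings" take precedence; otherwise fall back to
    the defaults. The empty default for symbols_to_index is an assumption.
    """
    custom = config.get("custom_settings", {})
    return {
        "token_separators": custom.get("token_separators", DEFAULT_TOKEN_SEPARATORS),
        "symbols_to_index": custom.get("symbols_to_index", []),
    }


# A trimmed-down version of the config above (valid JSON: no comments,
# no trailing commas).
config = json.loads("""
{
  "index_name": "typesense_docs",
  "custom_settings": {
    "token_separators": ["_"],
    "symbols_to_index": ["*"]
  }
}
""")

options = collection_options(config)
print(options)
```

With no custom_settings block in the config, this sketch would fall back to the ['_', '-'] separators.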

Could you give it a shot and let me know?