typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Cannot set locale:zh when using docsearch-scraper #12

Open justin5267 opened 2 years ago

justin5267 commented 2 years ago

I am using the docsearch scraper to index my website. To have Chinese text segmented automatically, I need to set locale:zh on the content field.

First, I tried adding locale:zh to the docsearch scraper config file, but it has no effect: the created collection still shows an empty locale on every field.

{
  "index_name": "docs2",
  "start_urls": ["https://www.diglaws.com/"],
  "sitemap_urls": ["https://www.diglaws.com/sitemap.xml"],
  "selectors": {
    "lvl0": {
      "selector": "#article_title",
      "global": true
    },
    "lvl1": "#article_content h1",
    "lvl2": "#article_content h2",
    "lvl3": "#article_content h3",
    "lvl4": "#article_content h4",
    "lvl5": "#article_content h5",
    "lvl6": "#article_content h6",
    "text": {
      "selector": "#article_content p, #article_content li, #article_content blockquote",
      "locale": "zh"
    }
  }
}
>>> client.collections['docs2'].retrieve()
{'created_at': 1653898837, 'default_sorting_field': 'item_priority', 'fields': [{'facet': False, 'index': True, 'infix': False, 'locale': '', 'name': 'anchor', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': False, 'index': True, 'infix': False, 'locale': '', 'name': 'content', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'url', 'optional': False, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'version', 'optional': True, 'sort': False, 'type': 'string[]'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'hierarchy.lvl0', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'hierarchy.lvl1', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'hierarchy.lvl2', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'hierarchy.lvl3', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'hierarchy.lvl4', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'hierarchy.lvl5', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'hierarchy.lvl6', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': '.*_tag', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'language', 'optional': True, 'sort': False, 'type': 'string'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'tags', 'optional': True, 'sort': False, 'type': 'string[]'}, {'facet': False, 
'index': True, 'infix': False, 'locale': '', 'name': 'item_priority', 'optional': False, 'sort': True, 'type': 'int64'}, {'facet': True, 'index': True, 'infix': False, 'locale': '', 'name': 'locale_tag', 'optional': True, 'sort': False, 'type': 'string'}], 'name': 'docs2_1653898837', 'num_documents': 54668, 'symbols_to_index': [], 'token_separators': []}
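
One way to confirm whether a locale was applied is to scan the `fields` list in the `retrieve()` response. This is a minimal sketch that assumes the response has the shape shown above; `fields_missing_locale` is a hypothetical helper name, not part of the Typesense client.

```python
def fields_missing_locale(schema, expected="zh", names=("content",)):
    """Return the listed field names whose locale is not `expected`.

    `schema` is the dict returned by client.collections[...].retrieve().
    A field that is absent from the schema is also reported.
    """
    by_name = {field["name"]: field for field in schema.get("fields", [])}
    return [
        name
        for name in names
        if by_name.get(name, {}).get("locale", "") != expected
    ]


# Trimmed-down copy of the response above, for illustration only.
sample = {
    "name": "docs2_1653898837",
    "fields": [
        {"name": "anchor", "locale": "", "type": "string"},
        {"name": "content", "locale": "", "type": "string"},
    ],
}
print(fields_missing_locale(sample))
```

With the schema above, `content` would be reported, which matches the empty `'locale': ''` entries in the output.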

Then I tried adding a meta tag to the pages, which did not work either: <meta name="docsearch:locale_tag" content="zh" />

Finally, I tried to update the field's definition, but that is not supported. Typesense currently does not support in-place updates to a field's definition once it is added to the schema.

I hope a locale option can be added to the docsearch scraper config file, so that setting locale:zh on a specific selector, or globally, makes the scraper create the corresponding fields with that locale.

justin5267 commented 2 years ago

In addition, I tried exporting the collection, creating a new collection with a manually defined schema, and importing the same jsonl file into it, but the import failed. The code and the resulting error:

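# `client` is an already-configured typesense.Client instance.
schema = {
    "name": "docs6",
    "fields": [
        {"name": ".*", "type": "auto", "locale": "zh"},
    ],
}
client.collections.create(schema)

# Read the export as UTF-8 explicitly; the platform default encoding
# (for example on Windows) may not decode Chinese text correctly.
with open('0530.jsonl', encoding='utf-8') as jsonl_file:
    client.collections['docs6'].documents.import_(jsonl_file.read().encode('utf-8'), {'action': 'create'})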

{"code":400,"document":"{\\"content\\":\\"敬请期待!\\",\\"content_camel\\":\\"敬请期待!\\",\\"hierarchy\\":{\\"lvl0\\":null,\\"lvl1\\":null,\\"lvl2\\":null,\\"lvl3\\":null,\\"lvl4\\":null,\\"lvl5\\":null,\\"lvl6\\":null},\\"hierarchy_camel\\":[{\\"lvl0\\":null,\\"lvl1\\":null,\\"lvl2\\":null,\\"lvl3\\":null,\\"lvl4\\":null,\\"lvl5\\":null,\\"lvl6\\":null}],\\"hierarchy_radio\\":{\\"lvl0\\":null,\\"lvl1\\":null,\\"lvl2\\":null,\\"lvl3\\":null,\\"lvl4\\":null,\\"lvl5\\":null,\\"lvl6\\":null},\\"hierarchy_radio_camel\\":{\\"lvl0\\":null,\\"lvl1\\":null,\\"lvl2\\":null,\\"lvl3\\":null,\\"lvl4\\":null,\\"lvl5\\":null,\\"lvl6\\":null},\\"id\\":\\"4135\\",\\"item_priority\\":0,\\"no_variables\\":true,\\"objectID\\":\\"24f11103459d1ea33a3b2feac731300fb8973cc0\\",\\"tags\\":[],\\"type\\":\\"content\\",\\"url\\":\\"https://www.diglaws.com/civil_law/index.html\\",\\"url_without_anchor\\":\\"https://www.diglaws.com/civil_law/index.html\\",\\"url_without_variables\\":\\"https://www.diglaws.com/civil_law/index.html\\",\\"weight\\":{\\"level\\":0,\\"page_rank\\":0,\\"position\\":0}}","error":"Type of field `hierarchy_camel` is invalid.","success":false}'
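
The error names `hierarchy_camel`, which in the exported document is an array of objects. One possible workaround, assuming nested values are what the `auto` type rejects here (a guess from the error message, not something confirmed in this thread), is to flatten or drop such fields before re-importing. `flatten_doc` and `flatten_jsonl` are hypothetical helpers, and the result has not been tested against a Typesense server.

```python
import json


def flatten_doc(doc, prefix=""):
    """Flatten nested objects into dotted keys and drop arrays of objects.

    Null leaves are omitted so that no field is sent with a null value.
    """
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # {"hierarchy": {"lvl1": "a"}} becomes {"hierarchy.lvl1": "a"}
            flat.update(flatten_doc(value, prefix=f"{name}."))
        elif isinstance(value, list) and any(isinstance(v, dict) for v in value):
            # Arrays of objects (such as hierarchy_camel) are dropped.
            continue
        elif value is not None:
            flat[name] = value
    return flat


def flatten_jsonl(jsonl_text):
    """Apply flatten_doc to every non-empty line of a JSONL export."""
    lines = []
    for line in jsonl_text.splitlines():
        if line.strip():
            flat = flatten_doc(json.loads(line))
            lines.append(json.dumps(flat, ensure_ascii=False))
    return "\n".join(lines)
```

The flattened text could then be passed to `documents.import_` in place of the raw file contents. Dropping `hierarchy_camel` loses that field entirely, so this only makes sense if the search UI does not depend on it.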
justin5267 commented 2 years ago

I modified typesense_helper.py and added locale:zh to the content and hierarchy fields. Chinese characters are now segmented as expected.

self.typesense_client.collections.create({
            'name': self.collection_name_tmp,
            'fields': [
                {'name': 'anchor', 'type': 'string', 'optional': True},
                {'name': 'content', 'type': 'string', "locale": "zh", 'optional': True},
                {'name': 'url', 'type': 'string', 'facet': True},
                {'name': 'version', 'type': 'string[]', 'facet': True, 'optional': True},
                {'name': 'hierarchy.lvl0', 'type': 'string', "locale": "zh", 'facet': True, 'optional': True},
                {'name': 'hierarchy.lvl1', 'type': 'string', "locale": "zh", 'facet': True, 'optional': True},
                {'name': 'hierarchy.lvl2', 'type': 'string', "locale": "zh", 'facet': True, 'optional': True},
                {'name': 'hierarchy.lvl3', 'type': 'string', "locale": "zh", 'facet': True, 'optional': True},
                {'name': 'hierarchy.lvl4', 'type': 'string', "locale": "zh", 'facet': True, 'optional': True},
                {'name': 'hierarchy.lvl5', 'type': 'string', "locale": "zh", 'facet': True, 'optional': True},
                {'name': 'hierarchy.lvl6', 'type': 'string', "locale": "zh", 'facet': True, 'optional': True},
                {'name': '.*_tag', 'type': 'string', 'facet': True, 'optional': True},
                {'name': 'language', 'type': 'string', 'facet': True, 'optional': True},
                {'name': 'tags', 'type': 'string[]', 'facet': True, 'optional': True},
                {'name': 'item_priority', 'type': 'int64'},
            ],
            'default_sorting_field': 'item_priority'
        })
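
The hard-coded change above could be generalized so the locale comes from a setting instead of being written into each field. The following is a sketch only: `build_fields` and its `locale` parameter are hypothetical, and the scraper config has no such option in the version discussed in this thread.

```python
def build_fields(locale=None):
    """Build the same field list as above, applying `locale` to text fields.

    When `locale` is None or empty, no locale key is added, which leaves
    the server default in place.
    """
    extra = {"locale": locale} if locale else {}
    fields = [
        {"name": "anchor", "type": "string", "optional": True},
        {"name": "content", "type": "string", "optional": True, **extra},
        {"name": "url", "type": "string", "facet": True},
        {"name": "version", "type": "string[]", "facet": True, "optional": True},
    ]
    # hierarchy.lvl0 through hierarchy.lvl6 share one definition.
    for level in range(7):
        fields.append({
            "name": f"hierarchy.lvl{level}",
            "type": "string",
            "facet": True,
            "optional": True,
            **extra,
        })
    fields += [
        {"name": ".*_tag", "type": "string", "facet": True, "optional": True},
        {"name": "language", "type": "string", "facet": True, "optional": True},
        {"name": "tags", "type": "string[]", "facet": True, "optional": True},
        {"name": "item_priority", "type": "int64"},
    ]
    return fields
```

The result of `build_fields("zh")` could be passed as the `'fields'` value in the `collections.create` call above, with the locale read from wherever the configuration is loaded.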

I am not sure whether the problem is fully solved, because the following error is displayed during the run, and I don't know whether it matters.

>DocSearch: http://www.diglaws.com/civil_procedure_law/A2-2.html 27 records)
2022-05-31 00:24:50 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.diglaws.com/civil_procedure_law/A2-2.html> (referer: None)
Traceback (most recent call last):
  File "C:\Users\Justin\AppData\Local\Programs\Python\Python310\lib\site-packages\twisted\internet\defer.py", line 857, in _runCallbacks
    current.result = callback( # type: ignore[misc]
  File "C:\Users\Justin\test_site\utility\typesense-docsearch-scraper-master\cli\..\scraper\src\documentation_spider.py", line 182, in parse_from_start_url
    return self.parse(response)
  File "C:\Users\Justin\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spiders\__init__.py", line 70, in parse
    raise NotImplementedError(f'{self.__class__.__name__}.parse callback is not defined')
NotImplementedError: DocumentationSpider.parse callback is not defined