Open justin5267 opened 2 years ago
In addition, I tried exporting the collection, manually setting the schema, and then importing the same JSONL file, but the import failed with the error shown after this code:
schema = {
    "name": "docs6",
    "fields": [
        {"name": ".*", "type": "auto", "locale": "zh"},
    ]
}
client.collections.create(schema)
with open('0530.jsonl') as jsonl_file:
    client.collections['docs6'].documents.import_(jsonl_file.read().encode('utf-8'), {'action': 'create'})
{"code":400,"document":"{\\"content\\":\\"敬请期待!\\",\\"content_camel\\":\\"敬请期待!\\",\\"hierarchy\\":{\\"lvl0\\":null,\\"lvl1\\":null,\\"lvl2\\":null,\\"lvl3\\":null,\\"lvl4\\":null,\\"lvl5\\":null,\\"lvl6\\":null},\\"hierarchy_camel\\":[{\\"lvl0\\":null,\\"lvl1\\":null,\\"lvl2\\":null,\\"lvl3\\":null,\\"lvl4\\":null,\\"lvl5\\":null,\\"lvl6\\":null}],\\"hierarchy_radio\\":{\\"lvl0\\":null,\\"lvl1\\":null,\\"lvl2\\":null,\\"lvl3\\":null,\\"lvl4\\":null,\\"lvl5\\":null,\\"lvl6\\":null},\\"hierarchy_radio_camel\\":{\\"lvl0\\":null,\\"lvl1\\":null,\\"lvl2\\":null,\\"lvl3\\":null,\\"lvl4\\":null,\\"lvl5\\":null,\\"lvl6\\":null},\\"id\\":\\"4135\\",\\"item_priority\\":0,\\"no_variables\\":true,\\"objectID\\":\\"24f11103459d1ea33a3b2feac731300fb8973cc0\\",\\"tags\\":[],\\"type\\":\\"content\\",\\"url\\":\\"https://www.diglaws.com/civil_law/index.html\\",\\"url_without_anchor\\":\\"https://www.diglaws.com/civil_law/index.html\\",\\"url_without_variables\\":\\"https://www.diglaws.com/civil_law/index.html\\",\\"weight\\":{\\"level\\":0,\\"page_rank\\":0,\\"position\\":0}}","error":"Type of field `hierarchy_camel` is invalid.","success":false}'
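The error names hierarchy_camel, which in the rejected document is an array containing an object, unlike the other hierarchy fields. The thread does not confirm that this is the cause, but a check like the one below can list such fields in an exported file before importing. It is only a sketch: the function name find_object_array_fields is made up for illustration, and it relies on nothing beyond the standard library.

```python
import json


def find_object_array_fields(jsonl_text):
    """Return the sorted top-level field names whose value is a list
    that contains at least one JSON object (dict)."""
    flagged = set()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        doc = json.loads(line)
        for name, value in doc.items():
            if isinstance(value, list) and any(isinstance(v, dict) for v in value):
                flagged.add(name)
    return sorted(flagged)
```

Running it over the contents of 0530.jsonl would show whether hierarchy_camel is the only field of this shape, which may help decide whether to drop or flatten it before importing.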
I modified typesense_helper.py and added "locale": "zh" to some of the fields. Now Chinese characters are segmented as expected.
self.typesense_client.collections.create({
    'name': self.collection_name_tmp,
    'fields': [
        {'name': 'anchor', 'type': 'string', 'optional': True},
        {'name': 'content', 'type': 'string', 'locale': 'zh', 'optional': True},
        {'name': 'url', 'type': 'string', 'facet': True},
        {'name': 'version', 'type': 'string[]', 'facet': True, 'optional': True},
        {'name': 'hierarchy.lvl0', 'type': 'string', 'locale': 'zh', 'facet': True, 'optional': True},
        {'name': 'hierarchy.lvl1', 'type': 'string', 'locale': 'zh', 'facet': True, 'optional': True},
        {'name': 'hierarchy.lvl2', 'type': 'string', 'locale': 'zh', 'facet': True, 'optional': True},
        {'name': 'hierarchy.lvl3', 'type': 'string', 'locale': 'zh', 'facet': True, 'optional': True},
        {'name': 'hierarchy.lvl4', 'type': 'string', 'locale': 'zh', 'facet': True, 'optional': True},
        {'name': 'hierarchy.lvl5', 'type': 'string', 'locale': 'zh', 'facet': True, 'optional': True},
        {'name': 'hierarchy.lvl6', 'type': 'string', 'locale': 'zh', 'facet': True, 'optional': True},
        {'name': '.*_tag', 'type': 'string', 'facet': True, 'optional': True},
        {'name': 'language', 'type': 'string', 'facet': True, 'optional': True},
        {'name': 'tags', 'type': 'string[]', 'facet': True, 'optional': True},
        {'name': 'item_priority', 'type': 'int64'},
    ],
    'default_sorting_field': 'item_priority'
})
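The field list above repeats the same definition for hierarchy.lvl0 through hierarchy.lvl6, so the locale has to be edited in eight places. As a rough sketch, the same list could be built by a small helper that takes the locale as a parameter. The name build_fields and its arguments are hypothetical and are not part of the scraper; the field definitions themselves are copied from the schema above.

```python
def build_fields(locale=None, levels=7):
    """Build the field list from the schema above, adding `locale`
    to the content and hierarchy fields only when one is given."""

    def text_field(name, **extra):
        # Optional string field; carries the locale if one was requested.
        field = {'name': name, 'type': 'string', 'optional': True}
        if locale:
            field['locale'] = locale
        field.update(extra)
        return field

    fields = [
        {'name': 'anchor', 'type': 'string', 'optional': True},
        text_field('content'),
        {'name': 'url', 'type': 'string', 'facet': True},
        {'name': 'version', 'type': 'string[]', 'facet': True, 'optional': True},
    ]
    # hierarchy.lvl0 .. hierarchy.lvl6 share one definition.
    fields += [text_field(f'hierarchy.lvl{i}', facet=True) for i in range(levels)]
    fields += [
        {'name': '.*_tag', 'type': 'string', 'facet': True, 'optional': True},
        {'name': 'language', 'type': 'string', 'facet': True, 'optional': True},
        {'name': 'tags', 'type': 'string[]', 'facet': True, 'optional': True},
        {'name': 'item_priority', 'type': 'int64'},
    ]
    return fields
```

With this, the create call would pass 'fields': build_fields('zh'), and calling build_fields() with no argument gives the same list without any locale.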
I am not sure whether the problem is fully solved, because the following error is displayed while the scraper runs, and I don't know if it matters.
>DocSearch: http://www.diglaws.com/civil_procedure_law/A2-2.html 27 records)
2022-05-31 00:24:50 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.diglaws.com/civil_procedure_law/A2-2.html> (referer: None)
Traceback (most recent call last):
  File "C:\Users\Justin\AppData\Local\Programs\Python\Python310\lib\site-packages\twisted\internet\defer.py", line 857, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "C:\Users\Justin\test_site\utility\typesense-docsearch-scraper-master\cli\..\scraper\src\documentation_spider.py", line 182, in parse_from_start_url
    return self.parse(response)
  File "C:\Users\Justin\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spiders\__init__.py", line 70, in parse
    raise NotImplementedError(f'{self.__class__.__name__}.parse callback is not defined')
NotImplementedError: DocumentationSpider.parse callback is not defined
I am using the docsearch scraper to index my website. To segment Chinese characters automatically, I need to add locale:zh to the content field. First, I tried adding locale:zh to the docsearch scraper's config file, but it didn't work. Then I tried adding a tag to the page metadata, and that didn't work either:
<meta name="docsearch:locale_tag" content="zh" />
Finally, I tried to update the field's definition, but that is not supported:
Typesense currently does not support in-place updates to a field's definition once it is added to the schema.
I hope the docsearch scraper's config file could get a locale option, so that after setting locale:zh for a specific selector, or globally, the fields generated by the scraper automatically carry that definition.
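To make the request concrete, here is one way such an option might be resolved into field definitions. Everything in it is an assumption for illustration: the scraper has no locale config key as far as this thread shows, and the names resolve_locale and apply_locale are invented. The sketch accepts either a single string (a global locale) or a mapping from field name to locale (per-field), and only touches string-typed fields.

```python
def resolve_locale(config, field_name):
    """Look up the locale for one field from a hypothetical 'locale' key.
    A string applies globally; a dict maps field names to locales."""
    setting = config.get('locale')
    if isinstance(setting, str):
        return setting
    if isinstance(setting, dict):
        return setting.get(field_name)
    return None


def apply_locale(config, fields):
    """Return copies of the field definitions with 'locale' added
    where the config asks for one. Non-string fields are left alone."""
    result = []
    for field in fields:
        updated = dict(field)  # do not mutate the caller's definitions
        locale = resolve_locale(config, field['name'])
        if locale and field.get('type') in ('string', 'string[]'):
            updated['locale'] = locale
        result.append(updated)
    return result
```

For example, a config containing "locale": {"content": "zh"} would add the locale to the content field only, while "locale": "zh" would add it to every string field.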