typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Running scraper gives: "error: 'Field' `version` must be a string" #1

Closed by saneef 3 years ago

saneef commented 3 years ago

Description

I'm trying to crawl a local website and create an index. Here is the config I'm using, but the scraper fails with 'error': 'Field `version` must be a string.'. Am I missing anything in the config?

config.json:

{
  "index_name": "test-local-dev-site",
  "start_urls": ["http://192.168.1.100/solutions/"],
  "stop_urls": [],
  "selectors": {
    "lvl0": {
      "selector": ".page-header__nav ul li a[data-state=active]",
      "default_value": "Home",
      "global": true
    },
    "lvl1": {
      "selector": "article h1",
      "global": true
    },
    "lvl2": {
      "selector": "article h2",
      "global": true
    },
    "lvl3": {
      "selector": "article h3",
      "global": true
    },
    "lvl4": {
      "selector": "article h4",
      "global": true
    },
    "text": "article p, article li"
  }
}

Environment variables:

TYPESENSE_API_KEY=the-generated-api-kay
TYPESENSE_HOST=192.168.1.100
TYPESENSE_PORT=8108
TYPESENSE_PROTOCOL=http
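
For readers reproducing this setup, here is a minimal sketch of how the four variables above could be turned into a connection configuration in the shape the Typesense Python client accepts. The function name `typesense_client_config`, the fallback defaults, and the timeout value are assumptions for illustration; the scraper's own code may read these variables differently.

```python
import os


def typesense_client_config(env=os.environ):
    """Build a client configuration dict from TYPESENSE_* variables.

    Hypothetical helper: the defaults and the timeout are assumptions,
    not values taken from the scraper.
    """
    return {
        "api_key": env["TYPESENSE_API_KEY"],  # required, no default
        "nodes": [
            {
                "host": env["TYPESENSE_HOST"],  # required, no default
                "port": env.get("TYPESENSE_PORT", "8108"),
                "protocol": env.get("TYPESENSE_PROTOCOL", "http"),
            }
        ],
        "connection_timeout_seconds": 2,  # assumed value
    }


if __name__ == "__main__":
    example_env = {
        "TYPESENSE_API_KEY": "the-generated-api-key",
        "TYPESENSE_HOST": "192.168.1.100",
        "TYPESENSE_PORT": "8108",
        "TYPESENSE_PROTOCOL": "http",
    }
    print(typesense_client_config(example_env))
```

The resulting dict could then be passed to `typesense.Client(...)` if the `typesense` package is installed.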

Steps to reproduce

When running the scraper, I'm getting 'error': 'Field `version` must be a string.'. Here is the longer log:

DEBUG:urllib3.connectionpool:http://192.168.1.100:8108 "POST /collections/test-local-dev-site_1626455860/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:192.168.1.100:8108 is healthy. Status code: 200
[{'code': 400, 'document': '{"content": "Our users solve problems with yet-another-db in fields ranging from predictive risk systems in investment banks and fraud detection via graph analytics to authorization for IoT data and temporal queries across financial transactions on enterprise blockchains", "hierarchy": {"lvl0": "Solutions", "lvl1": "yet-another-db in Production", "lvl2": null, "lvl3": null, "lvl4": null, "lvl5": null, "lvl6": null}, "hierarchy_radio": {"lvl4": null, "lvl3": null, "lvl2": null, "lvl1": null, "lvl0": null}, "type": "content", "tags": [], "weight": {"page_rank": 0, "level": 0, "position": 0}, "url": "http://192.168.1.100/solutions/", "url_without_variables": "http://192.168.1.100/solutions/", "hierarchy_camel": [{"lvl0": "Solutions", "lvl1": "yet-another-db in Production", "lvl2": null, "lvl3": null, "lvl4": null, "lvl5": null, "lvl6": null}], "hierarchy_radio_camel": {"lvl4": null, "lvl3": null, "lvl2": null, "lvl1": null, "lvl0": null}, "content_camel": "Our users solve problems with yet-another-db in fields ranging from predictive risk systems in investment banks and fraud detection via graph analytics to authorization for IoT data and temporal queries across financial transactions on enterprise blockchains", "language": "en", "version": ["1.0.0", "latest"], "url_without_anchor": "http://192.168.1.100/solutions/", "no_variables": true, "objectID": "ec4386a15bc9dc3e8268e2dc90c9608e7bba6f28", "item_priority": 0, "hierarchy.lvl0": "Solutions", "hierarchy.lvl1": "yet-another-db in Production"}', 'error': 'Field `version` must be a string.', 'success': False}]
ERROR:scrapy.core.scraper:Spider error processing <GET http://192.168.1.100/solutions/> (referer: None)
Traceback (most recent call last):
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/twisted/internet/defer.py", line 662, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/root/src/documentation_spider.py", line 177, in parse_from_start_url
    self.add_records(response, from_sitemap=False)
  File "/root/src/documentation_spider.py", line 149, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/root/src/typesense_helper.py", line 65, in add_records
    raise Exception
Exception
2021-07-16 17:17:40 [scrapy.core.scraper] ERROR: Spider error processing <GET http://192.168.1.100/solutions/> (referer: None)
Traceback (most recent call last):
  File "/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/site-packages/twisted/internet/defer.py", line 662, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/root/src/documentation_spider.py", line 177, in parse_from_start_url
    self.add_records(response, from_sitemap=False)
  File "/root/src/documentation_spider.py", line 149, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/root/src/typesense_helper.py", line 65, in add_records
    raise Exception
Exception
INFO:scrapy.core.engine:Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats:
{'downloader/request_bytes': 216,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 2294,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.235181,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 7, 16, 17, 17, 40, 481790),
 'log_count/ERROR': 1,
 'memusage/max': 62435328,
 'memusage/startup': 62435328,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/Exception': 1,
 'start_time': datetime.datetime(2021, 7, 16, 17, 17, 40, 246609)}
INFO:scrapy.core.engine:Spider closed (finished)

Crawling issue: nbHits 0 for test-local-dev-site
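
The rejected document in the log carries `version` as a list (`["1.0.0", "latest"]`), while the error says the field must be a string. A minimal sketch of that mismatch follows, assuming the collection schema declared `version` with the Typesense type `string`. The schema fragment and the helper `type_mismatches` are hypothetical and are not taken from the scraper's source.

```python
import json

# Hypothetical schema fragment: assumes `version` was declared as a plain string.
ASSUMED_SCHEMA = {"version": "string"}


def type_mismatches(document, schema):
    """Return (field, declared_type, actual_python_type) for fields that do not fit."""
    problems = []
    for field, declared in schema.items():
        value = document.get(field)
        if value is None:
            continue  # missing or null fields are not checked in this sketch
        if declared == "string":
            ok = isinstance(value, str)
        elif declared == "string[]":
            ok = isinstance(value, list) and all(isinstance(v, str) for v in value)
        else:
            ok = True  # other Typesense types are out of scope here
        if not ok:
            problems.append((field, declared, type(value).__name__))
    return problems


if __name__ == "__main__":
    # Only the relevant fields of the rejected document are reproduced.
    doc = json.loads('{"version": ["1.0.0", "latest"], "language": "en"}')
    print(type_mismatches(doc, ASSUMED_SCHEMA))           # [('version', 'string', 'list')]
    print(type_mismatches(doc, {"version": "string[]"}))  # []
```

Under this assumption, either declaring the field as `string[]` or sending a single string would make the document and the schema agree. Which of the two the actual fix chose is not stated in this thread.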

Expected Behavior

Indexing to succeed.

Actual Behavior

Indexing fails.
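
The traceback ends in a bare `raise Exception`, so the actual cause is visible only in the import result printed just before it. As a sketch of how such a result list could be summarised into a readable error, assuming each item is a dict shaped like the one in the log (`success`, `error`, `code`, `document`): the names `ImportFailed`, `failed_imports` and `check_import` are hypothetical and do not exist in the scraper.

```python
class ImportFailed(Exception):
    """Hypothetical exception carrying a summary of rejected documents."""


def failed_imports(results):
    """Collect the error message of every item that was not imported."""
    return [
        item.get("error", "unknown error")
        for item in results
        if not item.get("success", False)
    ]


def check_import(results):
    """Raise ImportFailed with a short summary if any document was rejected."""
    errors = failed_imports(results)
    if errors:
        raise ImportFailed(
            f"{len(errors)} of {len(results)} documents rejected; first error: {errors[0]}"
        )


if __name__ == "__main__":
    # Result shaped like the one in the log, with the document body elided.
    results = [
        {"code": 400, "error": "Field `version` must be a string.", "success": False},
        {"success": True},
    ]
    try:
        check_import(results)
    except ImportFailed as exc:
        print(exc)
```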

Metadata

Typesense Version: 0.21.0

OS: macOS 11.4

jasonbosco commented 3 years ago

@saneef I just pushed out a fix for this. Could you run `docker pull typesense/docsearch-scraper` and then try again?

saneef commented 3 years ago

@jasonbosco The fix works! Thanks a lot for the quick fix.