typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Content of some links cannot be crawled #65

Open simmonn opened 3 months ago

simmonn commented 3 months ago

Description

Hi, I've run into a problem. After running the scraper, I found that the content of some links is not crawled: the logs show 0 records for those pages. I have tried many approaches, but the content still isn't crawled.

Here is a snapshot of the logs: image

Steps to reproduce

Here is part of my config:

{
  "index_name": "docs",
  "sitemap_urls": [
    "https://mydomain/sitemap.xml"
  ],
  "start_urls": [
    {
      "url": "https://mydomain/guides",
      "tags": [
        "guides"
      ],
      "selectors_key": "guides"
    }
  ],
  "stop_urls": [],
  "selectors": {
    "default": {
      "lvl0": {
        "selector": "",
        "global": true,
        "default_value": "文档"
      },
      "lvl1": "article h1",
      "lvl2": "article h2",
      "lvl3": "article h3",
      "lvl4": "article h4",
      "lvl5": "article h5, article th, article td:first-child",
      "lvl6": "article h6",
      "text": "article p, article li, article td"
    },
    "guides": {
      "lvl0": {
        "selector": "",
        "global": true,
        "default_value": "开发指南"
      },
      "lvl1": "article h1",
      "lvl2": "article h2",
      "lvl3": "article h3",
      "lvl4": "article h4",
      "lvl5": "article h5, article th, article td:first-child",
      "lvl6": "article h6",
      "text": "article p, article li, article td"
    }
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag",
      "tags"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "nb_hits": 2227
}

Expected Behavior

I expect the content of all the links in the configuration to be crawled into Typesense.

Actual Behavior

The content of the affected links cannot be searched.

image

Metadata

Typesense Version: possibly 0.24; I don't know how to check the version

OS: x86_64 GNU/Linux

jasonbosco commented 3 months ago

Could you make sure the HTML selectors exist on that page?
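
One way to check this outside the browser is to run the selectors against the raw HTML the server returns, which can differ from the DOM shown in Chrome if the page is built with JavaScript. Below is a minimal sketch using only the Python standard library. It handles only simple "article <tag>" selectors like those in the config above, and `ArticleTagCounter`, `count_article_tags`, and `SAMPLE_HTML` are hypothetical names used for illustration, not part of the scraper.

```python
from collections import Counter
from html.parser import HTMLParser


class ArticleTagCounter(HTMLParser):
    """Count tags that appear anywhere inside an <article> element."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0  # greater than 0 while inside <article>
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif self.article_depth > 0:
            self.counts[tag] += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1


def count_article_tags(html, tags):
    """Return how many times each tag occurs inside <article>."""
    parser = ArticleTagCounter()
    parser.feed(html)
    parser.close()
    return {tag: parser.counts[tag] for tag in tags}


# Stand-in for the HTML of a page that shows 0 records (assumption:
# in practice this would be the page source fetched without a browser).
SAMPLE_HTML = """
<html><body>
  <h1>Outside</h1>
  <article><h1>Guide</h1><h2>Setup</h2><p>Install it.</p><p>Run it.</p></article>
</body></html>
"""

result = count_article_tags(SAMPLE_HTML, ["h1", "h2", "h3", "p"])
print(result)
```

A count of 0 for every tag on a failing page would suggest that the raw HTML has no matching elements, even if the selectors match in the browser.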

Also, could you make sure that the base URL of those links is specified in the start_urls section?
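
As a rough sketch of that second check, the snippet below lists the URLs that are not covered by any start_urls entry. It assumes a plain prefix match, which may be simpler than how the scraper actually matches URLs, and `uncovered_urls` and `candidate_urls` are hypothetical names with made-up example paths.

```python
def uncovered_urls(config, urls):
    """Return the URLs that do not start with any configured start URL."""
    prefixes = []
    for entry in config.get("start_urls", []):
        # Entries may be plain strings or objects with a "url" key.
        prefixes.append(entry if isinstance(entry, str) else entry["url"])
    return [u for u in urls if not any(u.startswith(p) for p in prefixes)]


# start_urls copied from the config in this issue.
config = {
    "start_urls": [
        {
            "url": "https://mydomain/guides",
            "tags": ["guides"],
            "selectors_key": "guides",
        }
    ]
}

# Hypothetical URLs, e.g. taken from the sitemap or the scraper logs.
candidate_urls = [
    "https://mydomain/guides/intro",
    "https://mydomain/api/reference",
]

missing = uncovered_urls(config, candidate_urls)
print(missing)
```

Any URL printed here has no matching base URL in start_urls under the prefix-match assumption.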

simmonn commented 2 months ago

> Could you make sure the html selectors exist on that page?
>
> Also, could you make sure that the base url of those links are specified in start_urls section?

Yes, both are configured. The selectors match elements when I test them with XPath expressions in the Chrome console. I also tried using BeautifulSoup to compress the HTML source code, which solves the problem, but I'm not sure what the root cause is. Here is the code: image