typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Scraper not crawling Antora site #50

wanderanimrod closed this issue 9 months ago

wanderanimrod commented 9 months ago

Description

I followed these instructions to add Typesense to a simple Antora site. The crawler only visits the pages in the crawler config's start_urls list and does not follow links on those pages (it isn't actually crawling, it is just scraping the start pages).

Steps to reproduce

  1. Build a simple Antora site (the instructions are too long to reproduce here). The site should have at least one link on the home page pointing to another page on the same site.
  2. Serve it with python -m http.server 3000
  3. Configure the scraper with this configuration:
    {
      "index_name": "my-index",
      "start_urls": [
        "http://host.docker.internal:3000/my-component/current/index.html"
      ],
      "selectors": {
        "lvl0": "h1",
        "lvl1": "h2",
        "lvl2": "h3",
        "lvl3": "h4",
        "lvl4": "h5",
        "lvl5": "h6",
        "text": "p,span,a"
      }
    }

Expected Behavior

  1. The crawler should visit the first page, my-component/current/index.html, scrape it, and then also visit and scrape the page linked from the home page.
  2. You should see more than one log line starting with > DocSearch: and each page should have at least one record.

Actual Behavior

  1. You only see one > DocSearch: log line.
  2. When you search for any text on the second page (not the home page), you don't get any results.
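A quick way to tell "scraping only the start page" from "actually crawling" is to count the log lines with the > DocSearch: prefix. The snippet below is a minimal sketch of that check; the function name and the sample log text are hypothetical placeholders, and only the > DocSearch: prefix comes from the behavior described above.

```python
def count_docsearch_lines(log_text: str) -> int:
    """Count log lines that start with the '> DocSearch:' prefix."""
    return sum(
        1
        for line in log_text.splitlines()
        if line.lstrip().startswith("> DocSearch:")
    )


# Placeholder log text (not real scraper output); only the prefix is taken
# from the issue description.
sample_log = """\
> DocSearch: <first page>
some unrelated line
"""

pages_logged = count_docsearch_lines(sample_log)
if pages_logged <= 1:
    print("Only the start page was scraped; links were not followed.")
else:
    print(f"{pages_logged} pages were scraped.")
```

A count of 1 with a single start URL matches the actual behavior reported here; a count above 1 matches the expected behavior.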

Metadata

Typesense Version: 0.25.1

OS: MacOS Sonoma 14.0

wanderanimrod commented 9 months ago

I managed to get it working, finally!!

I ran the scraper against the official antora.org docs and it worked, which meant that the problem was not the page or the markup, but my local setup.

I was making two mistakes:

  1. Using a port in my start_urls. The scraper apparently does not support URLs with ports. So, instead of http://host.docker.internal:3000, you should use http://host.docker.internal. This means that you need to serve your site on port 80 on localhost. I think the crawler interprets port-less relative hrefs as being on a different domain from the start_url. The DocSearch docs explicitly say it doesn't crawl links to different domains.
  2. Using index.html in the start_urls. Instead of http://host.docker.internal/my-component/current/index.html, I should have been using http://host.docker.internal/my-component/current.
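The guess in item 1 can be illustrated with Python's standard urllib.parse. This is only a sketch of one plausible mechanism, not a confirmed description of the scraper's internals: it assumes the allowed domain is derived from the start URL with the port kept, while links are checked by hostname only. The is_same_site helper and the other-page.html link are hypothetical.

```python
from urllib.parse import urljoin, urlparse

start_url = "http://host.docker.internal:3000/my-component/current/index.html"

# Assumption: the allowed "domain" is the start URL's netloc, which keeps the port.
allowed_with_port = urlparse(start_url).netloc      # host.docker.internal:3000
allowed_without_port = urlparse(start_url).hostname  # host.docker.internal

# A relative href on the start page (hypothetical page name), resolved
# against the start URL.
link = urljoin(start_url, "other-page.html")


def is_same_site(url: str, allowed_domain: str) -> bool:
    """Assumed check: compare the link's hostname (port dropped) to the allowed domain."""
    return urlparse(url).hostname == allowed_domain


print(is_same_site(link, allowed_with_port))     # False: the link looks off-site
print(is_same_site(link, allowed_without_port))  # True: the link is followed
```

Under these assumptions, every link on the page is treated as belonging to a different domain as soon as the start URL carries a port, which would match the symptom of only the start page being scraped.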

These two fixes got the crawler working!
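The two fixes can be sketched as a small helper that rewrites a start URL before it goes into the config. The normalize_start_url name is hypothetical and not part of the scraper; it simply drops the port and a trailing index.html using the standard library.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_start_url(url: str) -> str:
    """Drop the port and a trailing /index.html from a start URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname excludes the port
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("/index.html")]
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


print(normalize_start_url(
    "http://host.docker.internal:3000/my-component/current/index.html"
))
# http://host.docker.internal/my-component/current
```

Rewriting the URL is only half of the fix: the site itself still has to be served on port 80 for the port-less URL to resolve.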

wanderanimrod commented 9 months ago

I struggled with this for so long. There was nothing about it in Google search results, so I thought I would help future explorers by opening an issue here and documenting the solution.

I apologize if this is not the way things are done here.