Closed wanderanimrod closed 9 months ago
I managed to get it working, finally!!
I ran the scraper on the official antora.org docs and it worked, which meant the problem was not the page or the markup, but my local setup.
I was making two mistakes:
1. `start_urls`. The scraper apparently does not support URLs with ports. So, instead of `http://host.docker.internal:3000`, you should use `http://host.docker.internal`. This means that you need to run your site on port 80 on localhost. I think the crawler interprets port-less relative `href`s as being on a different domain from the `start_url`. The docsearch docs explicitly say it doesn't crawl links to different domains.
2. `index.html` in the `start_urls`. Instead of `http://host.docker.internal/my-component/current/index.html`, I should have been using `http://host.docker.internal/my-component/current`.

These two fixes got the crawler working!
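For anyone who wants to sanity-check their own `start_urls` before running the crawler, here is a minimal sketch of the two fixes in Python. The helper name `normalize_start_url` is hypothetical and is not part of the scraper; it only rewrites the URL string, so the site itself still has to be served on port 80 for the result to be reachable.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_start_url(url: str) -> str:
    """Hypothetical helper: drop an explicit port and a trailing index.html."""
    parts = urlsplit(url)
    # .hostname is the host without the port (any user:password@ part is dropped too).
    host = parts.hostname or ""
    path = parts.path
    # Mistake 2: point at the directory, not at its index.html.
    if path.endswith("/index.html"):
        path = path[: -len("/index.html")]
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


if __name__ == "__main__":
    for url in (
        "http://host.docker.internal:3000",
        "http://host.docker.internal/my-component/current/index.html",
    ):
        print(url, "->", normalize_start_url(url))
```

This is only an illustration of the URL shapes that worked for me, not a description of how the scraper parses URLs internally.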
I struggled with this for so long. There was nothing about it in Google search results, so I thought I would help future explorers by opening an issue here and documenting the solution.
I apologize if this is not the way things are done here.
Description
I followed these instructions to add Typesense to a simple Antora site. The crawler was only visiting the pages in the crawler config's `start_urls` list, but wasn't following links on those pages (it wasn't actually crawling, it was just scraping the start pages).
Steps to reproduce
python -m http.server 3000
Expected Behavior
The crawler should visit `my-component/current/index.html`, scrape it, and then also visit the page linked from the home page and scrape it.
`> DocSearch:` and each page should have at least 1 record.
Actual Behavior
`> DocSearch:` log line.
Metadata
Typesense Version: 0.25.1
OS: macOS Sonoma 14.0