typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Use Port in `start_urls` #42

Open JasonWhall opened 1 year ago

JasonWhall commented 1 year ago

Description

We have a site, set up in the scraper config, that is hosted on a non-standard HTTP port (3000). When start_urls is set to a hostname with a port, e.g. http://my-host:3000/, the scraper fails with an error message suggesting it does not accept domains with ports. The old Algolia scraper configs appear to have supported ports, so I assume this is related to an update to the Scrapy package used in this fork.

Steps to reproduce

Expected Behavior

Actual Behavior

Error returned from scraper:

PortWarning: allowed_domains accepts only domains without ports. Ignoring entry localhost:3000 in allowed_domains.
  warnings.warn(message, PortWarning)
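
For illustration, here is a minimal sketch of how port-free entries for allowed_domains could be derived from start_urls, since the warning says only domains without ports are accepted. The helper name domains_without_ports is hypothetical and is not part of the scraper or of Scrapy; whether the scraper config accepts such a list alongside start_urls with ports has not been verified here.

```python
from urllib.parse import urlsplit


def domains_without_ports(start_urls):
    """Return the unique hostnames found in start_urls, with any port removed.

    Hypothetical helper for illustration only.
    """
    domains = []
    for url in start_urls:
        # .hostname is the host part only, so "my-host:3000" becomes "my-host"
        host = urlsplit(url).hostname
        if host and host not in domains:
            domains.append(host)
    return domains


start_urls = ["http://my-host:3000/", "http://localhost:3000/docs/"]
print(domains_without_ports(start_urls))  # ['my-host', 'localhost']
```

The start URLs keep their port; only the domain entries are stripped down to the hostname form that the warning asks for.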

Metadata

Typesense Version:

Docker images:

OS: Linux

jasonbosco commented 1 year ago

typesense-docsearch-scraper has all the commits from algolia-docsearch-scraper up to Dec 22, 2020. I don't see any updates in the algolia scraper since then where this port limitation was addressed...

Also, I still see that error message about ports not being allowed in allowed_domains in the master branch of Scrapy, so this limitation still exists as of today.

So I'm surprised to see a config in the docsearch scraper configs repo with a port number!

noghartt commented 11 months ago

Any update on this? I'm facing the same issue, and I don't understand whether I'm able to test Typesense locally.