typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Sitemap found but not crawled #22

Closed yves-v closed 1 year ago

yves-v commented 1 year ago

My sitemap is not being crawled. I expected to see a few records for the sitemap pages in the output pasted below, but nothing seems to be happening. The URLs in "start_urls" do work, and those pages come up in my searches.

DEBUG:scrapy.core.engine:Crawled (200) <GET http://host.docker.internal:3000/docs/frontend/intro> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET http://host.docker.internal:3000/sitemap.xml> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET http://host.docker.internal:3000/docs/api/intro> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET http://host.docker.internal:3000/> (referer: None)
DEBUG:typesense.api_call:Making post /collections/docusaurus-2_1674212290/documents/import
DEBUG:typesense.api_call:Try 1 to node host.docker.internal:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): host.docker.internal:8108
DEBUG:urllib3.connectionpool:http://host.docker.internal:8108 "POST /collections/docusaurus-2_1674212290/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:host.docker.internal:8108 is healthy. Status code: 200
> DocSearch: http://host.docker.internal:3000/docs/frontend/intro 18 records)
DEBUG:typesense.api_call:Making post /collections/docusaurus-2_1674212290/documents/import
DEBUG:typesense.api_call:Try 1 to node host.docker.internal:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): host.docker.internal:8108
DEBUG:urllib3.connectionpool:http://host.docker.internal:8108 "POST /collections/docusaurus-2_1674212290/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:host.docker.internal:8108 is healthy. Status code: 200
> DocSearch: http://host.docker.internal:3000/docs/api/intro 18 records)
DEBUG:typesense.api_call:Making post /collections/docusaurus-2_1674212290/documents/import
DEBUG:typesense.api_call:Try 1 to node host.docker.internal:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): host.docker.internal:8108
DEBUG:urllib3.connectionpool:http://host.docker.internal:8108 "POST /collections/docusaurus-2_1674212290/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:host.docker.internal:8108 is healthy. Status code: 200
> DocSearch: http://host.docker.internal:3000/ 1 records)
DEBUG:scrapy.spidermiddlewares.offsite:Filtered offsite request to 'host.docker.internal': <GET http://host.docker.internal:3000/docs/frontend/intro>
INFO:scrapy.core.engine:Closing spider (finished)
yves-v commented 1 year ago

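I found a solution: adding "allowed_domains": ["host.docker.internal"] to the config file made it work. The hint was the "Filtered offsite request to 'host.docker.internal'" line near the end of the log above: the sitemap was fetched, but the requests for the pages it listed were presumably being dropped as offsite.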
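
For anyone hitting the same thing, here is a minimal sketch of that fix expressed as a small Python helper that patches a scraper config. The config shown is hypothetical: the original config file was not posted, so "index_name" is guessed from the collection name in the log and the URLs are taken from the crawled requests. The helper name add_allowed_domains is likewise made up for illustration. It only derives the bare hostnames (without the port) from "start_urls" and "sitemap_urls" and writes them into "allowed_domains", which is the key the fix above adds by hand.

```python
import json
from urllib.parse import urlsplit


def add_allowed_domains(config: dict) -> dict:
    """Return a copy of `config` with "allowed_domains" set to the
    hostnames (port stripped) found in start_urls and sitemap_urls."""
    hosts = []
    for entry in config.get("start_urls", []) + config.get("sitemap_urls", []):
        # start_urls entries may be plain strings or objects with a "url" key.
        url = entry["url"] if isinstance(entry, dict) else entry
        host = urlsplit(url).hostname  # e.g. "host.docker.internal", no ":3000"
        if host and host not in hosts:
            hosts.append(host)
    return {**config, "allowed_domains": hosts}


# Hypothetical config, reconstructed from the log in this issue (an assumption).
config = {
    "index_name": "docusaurus-2",
    "start_urls": [
        "http://host.docker.internal:3000/",
        "http://host.docker.internal:3000/docs/frontend/intro",
        "http://host.docker.internal:3000/docs/api/intro",
    ],
    "sitemap_urls": ["http://host.docker.internal:3000/sitemap.xml"],
}

patched = add_allowed_domains(config)
print(json.dumps(patched, indent=2))
```

The printed JSON can be saved as the config file passed to the scraper. Adding the "allowed_domains" line to the existing file by hand, as described above, is equivalent and simpler when there is only one host.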