Crawler configured by Datasources UI only crawls Startpage, although option "Crawl full domain..."

opensemanticsearch / open-semantic-search

Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)

https://opensemanticsearch.org

GNU General Public License v3.0

Crawler configured by Datasources UI only crawls Startpage, although option "Crawl full domain..." #454

Open rafkamonday opened 1 year ago

rafkamonday commented 1 year ago

Hello,

I am trying to crawl a website (full domain), but the crawler never gets past the start page. In the Datasources UI I tried http and https, with and without www, and with and without a trailing slash. It never works. I would expect the crawler to follow the links found on the start page. I have no idea why it does not work as expected.

(The whole installation was done on Debian bullseye with the "one command" method documented at https://opensemanticsearch.org/doc/admin/install/search_server/ )

Tiberius1313 commented 1 year ago

On some pages it worked fine for me, but then I ran into the same problem you described with https://hudoc.echr.coe.int. To see whether there are similarities in the structure, it might help to name your pages.
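
One way to compare the structure of affected pages might be to check which same-domain links actually appear in the static HTML of the start page, since a link-following crawler can only follow what it finds there. Below is a minimal sketch in Python using only the standard library. The function name `same_domain_links` and the sample HTML are hypothetical, and this is not how the Open Semantic Search crawler itself is implemented; it is only a diagnostic aid.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def _host(url):
    """Lowercased hostname with a leading 'www.' removed."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def same_domain_links(html, base_url):
    """Return absolute http(s) links in `html` that stay on the domain of `base_url`."""
    collector = _LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        is_web = urlparse(absolute).scheme in ("http", "https")
        if is_web and _host(absolute) == _host(base_url) and absolute not in links:
            links.append(absolute)
    return links


# Hypothetical start page, for illustration only
sample = (
    '<a href="/about">About</a> '
    '<a href="https://other.example/x">External</a> '
    '<a href="mailto:a@b.c">Mail</a>'
)
print(same_domain_links(sample, "https://www.example.org/"))
```

If this returns an empty list for the HTML of a start page fetched without JavaScript, the links are probably generated client-side, which would be one possible explanation for a crawl stopping at the start page.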

fractalvision commented 1 year ago

Seconding this; the issue still persists.