Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)
I try to crawl a webpage (full domain) but never will be crawled more than the startpage. In the Datasources UI I tried http and https, with www and without, with trailing slash and without. It never works. I would expect that the crawler will follow the links found in the startpage. I have no idea why it does not work as expected.
on some pages it worked fine for me. but then I run into the same as you described with https://hudoc.echr.coe.int . to see if there are similarities in the structure it might be helpful to name your pages.
Hello,
I try to crawl a webpage (full domain) but never will be crawled more than the startpage. In the Datasources UI I tried http and https, with www and without, with trailing slash and without. It never works. I would expect that the crawler will follow the links found in the startpage. I have no idea why it does not work as expected.
(The whole installation was made on bullseye with "one command" as documented in https://opensemanticsearch.org/doc/admin/install/search_server/ )