opensemanticsearch / open-semantic-search-apps

Python/Django based webapps and web user interfaces for search, structure (meta data management like thesaurus, ontologies, annotations and named entities) and data import (ETL like text extraction, OCR and crawling filesystems or websites)
https://opensemanticsearch.org/
GNU General Public License v3.0

Problem with indexing web URLs #22

Closed: pedrammehrdad closed this issue 6 years ago

pedrammehrdad commented 6 years ago

Hi, I've noticed that URLs containing Persian or Arabic words fail to index, whether they are submitted through the REST API or the console. For example:

Request: http://192.168.1.154/search-apps/api/index-web?uri=https://www.nimrokh.tv/news/38938/دلال-بازی-تلگرامچی-بی-مخاطب-سینمایی

Response: {"queue": "7a746810-7d03-4c97-9f78-980ae708cc88"}

So I know the request is queued and Open Semantic Search will try to crawl and index this URL, but as long as the URL contains Persian characters, the crawl fails. Any hints on this?

pedrammehrdad commented 6 years ago

This issue is related to the open-semantic-etl repository.
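
For anyone hitting the same problem, one possible client-side workaround is to percent-encode the non-ASCII parts of the target URL before handing it to the index-web endpoint. The following is a minimal, untested-against-the-server sketch in Python. It assumes the failure comes from raw Persian characters in the URL path, which this thread does not confirm. The helper names `percent_encode_url` and `build_index_request` are hypothetical and not part of Open Semantic Search; only the endpoint path and the `uri` parameter are taken from the request above.

```python
from urllib.parse import quote, urlencode, urlsplit, urlunsplit


def percent_encode_url(url: str) -> str:
    """Percent-encode non-ASCII characters (as UTF-8) in path, query and fragment.

    Hypothetical helper. '%' is kept as a safe character so that an
    already-encoded URL is not encoded a second time. The host name is
    left untouched (internationalized domain names are out of scope here).
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


def build_index_request(api_base: str, target_url: str) -> str:
    """Build the index-web request URL with the target passed as the 'uri' parameter.

    Hypothetical helper. urlencode() escapes the already ASCII-only target
    once more for transport, so the server should see the percent-encoded
    URL after decoding the query string.
    """
    return api_base + "?" + urlencode({"uri": percent_encode_url(target_url)})


if __name__ == "__main__":
    # Endpoint and target URL are the ones quoted in the report above.
    api = "http://192.168.1.154/search-apps/api/index-web"
    target = "https://www.nimrokh.tv/news/38938/دلال-بازی-تلگرامچی-بی-مخاطب-سینمایی"
    print(percent_encode_url(target))
    print(build_index_request(api, target))
```

If the crawler still fails with an ASCII-only URL like this, the cause is more likely inside the ETL code than in how the request is sent.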