opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

Unittest test_warc (test_enhance_warc.Test_enhance_warc) fails due to bug in pysolr #154

Open opensemanticsearch opened 2 years ago

opensemanticsearch commented 2 years ago

The unit test fails because it cannot delete the indexed document after the test:

======================================================================
ERROR: test_warc (test_enhance_warc.Test_enhance_warc)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/opensemanticetl/test_enhance_warc.py", line 31, in test_warc
    etl_delete.delete(contained_doc_id)
  File "/usr/lib/python3/dist-packages/opensemanticetl/etl_delete.py", line 60, in delete
    self.connector.delete(parameters=self.config, docid=uri)
  File "/usr/lib/python3/dist-packages/opensemanticetl/export_solr.py", line 351, in delete
    result = solr.delete(id=docid)
  File "/usr/lib/python3/dist-packages/pysolr.py", line 960, in delete
    return self._update(m, commit=commit, softCommit=softCommit, waitFlush=waitFlush, waitSearcher=waitSearcher, handler=handler)
  File "/usr/lib/python3/dist-packages/pysolr.py", line 500, in _update
    return self._send_request('post', path, message, {'Content-type': 'text/xml; charset=utf-8'})
  File "/usr/lib/python3/dist-packages/pysolr.py", line 412, in _send_request
    raise SolrError(error_message % (resp.status_code, solr_message))
pysolr.SolrError: Solr responded with an error (HTTP 400): [Reason: Unexpected character ':' (code 58) excepted space, or '>' or "/>"
 at [row,col {unknown-source}]: [1,41]]

Cause: a bug in pysolr, reported at https://github.com/django-haystack/pysolr/issues/368

It seems we have to wait for a new pysolr release in the Python package repository: https://github.com/django-haystack/pysolr/issues/373
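
For reference while waiting for that release, here is a minimal sketch of what a well-formed Solr delete-by-id message looks like for a document ID that contains characters such as `:` or `&`. It assumes the HTTP 400 above comes from a malformed XML delete message. `build_delete_message` is a hypothetical helper (not part of pysolr or open-semantic-etl), and the example URI is made up.

```python
from xml.sax.saxutils import escape


def build_delete_message(doc_id):
    """Return a Solr XML delete-by-id message for a single document ID.

    Hypothetical helper for illustration only. XML special characters
    in the ID ('&', '<', '>') are escaped so the message stays well-formed.
    """
    return "<delete><id>%s</id></delete>" % escape(doc_id)


# Example with a made-up document URI containing ':' and '&'
message = build_delete_message("file:///tmp/a&b.warc")
print(message)
```

The resulting string could be posted to the Solr update handler with the `text/xml` content type seen in the traceback. Whether that sidesteps the pysolr bug in a given setup is untested here.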