opensemanticsearch / open-semantic-etl

Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction and named entity recognition), and data enrichment (annotation) pipelines, plus an ingestor to a Solr or Elasticsearch index and a linked data graph database
GNU General Public License v3.0

Automated tests #94

Closed opensemanticsearch closed 3 years ago

opensemanticsearch commented 4 years ago

Automated tests by Python unittest

Mandalka commented 4 years ago

Added a first automated test using Python unittest, which automatically tests the extraction of text and title from test.pdf by Tika and the OCR of embedded PNG and JPG images by the tesseract plugin.

Mandalka commented 4 years ago

Added a separate unittest for the Tika plugin for text extraction.

Mandalka commented 4 years ago

Added a separate unittest for OCR of embedded images in PDF files and for deskewing by Scantailor.

Mandalka commented 4 years ago

Added a check for ETL file that detects whether a plugin threw an exception that was not caught within the plugin itself. The ETL manages such exceptions, so an exception in one plugin does not break the processing of other plugins or of the whole document.
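To illustrate the idea, here is a minimal sketch of how a pipeline could isolate plugin failures and how a unittest could check for them. It is not the project's actual code: `run_plugins`, the `failed_plugins` key, and the three toy plugins are hypothetical names chosen only for this example.

```python
import unittest


def run_plugins(plugins, data):
    """Run every plugin on data; record failing plugins instead of aborting.

    Hypothetical helper: the real ETL has its own plugin loader and its
    own way of reporting errors.
    """
    data["failed_plugins"] = []
    for plugin in plugins:
        try:
            plugin(data)
        except Exception as exc:
            # Keep the failure local to this plugin and remember it,
            # so the remaining plugins still process the document.
            data["failed_plugins"].append((plugin.__name__, repr(exc)))
    return data


def add_title(data):
    data["title"] = "Test document"


def broken_plugin(data):
    raise ValueError("simulated plugin failure")


def add_language(data):
    data["language"] = "en"


class TestPluginIsolation(unittest.TestCase):
    def test_failure_is_recorded_and_other_plugins_still_run(self):
        data = run_plugins([add_title, broken_plugin, add_language], {})
        # Plugins before and after the broken one did their work.
        self.assertEqual(data["title"], "Test document")
        self.assertEqual(data["language"], "en")
        # The test can now fail loudly on a plugin error that the
        # pipeline itself swallowed.
        failed = [name for name, _ in data["failed_plugins"]]
        self.assertEqual(failed, ["broken_plugin"])

    def test_clean_run_reports_no_failures(self):
        data = run_plugins([add_title, add_language], {})
        self.assertEqual(data["failed_plugins"], [])
```

Saved to a file, this can be run with `python -m unittest <file>`. In a real test the assertion would typically be that the list of failed plugins is empty for every test document.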

Mandalka commented 4 years ago

Added tests for the spaCy NER plugin with test sentences for English and German.

Mandalka commented 4 years ago

Added documentation to

Mandalka commented 4 years ago

Added tests and test images for OCR of JPG and PNG and disabling OCR by Tika plugin (using Tika server).

Mandalka commented 4 years ago

Implemented automated tests in the unittest of Open Semantic Entity Search API for adding entities to the Solr index by the entity manager, extracting them from full text by the Solr text tagger, and normalizing/linking them by the entity linker.

Mandalka commented 4 years ago

Added tests for plugin for mapping of IDs/URLs/Paths.
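As a rough illustration of what such a test can look like, the sketch below maps a file path to a URL by prefix replacement and checks the result with unittest. The function `map_id`, the mapping table, and the longest-prefix rule are assumptions made for this example, not the plugin's actual interface or behavior.

```python
import unittest


def map_id(doc_id, mappings):
    """Replace the longest matching prefix of doc_id, if any.

    Hypothetical helper; choosing the longest prefix is an assumption
    of this sketch.
    """
    best = None
    for prefix in mappings:
        if doc_id.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    if best is None:
        return doc_id  # nothing to map, keep the ID unchanged
    return mappings[best] + doc_id[len(best):]


# Example mapping table (made-up paths and hosts).
MAPPINGS = {
    "/var/www/": "https://example.org/",
    "/var/www/docs/": "https://docs.example.org/",
}


class TestMapId(unittest.TestCase):
    def test_path_is_mapped_to_url(self):
        self.assertEqual(
            map_id("/var/www/index.html", MAPPINGS),
            "https://example.org/index.html",
        )

    def test_longest_prefix_wins(self):
        self.assertEqual(
            map_id("/var/www/docs/a.pdf", MAPPINGS),
            "https://docs.example.org/a.pdf",
        )

    def test_unmapped_id_is_unchanged(self):
        self.assertEqual(map_id("/home/user/a.pdf", MAPPINGS), "/home/user/a.pdf")
```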

Mandalka commented 4 years ago

Added tests for language detection plugin.

Mandalka commented 4 years ago

Added tests for email address extraction and email domain extraction plugin.
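For readers who want to see the shape of such a test, here is a small self-contained sketch. The regular expression and the function `extract_emails` are simplified assumptions for illustration and are not taken from the plugin; the pattern is deliberately simple and not a full RFC 5322 parser.

```python
import re
import unittest

# Simplified pattern: local part, "@", then a dotted domain (captured).
EMAIL_PATTERN = re.compile(
    r"[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
)


def extract_emails(text):
    """Return (emails, domains) found in text, lowercased and de-duplicated.

    Hypothetical helper; order of first appearance is preserved.
    """
    emails, domains = [], []
    for match in EMAIL_PATTERN.finditer(text):
        email = match.group(0).lower()
        domain = match.group(1).lower()
        if email not in emails:
            emails.append(email)
        if domain not in domains:
            domains.append(domain)
    return emails, domains


class TestExtractEmails(unittest.TestCase):
    def test_addresses_and_domains(self):
        text = (
            "Contact alice@example.org or Bob@Mail.Example.com. "
            "Again: alice@example.org"
        )
        emails, domains = extract_emails(text)
        self.assertEqual(emails, ["alice@example.org", "bob@mail.example.com"])
        self.assertEqual(domains, ["example.org", "mail.example.com"])

    def test_text_without_addresses(self):
        self.assertEqual(extract_emails("no address here"), ([], []))
```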

Mandalka commented 4 years ago

Added test for import of file from WARC web archive format.

Mandalka commented 4 years ago

Added tests for enhance_ocr_descew for deskewing by Scantailor and OCR by tesseract.

Mandalka commented 3 years ago

Added full stack integration tests by docker-compose.etl.test.yml

Mandalka commented 3 years ago

More unittests for modules that do not have them yet will follow during further TDD.