opensemanticsearch / open-semantic-etl

Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction & named entity recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elasticsearch index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

Automated tests #94

Closed opensemanticsearch closed 3 years ago

opensemanticsearch commented 4 years ago

Automated tests using Python unittest

Mandalka commented 4 years ago

Added the first automated test using Python unittest: test_etl_file.py covers etl_file.py and automatically tests the extraction of text and title from test.pdf by Tika and the OCR of embedded PNG and JPG images by the Tesseract plugin.

Mandalka commented 4 years ago

Added a separate unit test for the Tika plugin for text extraction.

Mandalka commented 4 years ago

Added separate unit tests for OCR of embedded images in PDF files and for deskewing by Scantailor.

Mandalka commented 4 years ago

Added a check to the ETL file test for whether a plugin threw an exception that was not caught within the plugin itself. The ETL manages such exceptions, so an exception in one plugin does not break processing by the other plugins or of the whole document.
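A minimal sketch of this kind of check, for illustration only. It does not use the real Open Semantic ETL classes: `run_plugins` and the three toy plugins are hypothetical stand-ins, and the sketch assumes the runner records plugin failures in a list that a test can inspect.

```python
import unittest


def run_plugins(plugins, data):
    """Hypothetical runner: call each plugin and record failures instead of re-raising."""
    errors = []
    for plugin in plugins:
        try:
            plugin(data)  # in this sketch, plugins modify the data dict in place
        except Exception as exc:
            # Remember which plugin failed so that a test can assert on it later
            errors.append((plugin.__name__, repr(exc)))
    return data, errors


def add_title(data):
    data["title"] = "Test PDF"


def broken_plugin(data):
    raise ValueError("simulated plugin failure")


def add_language(data):
    data["language"] = "en"


class TestPluginExceptions(unittest.TestCase):

    def test_failure_is_recorded_and_isolated(self):
        data, errors = run_plugins([add_title, broken_plugin, add_language], {})
        # The plugins before and after the broken one still did their work
        self.assertEqual(data, {"title": "Test PDF", "language": "en"})
        self.assertEqual([name for name, _ in errors], ["broken_plugin"])

    def test_clean_run_has_no_errors(self):
        _, errors = run_plugins([add_title, add_language], {})
        self.assertEqual(errors, [])
```

Saved as a test module, this can be run with `python3 -m unittest`. The second test case is the one that matters for a real document: it fails as soon as any plugin raises, even though the document as a whole was still processed.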

Mandalka commented 4 years ago

Added tests for the spaCy NER plugin with test sentences for English and German.

Mandalka commented 4 years ago

Added documentation to https://github.com/opensemanticsearch/open-semantic-etl/blob/master/src/opensemanticetl/test/README.md

Mandalka commented 4 years ago

Added tests and test images for OCR of JPG and PNG files and for disabling OCR in the Tika plugin (using Tika server).

Mandalka commented 4 years ago

Implemented automated tests in the unittest of the Open Semantic Entity Search API: adding entities to the Solr index by the entity manager, extracting them from full text by the Solr text tagger, and normalizing and linking them by the entity linker.

Mandalka commented 4 years ago

Added tests for the plugin for mapping of IDs/URLs/paths.
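As a rough illustration of what such a test can look like, the sketch below maps a file path prefix to a URL prefix. The function `map_id`, the longest-prefix rule and the example mappings are assumptions made for this example, not the plugin's actual interface or configuration.

```python
import unittest


def map_id(doc_id, mappings):
    """Hypothetical mapper: replace the longest matching prefix of doc_id."""
    for prefix in sorted(mappings, key=len, reverse=True):
        if doc_id.startswith(prefix):
            return mappings[prefix] + doc_id[len(prefix):]
    return doc_id  # no mapping applies, keep the ID unchanged


# Example mappings, invented for this sketch
MAPPINGS = {
    "/srv/docs/": "https://intranet.example.org/docs/",
    "/srv/docs/public/": "https://www.example.org/",
}


class TestMapId(unittest.TestCase):

    def test_prefix_is_replaced(self):
        self.assertEqual(
            map_id("/srv/docs/report.pdf", MAPPINGS),
            "https://intranet.example.org/docs/report.pdf",
        )

    def test_longest_prefix_wins(self):
        self.assertEqual(
            map_id("/srv/docs/public/a.pdf", MAPPINGS),
            "https://www.example.org/a.pdf",
        )

    def test_unmapped_id_is_unchanged(self):
        self.assertEqual(map_id("/tmp/x.txt", MAPPINGS), "/tmp/x.txt")
```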

Mandalka commented 4 years ago

Added tests for the language detection plugin.

Mandalka commented 4 years ago

Added tests for the email address extraction and email domain extraction plugin.
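A test of this kind could be sketched as follows. The regular expression and the function `extract_emails` are simplified, hypothetical stand-ins for the plugin (real email syntax is more permissive), and the test sentence is made up.

```python
import re
import unittest

# Simplified pattern, an assumption for this sketch: local part, "@", dotted domain
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)")


def extract_emails(text):
    """Return (addresses, domains), lowercased and deduplicated in order of appearance."""
    addresses, domains = [], []
    for match in EMAIL_RE.finditer(text):
        address = match.group(0).lower()
        domain = match.group(1).lower()
        if address not in addresses:
            addresses.append(address)
        if domain not in domains:
            domains.append(domain)
    return addresses, domains


class TestExtractEmails(unittest.TestCase):

    TEXT = ("Contact alice@example.org or Bob@Mail.Example.com; "
            "alice@example.org is preferred.")

    def test_addresses_and_domains(self):
        addresses, domains = extract_emails(self.TEXT)
        self.assertEqual(addresses, ["alice@example.org", "bob@mail.example.com"])
        self.assertEqual(domains, ["example.org", "mail.example.com"])

    def test_text_without_email(self):
        self.assertEqual(extract_emails("No address here."), ([], []))
```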

Mandalka commented 4 years ago

Added a test for importing a file from the WARC web archive format.

Mandalka commented 4 years ago

Added tests for enhance_ocr_descew for deskewing by Scantailor and OCR by Tesseract.

Mandalka commented 3 years ago

Added full-stack integration tests using docker-compose.etl.test.yml.

Mandalka commented 3 years ago

Unit tests that are still missing for some modules will be added during further test-driven development (TDD).