opensemanticsearch closed this 3 years ago
Added the first automated test, written with Python unittest, for etl_file.py in test_etl_file.py. It automatically tests the extraction of text and title from test.pdf by Tika and the OCR of embedded PNG and JPG images by the tesseract plugin.
Added a separate unittest for the Tika plugin for text extraction.
Added a separate unittest for OCR of embedded images in PDF files and for deskewing by Scantailor.
Added a check to the ETL file tests for whether a plugin threw an uncaught exception. Such an exception is uncaught only within the plugin: the ETL catches it, so an exception in one plugin does not break processing by other plugins or of the whole document.
Added tests for the spaCy NER plugin with test sentences for English and German.
Added tests and test images for OCR of JPG and PNG and for disabling OCR by the Tika plugin (using Tika server).
Implemented automated tests in the unittest of the Open Semantic Entity Search API: adding entities to the Solr index by the entity manager, extracting them from full text by the Solr text tagger, and normalizing/linking them by the entity linker.
Added tests for plugin for mapping of IDs/URLs/Paths.
Added tests for language detection plugin.
Added tests for email address extraction and email domain extraction plugin.
Added test for import of file from WARC web archive format.
Added tests for enhance_ocr_descew for deskewing by Scantailor and OCR by tesseract.
Added full stack integration tests by docker-compose.etl.test.yml
More unittests for modules that are not yet covered will follow during further TDD.
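The check for uncaught plugin exceptions described above could look roughly like the following. This is only a minimal sketch: `run_plugins`, the `failed_plugins` key and the two example plugins are hypothetical stand-ins invented for illustration, not the actual ETL code or its field names. It shows the idea that the runner catches an exception per plugin, records which plugin failed, and that a test can assert on that record.

```python
import unittest


def run_plugins(plugins, data):
    """Run each (name, plugin) pair; record failures instead of propagating them.

    Hypothetical stand-in for the ETL plugin loop, not the project's real code.
    """
    data.setdefault("failed_plugins", [])
    for name, plugin in plugins:
        try:
            plugin(data)
        except Exception:
            # One failing plugin must not stop the remaining plugins.
            data["failed_plugins"].append(name)
    return data


def broken_plugin(data):
    """Example plugin that always fails (assumption for the test)."""
    raise ValueError("simulated plugin failure")


def title_plugin(data):
    """Example plugin that sets a title (assumption for the test)."""
    data["title"] = "Test PDF"


class TestPluginIsolation(unittest.TestCase):
    def test_failure_is_recorded_and_other_plugins_still_run(self):
        data = run_plugins([("broken", broken_plugin), ("title", title_plugin)], {})
        self.assertEqual(data["title"], "Test PDF")
        self.assertEqual(data["failed_plugins"], ["broken"])

    def test_clean_run_records_no_failures(self):
        data = run_plugins([("title", title_plugin)], {})
        self.assertEqual(data["failed_plugins"], [])
```

Saved as a test module, this can be run with `python -m unittest`. Asserting that the failure list is empty is what turns an exception swallowed by the runner into a visible test failure.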
Automated tests by Python unittest
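As an illustration of what such a unittest can look like, here is a small sketch in the spirit of the tests for email address and email domain extraction. The function `extract_emails` and its simplified regular expression are assumptions made for this example only; they are not the plugin's real implementation or interface.

```python
import re
import unittest

# Simplified pattern for illustration only (assumption, not the plugin's regex).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text):
    """Return (addresses, domains): lowercased, de-duplicated and sorted."""
    addresses = sorted({match.lower() for match in EMAIL_PATTERN.findall(text)})
    domains = sorted({address.split("@", 1)[1] for address in addresses})
    return addresses, domains


class TestEmailExtraction(unittest.TestCase):
    TEXT = (
        "Contact Jane.Doe@example.org or support@mail.example.com. "
        "Duplicate: jane.doe@example.org"
    )

    def test_addresses(self):
        addresses, _ = extract_emails(self.TEXT)
        self.assertEqual(
            addresses, ["jane.doe@example.org", "support@mail.example.com"]
        )

    def test_domains(self):
        _, domains = extract_emails(self.TEXT)
        self.assertEqual(domains, ["example.org", "mail.example.com"])

    def test_text_without_addresses(self):
        self.assertEqual(extract_emails("no address here"), ([], []))
```

The test text deliberately contains a duplicate in different case and a trailing full stop after an address, since those are the cases a naive pattern tends to get wrong.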