多种 OCR 引擎结果评估后处理

wanghaisheng commented 6 years ago

Process, enhance and evaluate multiple ocr-ouput. https://github.com/UB-Mannheim/ocromore

wanghaisheng commented 6 years ago

Parsing all ocr-outputfiles to an database
(This step only has to be done once)
Pre-process the gathered information
The results from the following processes can also be stored directly to the database
- Line-matching all files
- Unspacing words in each file
  Unspacing means to delete whitespaces in spaced text
  (E.g. H e l l o => Hello)
- Word-matching all files per line
Combine file information
- Different compare methods
  - Textdistance-Keying
    - Levenshtein
    - Damerau-Levenshtein
    - ...
  - Multi-Sequence-Alignment (MSA)
    - pivot-based
    - linewise/wordwise
    - Adjustable search-space-processor correction
      - Matching similar character
      - Whitespace/Wildcard improvements
    - Adjustable decision parameter
      - Char confidence
      - Best-of-n
The output can be stored in the database and/or as .txt or .hocr.
Evaluate the output against groundtruth files or each other and generate a accuracy report. Or compare the files visual via diff-tools.

wanghaisheng commented 6 years ago

https://github.com/KBNLresearch/ochre

Ochre

Ochre is a toolbox for OCR post-correction.

Overview of OCR post-correction data sets
Preprocess data sets
Train character-based language models/LSTMs for OCR post-correction
Do the post-correction
Assess the performance of OCR post-correction
Analyze OCR errors

Ochre contains ready-to-use data processing workflows (based on CWL). The software also allows you to create your own (OCR post-correction related) workflows. Examples of how to create these can be found in the notebooks directory (to be able to use those, make sure you have Jupyter Notebooks installed). This directory also contains notebooks that show how results can be analyzed and visualized.

wanghaisheng / awesome-ocr

多种 OCR 引擎结果评估后处理 #79