wanghaisheng / awesome-ocr

A curated list of promising OCR resources
http://wanghaisheng.github.io/ocr-arxiv-daily/
MIT License
1.66k stars 351 forks source link

多种 OCR 引擎结果评估后处理 #79

Closed wanghaisheng closed 6 years ago

wanghaisheng commented 6 years ago

Process, enhance and evaluate multiple ocr-ouput. https://github.com/UB-Mannheim/ocromore

wanghaisheng commented 6 years ago

  1. Parsing all ocr-outputfiles to an database
    (This step only has to be done once)

  2. Pre-process the gathered information
    The results from the following processes can also be stored directly to the database

    • Line-matching all files
    • Unspacing words in each file
      Unspacing means to delete whitespaces in spaced text
      (E.g. H e l l o => Hello)
    • Word-matching all files per line
  3. Combine file information

    • Different compare methods
      • Textdistance-Keying
        • Levenshtein
        • Damerau-Levenshtein
        • ...
      • Multi-Sequence-Alignment (MSA)
        • pivot-based
        • linewise/wordwise
        • Adjustable search-space-processor correction
          • Matching similar character
          • Whitespace/Wildcard improvements
        • Adjustable decision parameter
          • Char confidence
          • Best-of-n
  4. The output can be stored in the database and/or as .txt or .hocr.

  5. Evaluate the output against groundtruth files or each other and generate a accuracy report. Or compare the files visual via diff-tools.

wanghaisheng commented 6 years ago

https://github.com/KBNLresearch/ochre

Ochre

Ochre is a toolbox for OCR post-correction.

Overview of OCR post-correction data sets
Preprocess data sets
Train character-based language models/LSTMs for OCR post-correction
Do the post-correction
Assess the performance of OCR post-correction
Analyze OCR errors

Ochre contains ready-to-use data processing workflows (based on CWL). The software also allows you to create your own (OCR post-correction related) workflows. Examples of how to create these can be found in the notebooks directory (to be able to use those, make sure you have Jupyter Notebooks installed). This directory also contains notebooks that show how results can be analyzed and visualized.