qurator-spk / sbb_ocr_postcorrection

Two-Step Approach to OCR Post-Correction
Apache License 2.0

Verify and document procedure to process text #4

Open mikegerber opened 2 years ago

mikegerber commented 2 years ago

This seems to work:

OUT_DIR=.
apply-sliding-window \
  test-ocr.txt \
  test-ocr.json

run-two-step-pipeline-on-single-page \
  test-ocr.json \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/models/trained_detector_model_512_3L_LSTM_bidirec_070920_138.pt \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/models/trained_translator_model_256_1L_LSTM_monodirec_100920_876.pt \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/hyper_params/hyper_params_detector.json \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/hyper_params/hyper_params_translator.json \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/code_to_token/code_to_token_mapping_detector_150620.json \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/code_to_token/code_to_token_mapping_translator_080920.json \
  $OUT_DIR

mikegerber commented 2 years ago

I was testing with dummy Lorem Ipsum text and did not notice that this produces three times the output text, because of the sliding-window processing.

→ Back to figuring out how to preprocess text into the JSON format
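
For illustration, here is a rough sketch of how overlapping windows can multiply text. The window size of 3 tokens and the stride of 1 are assumptions chosen to match the 3x observation. They are not taken from apply-sliding-window itself, and sliding_windows is a hypothetical helper, not part of the package.

```shell
#!/bin/sh
# Illustration only: window size 3 and stride 1 are assumptions,
# not the documented behaviour of apply-sliding-window.
sliding_windows() {
  # Read whitespace-separated tokens on stdin, print one window per line.
  tr -s '[:space:]' '\n' | awk -v w=3 '
    NF { t[++n] = $0 }                 # collect non-empty tokens
    END {
      for (i = 1; i + w - 1 <= n; i++) {
        line = t[i]
        for (j = 1; j < w; j++) line = line " " t[i + j]
        print line                     # each inner token appears in w windows
      }
    }'
}
```

With `printf 'a b c d e\n' | sliding_windows`, five input tokens become three windows holding nine tokens in total. For longer inputs the ratio approaches 3, so text that is concatenated from the windows without merging the overlaps ends up roughly three times as long.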

mikegerber commented 2 years ago

With the newest changes, this works:

#!/bin/sh
OUT_DIR=.

create-ocr-json-of-single-page \
  test-ocr.txt \
  test-ocr.json

run-two-step-pipeline-on-single-page \
  test-ocr.json \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/models/trained_detector_model_512_3L_LSTM_bidirec_070920_138.pt \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/models/trained_translator_model_256_1L_LSTM_monodirec_100920_876.pt \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/hyper_params/hyper_params_detector.json \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/hyper_params/hyper_params_translator.json \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/code_to_token/code_to_token_mapping_detector_150620.json \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/code_to_token/code_to_token_mapping_translator_080920.json \
  $OUT_DIR

mikegerber commented 2 years ago

With an additional step that reconstructs the original line boundaries:

#!/bin/sh
set -ex

OUT_DIR=.

create-ocr-json-of-single-page \
  actevedef_718448162_00000024.txt \
  actevedef_718448162_00000024.txt.json

run-two-step-pipeline-on-single-page \
  actevedef_718448162_00000024.txt.json \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/models/trained_detector_model_512_3L_LSTM_bidirec_070920_138.pt \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/models/trained_translator_model_256_1L_LSTM_monodirec_100920_876.pt \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/hyper_params/hyper_params_detector.json \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/hyper_params/hyper_params_translator.json \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/code_to_token/code_to_token_mapping_detector_150620.json \
  ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/code_to_token/code_to_token_mapping_translator_080920.json \
  $OUT_DIR

reconstruct-single-page-line-boundaries \
  corrected_page.txt \
  line_ids.json \
  corrected_page_reconst.txt
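
The scripts in this thread repeat the same six model paths each time. One possible way to shorten them is sketched below. MODEL_DIR and postcorrect_page are names introduced here for illustration only. The sketch assumes the same directory layout as above, and that the intermediate files are named corrected_page.txt and line_ids.json in the current directory, as in the last script.

```shell
#!/bin/sh
# Sketch only: MODEL_DIR and postcorrect_page are hypothetical names.
# The three commands and their argument order are copied from the scripts above.
MODEL_DIR="${MODEL_DIR:-$HOME/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121}"
OUT_DIR=.

postcorrect_page() {
  txt="$1"   # plain-text OCR output of a single page

  # Step 1: convert the text file to the JSON input format.
  create-ocr-json-of-single-page "$txt" "$txt.json" || return 1

  # Step 2: run detector and translator on the page.
  run-two-step-pipeline-on-single-page \
    "$txt.json" \
    "$MODEL_DIR/models/trained_detector_model_512_3L_LSTM_bidirec_070920_138.pt" \
    "$MODEL_DIR/models/trained_translator_model_256_1L_LSTM_monodirec_100920_876.pt" \
    "$MODEL_DIR/hyper_params/hyper_params_detector.json" \
    "$MODEL_DIR/hyper_params/hyper_params_translator.json" \
    "$MODEL_DIR/code_to_token/code_to_token_mapping_detector_150620.json" \
    "$MODEL_DIR/code_to_token/code_to_token_mapping_translator_080920.json" \
    "$OUT_DIR" || return 1

  # Step 3: restore the original line boundaries (file names assumed as above).
  reconstruct-single-page-line-boundaries \
    corrected_page.txt \
    line_ids.json \
    corrected_page_reconst.txt
}
```

After sourcing this file, `postcorrect_page actevedef_718448162_00000024.txt` would run the same three steps as the last script. Setting MODEL_DIR in the environment beforehand points it at a different copy of the models.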