Open mikegerber opened 2 years ago
I was testing with a dummy Lorem Ipsum text and didn't notice that this produces 3x the output text due to the sliding window processing.
→ Back to figuring out how to preprocess text into the JSON format
With the newest changes, this works:
#!/bin/sh
OUT_DIR=.
create-ocr-json-of-single-page \
    test-ocr.txt \
    test-ocr.json
run-two-step-pipeline-on-single-page \
    test-ocr.json \
    ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/models/trained_detector_model_512_3L_LSTM_bidirec_070920_138.pt \
    ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/models/trained_translator_model_256_1L_LSTM_monodirec_100920_876.pt \
    ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/hyper_params/hyper_params_detector.json \
    ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/hyper_params/hyper_params_translator.json \
    ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/code_to_token/code_to_token_mapping_detector_150620.json \
    ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/code_to_token/code_to_token_mapping_translator_080920.json \
    "$OUT_DIR"
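The 3x output mentioned above is easy to miss with dummy text, so a cheap size comparison after the pipeline run may help catch it. The following is only a sketch: `check_size_ratio` is a hypothetical helper, not part of sbb_ocr_postcorrection, and the default threshold of 2 is an arbitrary assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of the tool): compare byte sizes of the
# input text and the corrected output and warn if the output is suspiciously
# large. Usage: check_size_ratio INPUT OUTPUT [MAX_RATIO]
check_size_ratio() {
    # $(( ... )) normalizes the whitespace some wc implementations print.
    in_size=$(($(wc -c < "$1")))
    out_size=$(($(wc -c < "$2")))
    max_ratio=${3:-2}   # assumed default threshold
    # Integer comparison: flag the output if out_size > max_ratio * in_size.
    if [ "$out_size" -gt $((in_size * max_ratio)) ]; then
        echo "warning: $2 is more than ${max_ratio}x the size of $1 ($out_size vs $in_size bytes)" >&2
        return 1
    fi
    return 0
}
```

It could be called as `check_size_ratio test-ocr.txt corrected_page.txt`, assuming the pipeline writes its corrected text to `corrected_page.txt` in the output directory.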
With an additional step that reconstructs the original line boundaries:
#!/bin/sh
set -ex
OUT_DIR=.
create-ocr-json-of-single-page \
    actevedef_718448162_00000024.txt \
    actevedef_718448162_00000024.txt.json
run-two-step-pipeline-on-single-page \
    actevedef_718448162_00000024.txt.json \
    ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/models/trained_detector_model_512_3L_LSTM_bidirec_070920_138.pt \
    ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/models/trained_translator_model_256_1L_LSTM_monodirec_100920_876.pt \
    ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/hyper_params/hyper_params_detector.json \
    ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/hyper_params/hyper_params_translator.json \
    ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/code_to_token/code_to_token_mapping_detector_150620.json \
    ~/devel/qurator-data/sbb_ocr_postcorrection/data_models_final_121121/code_to_token/code_to_token_mapping_translator_080920.json \
    "$OUT_DIR"
reconstruct-single-page-line-boundaries \
    corrected_page.txt \
    line_ids.json \
    corrected_page_reconst.txt
This seems to work: