Feat/generate trainingsets

M3ssman commented 3 years ago

Include generation of training data sets from OCR formats such as ALTO V3, PAGE 2013 and PAGE 2019, together with image files (TIFF, JPEG).

kba commented 3 years ago

I've tested it now: unit tests pass, and I managed to extract image-text pairs from the kant_aufklaerung_1784 sample in assets:

$ python3 ./generate_sets.py -d ../assets/data/kant_aufklaerung_1784/data/OCR-D-GT-PAGE/PAGE_0017_PAGE.xml -i ../assets/data/kant_aufklaerung_1784/data/OCR-D-IMG/INPUT_0017.tif 
[SUCCESS] created '20' training data sets, please review

It would be useful to make -o required or at least print the output directory as part of the SUCCESS message.

Could the -i argument be optional and by default be derived from imageFilename (PAGE) / sourceImageInformation/filename (ALTO)?
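A rough sketch of how such a default could be derived, assuming only the Python standard library. The helper name guess_image_name is hypothetical and not part of this PR; it reads the imageFilename attribute of the PAGE Page element or the fileName child of the ALTO sourceImageInformation element, ignoring the concrete namespace version:

```python
import xml.etree.ElementTree as ET
from typing import Optional


def _local(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit('}', 1)[-1]


def guess_image_name(xml_text: str) -> Optional[str]:
    """Return the image file name referenced by a PAGE or ALTO document.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        name = _local(elem.tag)
        # PAGE: <Page imageFilename="...">
        if name == 'Page' and elem.get('imageFilename'):
            return elem.get('imageFilename')
        # ALTO: <sourceImageInformation><fileName>...</fileName>
        if name == 'sourceImageInformation':
            for child in elem:
                if _local(child.tag) == 'fileName' and child.text:
                    return child.text.strip()
    # Neither format carried a usable reference
    return None
```

The returned name is whatever the XML producer wrote, so it may have to be resolved relative to the XML file and can be missing or stale; an explicit -i would still need to take precedence.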

We also need a section on at least the CLI usage in the README.md

M3ssman commented 3 years ago

For the Arabic text that is included as a text resource (288652) and that is causing trouble with bidi, please see the original (binarized) image: 288652

Shreeshrii commented 3 years ago

@kba Do you know of any Devanagari or any other Indic language datasets in PAGE XML format? I only have scanned page images and their ground truth in text format. I don't think those will work with this PR.

kba commented 3 years ago

@kba Do you know of any Devanagari or any other Indic language datasets in PAGE XML format? I only have scanned page images and their ground truth in text format. I don't think those will work with this PR.

Sorry, I do not. But maybe you have OCR results in Devanagari to test the mechanics of this PR? What problems do you foresee with Devanagari?

Shreeshrii commented 3 years ago

What problems do you foresee with Devanagari?

I don't foresee any, but wanted to test with complex scripts, just in case there is any difference in processing.

maybe you have OCR results in Devanagari to test the mechanics of this PR?

Good idea. I can test using ALTO output from tesseract.

Devanagari or any other Indic language datasets in Page XML format

I found a set of files at https://github.com/ramayanaocr/ocr-comparison/tree/master/Transkribus/Input, which has the PNG files as well as the XML files (generated by Transkribus, I guess). I tested with one of those files: while the console messages reported success, the files were not created. The summary option created a file, but the file had only empty lines.

 tesstrain-extract-gt  /home/ubuntu/ocr-comparison/Transkribus/Input/page/ram110.xml -i /home/ubuntu/ocr-comparison/Transkribus/Input/ram110.png
[INFO   ] generate trainingsets of '/home/ubuntu/ocr-comparison/Transkribus/Input/page/ram110.xml' with '/home/ubuntu/ocr-comparison/Transkribus/Input/ram110.png' (min: 1, sum: False, reorder: False)
[SUCCESS] created '24' training data sets in 'training_data_ram110', please review

I tested with the Arabic image shared earlier in this thread with its xml file in resources, just to make sure that I had the PR installed correctly. That worked i.e. created the files. I haven't looked at the text within them.

tesstrain-extract-gt /home/ubuntu/tesstrain/tests/resources/xml/288652.xml -i /home/ubuntu/pagedeva/288652.png -o /home/ubuntu/pagedeva/output -s
[INFO   ] generate trainingsets of '/home/ubuntu/tesstrain/tests/resources/xml/288652.xml' with '/home/ubuntu/pagedeva/288652.png' (min: 1, sum: True, reorder: False)
[SUCCESS] created '33' training data sets in '/home/ubuntu/pagedeva/output', please review

Is there a compatibility issue with Transkribus-generated PAGE files?

Shreeshrii commented 3 years ago

I tested just now with ALTO output from tesseract and get the following warnings:

 tesstrain-extract-gt /home/ubuntu/tesstrain-San/test/iast/sandocs_2.xml -i /home/ubuntu/tesstrain-San/test/iast/sandocs_2.png -s
[INFO   ] generate trainingsets of '/home/ubuntu/tesstrain-San/test/iast/sandocs_2.xml' with '/home/ubuntu/tesstrain-San/test/iast/sandocs_2.png' (min: 1, sum: True, reorder: False)
/home/ubuntu/miniforge3/lib/python3.7/site-packages/numpy/core/_methods.py:234: RuntimeWarning: Degrees of freedom <= 0 for slice
  keepdims=keepdims)
/home/ubuntu/miniforge3/lib/python3.7/site-packages/numpy/core/_methods.py:195: RuntimeWarning: invalid value encountered in true_divide
  arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
/home/ubuntu/miniforge3/lib/python3.7/site-packages/numpy/core/_methods.py:226: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
[SUCCESS] created '5' training data sets in 'training_data_sandocs_2', please review

EDIT: Earlier error with ALTO was because of typo in filename.

M3ssman commented 3 years ago

@Shreeshrii Thanks for pointing out PAGE files that have no `Word` elements at all!

That was the cause of the missing results in the provided Devanagari sample. I tried to fix this and integrated the file as a new test resource. Unfortunately, I can't say anything about the textual outcome, so please update the PR and have a look again ...
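A minimal sketch of such a fallback, assuming the PAGE 2013 namespace and standard-library parsing only; the function names are hypothetical and the actual fix in the PR may look different. Text is taken from Word elements when present and from the TextLine's own TextEquiv otherwise:

```python
import xml.etree.ElementTree as ET

# Assumed namespace (PAGE 2013); other PAGE versions use a different date suffix
PAGE_NS = {'pc': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15'}


def _unicode_text(elem):
    """Text of the TextEquiv/Unicode directly below elem, or '' if absent."""
    node = elem.find('pc:TextEquiv/pc:Unicode', PAGE_NS)
    return (node.text or '').strip() if node is not None else ''


def line_texts(page_xml: str):
    """Yield (line id, text) per TextLine, preferring Word-level text."""
    root = ET.fromstring(page_xml)
    for line in root.iter('{%s}TextLine' % PAGE_NS['pc']):
        words = [_unicode_text(w) for w in line.findall('pc:Word', PAGE_NS)]
        words = [w for w in words if w]
        # No Word elements (or only empty ones): use the line-level text
        text = ' '.join(words) if words else _unicode_text(line)
        if text:
            yield line.get('id'), text
```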

Shreeshrii commented 3 years ago

@M3ssman I tried just now but am getting the same result as before.

 git log -3
commit 3fb94996ac42818b302850080a6f2535db12251e (HEAD -> pagesets)
Author: M3ssman <uwe.hartwig@bitsrc.info>
Date:   Sun Dec 13 10:44:47 2020 +0100

    [app][fix] handle page without word elements

commit 2f3566bc23a848e3df7801b2fa1a6ce1d417e7bc
Author: M3ssman <uwe.hartwig@bitsrc.info>
Date:   Mon Dec 7 14:19:58 2020 +0100

    [app][fix] filter invalid lines

commit 57ba229ace0c9ae74afb889916cba3555ef7b4d0
Author: M3ssman <uwe.hartwig@bitsrc.info>
Date:   Mon Dec 7 13:18:48 2020 +0100

    [app][test] fix test imports

 tesstrain-extract-gt  /home/ubuntu/ocr-comparison/Transkribus/Input/page/ram110.xml -i /home/ubuntu/ocr-comparison/Transkribus/Input/ram110.png -s
[INFO   ] generate trainingsets of '/home/ubuntu/ocr-comparison/Transkribus/Input/page/ram110.xml' with '/home/ubuntu/ocr-comparison/Transkribus/Input/ram110.png' (min: 1, sum: True, reorder: False)
[SUCCESS] created '24' training data sets in 'training_data_ram110', please review

However, only the summary file is created in 'training_data_ram110'. File is attached.

ram110_summary.gt.txt

PS: I looked at the XML file and the Devanagari text in it has errors, so it is probably raw OCRed text and not corrected text for groundtruth.

Shreeshrii commented 3 years ago

I also tried with the ALTO 4.1 XML referenced in the issue I opened (https://github.com/OCR-D/ocrd_fileformat/issues/23). That fails with the following messages:

(base) ubuntu@tesseract-ocr-1:~/tesstrain-pagesets$ tesstrain-extract-gt /home/ubuntu/OCR_GS_Data/TypeFaces/persian_watts_typeface/data/ahsan_at_tavarikh_31.xml -i /home/ubuntu/OCR_GS_Data/TypeFaces/persian_watts_typeface/data/ahsan_at_tavarikh_31.png -s
[INFO   ] generate trainingsets of '/home/ubuntu/OCR_GS_Data/TypeFaces/persian_watts_typeface/data/ahsan_at_tavarikh_31.xml' with '/home/ubuntu/OCR_GS_Data/TypeFaces/persian_watts_typeface/data/ahsan_at_tavarikh_31.png' (min: 1, sum: True, reorder: False)
Traceback (most recent call last):
  File "/home/ubuntu/miniforge3/bin/tesstrain-extract-gt", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/miniforge3/lib/python3.7/site-packages/generate_sets/cli.py", line 74, in main
    reorder=REORDER)
  File "/home/ubuntu/miniforge3/lib/python3.7/site-packages/generate_sets/training_sets.py", line 351, in create
    self.xml_data, min_len=min_chars, reorder=reorder)
  File "/home/ubuntu/miniforge3/lib/python3.7/site-packages/generate_sets/training_sets.py", line 184, in text_line_factory
    ns_prefix = _determine_namespace(xml_data)
  File "/home/ubuntu/miniforge3/lib/python3.7/site-packages/generate_sets/training_sets.py", line 223, in _determine_namespace
    return [k for (k, v) in XML_NS.items() if v == root_tag][0]
IndexError: list index out of range

M3ssman commented 3 years ago

@Shreeshrii Thanks for pointing towards ALTO V4. I missed this before, since we're using the latest official stable release, Tesseract 4.1, which doesn't create this kind of ALTO data. I've added the ALTO V4 namespace declaration and it worked fine. I found this somewhat surprising, since the ALTO V4 data from OpenITI you pointed out looks quite unfamiliar: its String CONTENT spans a complete text line. I've never seen this before. Where does this data come from?
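To illustrate the kind of lookup involved: the traceback above fails because no known namespace matches the root element. The following is a minimal sketch, not the PR's actual code; the prefix names in XML_NS are assumptions (only the namespace URIs are the published ones), and it raises a descriptive error instead of an IndexError for unknown formats:

```python
import xml.etree.ElementTree as ET

# Assumed prefix names; the PR's real XML_NS mapping may differ
XML_NS = {
    'alto3': 'http://www.loc.gov/standards/alto/ns-v3#',
    'alto4': 'http://www.loc.gov/standards/alto/ns-v4#',
    'page2013': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15',
    'page2019': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15',
}


def determine_namespace(root: ET.Element) -> str:
    """Return the known prefix for the root element's namespace."""
    # ElementTree stores qualified tags as '{uri}local'
    uri = root.tag[1:].split('}', 1)[0] if root.tag.startswith('{') else ''
    for prefix, known in XML_NS.items():
        if known == uri:
            return prefix
    raise ValueError(
        f"unsupported XML namespace '{uri}', known: {sorted(XML_NS)}")
```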

Regarding the Devanagari issue: your git log looks fine, the version matches. Maybe the tesstrain-extract-gt in your current, active environment is outdated, so please drop it and do a fresh install afterwards. You can also run pytest -v to execute the test cases included so far (with their test datasets) and check the temporary outputs in your local /tmp/pytest-of-<account> dir.

Shreeshrii commented 3 years ago

pytest -v
============================================================================================================ test session starts ============================================================================================================
platform linux -- Python 3.7.6, pytest-6.2.0, py-1.10.0, pluggy-0.13.1 -- /home/ubuntu/miniforge3/bin/python3.7
cachedir: .pytest_cache
rootdir: /home/ubuntu/tesstrain-pagesets
collected 8 items

tests/test_generate_sets.py::test_create_sets_from_alto_and_tif PASSED                                                                                                                                                                [ 12%]
tests/test_generate_sets.py::test_create_sets_from_page2013_and_jpg PASSED                                                                                                                                                            [ 25%]
tests/test_generate_sets.py::test_create_sets_from_page2013_and_jpg_no_summary PASSED                                                                                                                                                 [ 37%]
tests/test_generate_sets.py::test_create_sets_from_page2019_and_png PASSED                                                                                                                                                            [ 50%]
tests/test_generate_sets.py::test_create_sets_from_ocrd_workdspace PASSED                                                                                                                                                             [ 62%]
tests/test_generate_sets.py::test_create_sets_from_ocrd_workdspace_fails PASSED                                                                                                                                                       [ 75%]
tests/test_generate_sets.py::test_handle_invalid_coords PASSED                                                                                                                                                                        [ 87%]
tests/test_generate_sets.py::test_handle_page_devanagari_with_texlines PASSED                                                                                                                                                         [100%]

============================================================================================================= warnings summary ==============================================================================================================
tests/test_generate_sets.py::test_create_sets_from_alto_and_tif
  /home/ubuntu/miniforge3/lib/python3.7/site-packages/numpy/core/_methods.py:234: RuntimeWarning: Degrees of freedom <= 0 for slice
    keepdims=keepdims)

tests/test_generate_sets.py::test_create_sets_from_alto_and_tif
  /home/ubuntu/miniforge3/lib/python3.7/site-packages/numpy/core/_methods.py:195: RuntimeWarning: invalid value encountered in true_divide
    arrmean, rcount, out=arrmean, casting='unsafe', subok=False)

tests/test_generate_sets.py::test_create_sets_from_alto_and_tif
  /home/ubuntu/miniforge3/lib/python3.7/site-packages/numpy/core/_methods.py:226: RuntimeWarning: invalid value encountered in double_scalars
    ret = ret.dtype.type(ret / rcount)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
====================================================================================================== 8 passed, 3 warnings in 12.88s =======================================================================================================

(base) ubuntu@tesseract-ocr-1:~/tesstrain-pagesets$ tesstrain-extract-gt  /home/ubuntu/ocr-comparison/Transkribus/Input/page/ram110.xml -i /home/ubuntu/ocr-comparison/Transkribus/Input/ram110.png -s
[INFO   ] generate trainingsets of '/home/ubuntu/ocr-comparison/Transkribus/Input/page/ram110.xml' with '/home/ubuntu/ocr-comparison/Transkribus/Input/ram110.png' (min: 1, sum: True, reorder: False)
[SUCCESS] created '24' training data sets in 'training_data_ram110', please review

(base) ubuntu@tesseract-ocr-1:~/tesstrain-pagesets$ ls -l training_data_ram110
total 4
-rw-rw-r-- 1 ubuntu ubuntu 24 Dec 15 04:20 ram110_summary.gt.txt

Shreeshrii commented 3 years ago

The files are generated as part of the test:

(base) ubuntu@tesseract-ocr-1:/tmp/pytest-of-ubuntu/pytest-current/test_handle_page_devanagari_wicurrent$ ls -l
total 34492
-rw-rw-r-- 1 ubuntu ubuntu 22618835 Dec 15 04:18 ram110.png
-rw-rw-r-- 1 ubuntu ubuntu     2515 Dec 15 04:18 ram110_summary.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu      187 Dec 15 04:18 ram110_tl_10.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   603624 Dec 15 04:18 ram110_tl_10.tif
-rw-rw-r-- 1 ubuntu ubuntu       37 Dec 15 04:18 ram110_tl_11.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   266846 Dec 15 04:18 ram110_tl_11.tif
-rw-rw-r-- 1 ubuntu ubuntu      117 Dec 15 04:18 ram110_tl_12.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   550042 Dec 15 04:18 ram110_tl_12.tif
-rw-rw-r-- 1 ubuntu ubuntu      108 Dec 15 04:18 ram110_tl_13.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   601434 Dec 15 04:18 ram110_tl_13.tif
-rw-rw-r-- 1 ubuntu ubuntu      151 Dec 15 04:18 ram110_tl_14.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   651804 Dec 15 04:18 ram110_tl_14.tif
-rw-rw-r-- 1 ubuntu ubuntu      102 Dec 15 04:18 ram110_tl_15.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   520708 Dec 15 04:18 ram110_tl_15.tif
-rw-rw-r-- 1 ubuntu ubuntu      102 Dec 15 04:18 ram110_tl_16.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   516418 Dec 15 04:18 ram110_tl_16.tif
-rw-rw-r-- 1 ubuntu ubuntu      107 Dec 15 04:18 ram110_tl_17.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   745854 Dec 15 04:18 ram110_tl_17.tif
-rw-rw-r-- 1 ubuntu ubuntu      148 Dec 15 04:18 ram110_tl_18.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   615958 Dec 15 04:18 ram110_tl_18.tif
-rw-rw-r-- 1 ubuntu ubuntu      157 Dec 15 04:18 ram110_tl_19.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   560244 Dec 15 04:18 ram110_tl_19.tif
-rw-rw-r-- 1 ubuntu ubuntu       43 Dec 15 04:18 ram110_tl_1.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   127490 Dec 15 04:18 ram110_tl_1.tif
-rw-rw-r-- 1 ubuntu ubuntu      106 Dec 15 04:18 ram110_tl_20.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   561928 Dec 15 04:18 ram110_tl_20.tif
-rw-rw-r-- 1 ubuntu ubuntu      107 Dec 15 04:18 ram110_tl_21.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   662002 Dec 15 04:18 ram110_tl_21.tif
-rw-rw-r-- 1 ubuntu ubuntu      127 Dec 15 04:18 ram110_tl_22.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   548104 Dec 15 04:18 ram110_tl_22.tif
-rw-rw-r-- 1 ubuntu ubuntu      115 Dec 15 04:18 ram110_tl_23.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   704092 Dec 15 04:18 ram110_tl_23.tif
-rw-rw-r-- 1 ubuntu ubuntu       17 Dec 15 04:18 ram110_tl_24.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   105892 Dec 15 04:18 ram110_tl_24.tif
-rw-rw-r-- 1 ubuntu ubuntu        6 Dec 15 04:18 ram110_tl_2.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu    32458 Dec 15 04:18 ram110_tl_2.tif
-rw-rw-r-- 1 ubuntu ubuntu      137 Dec 15 04:18 ram110_tl_3.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   741314 Dec 15 04:18 ram110_tl_3.tif
-rw-rw-r-- 1 ubuntu ubuntu      145 Dec 15 04:18 ram110_tl_4.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   712610 Dec 15 04:18 ram110_tl_4.tif
-rw-rw-r-- 1 ubuntu ubuntu       36 Dec 15 04:18 ram110_tl_5.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   337642 Dec 15 04:18 ram110_tl_5.tif
-rw-rw-r-- 1 ubuntu ubuntu       99 Dec 15 04:18 ram110_tl_6.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   495246 Dec 15 04:18 ram110_tl_6.tif
-rw-rw-r-- 1 ubuntu ubuntu       97 Dec 15 04:18 ram110_tl_7.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   581738 Dec 15 04:18 ram110_tl_7.tif
-rw-rw-r-- 1 ubuntu ubuntu      103 Dec 15 04:18 ram110_tl_8.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   518032 Dec 15 04:18 ram110_tl_8.tif
-rw-rw-r-- 1 ubuntu ubuntu      137 Dec 15 04:18 ram110_tl_9.gt.txt
-rw-rw-r-- 1 ubuntu ubuntu   761348 Dec 15 04:18 ram110_tl_9.tif
-rw-rw-r-- 1 ubuntu ubuntu    22586 Dec 15 04:18 ram110.xml

How do I ensure that latest tesstrain-extract-gt is being used?

Shreeshrii commented 3 years ago

The image should look like the following. But in /tmp/pytest-of-ubuntu/pytest-current/test_handle_page_devanagari_wicurrent, the PNG file as well as the generated TIFs show ??? rather than the Devanagari text of the image.

ram110

The generated gt.txt is correct (i.e. it is in Devanagari script) but the images are not.

ram110_summary.gt.txt

Shreeshrii commented 3 years ago

ALTO V4 data from OpenITI you pointed out looks quite unfamiliar, having String CONTENT spanned over a complete textline. I've never seen this before. Where does this data come from?

I do not know more than the info available online. Please see https://github.com/OpenITI/RELEASE and https://zenodo.org/record/4075046#.X9hC0dgzaUk

M3ssman commented 3 years ago

@Shreeshrii Please note that test images are created on the fly with a library that, out of the box, can render only a very small subset of characters (I guess only ASCII), so neither Arabic, Persian, Devanagari nor old German Fraktur letters. This was introduced to keep test data small and free from binary image files. It only gives you a hint whether the lines would match the "words".

M3ssman commented 3 years ago

@Shreeshrii Regarding the latest version: currently, there's only a pre-beta version (0.0.1) annotated in setup.py. Usually this would be the place to track versioning. I do not know how to make use of repository information directly at this point. Maybe @kba can give us a hint?
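One possible way to get repository information into the version, sketched with the standard library only. The function name git_version and the fallback handling are assumptions, not something this PR contains; it asks git describe for a label and falls back to the static version when git or the checkout is unavailable:

```python
import subprocess


def git_version(fallback: str = '0.0.1', cwd: str = '.') -> str:
    """Return `git describe` output for cwd, or fallback if that fails."""
    try:
        result = subprocess.run(
            ['git', 'describe', '--tags', '--always', '--dirty'],
            cwd=cwd, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        # git missing, cwd missing, or not inside a repository
        return fallback
    return result.stdout.strip() or fallback
```

Note that raw git describe output is not necessarily a valid PEP 440 version string; the setuptools_scm project addresses that, should an extra build dependency be acceptable.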

Shreeshrii commented 3 years ago

@M3ssman Thanks for the explanations regarding test files.

Maybe tesstrain-extract-gt in your current, active environment is outdated, so please drop it and do a fresh install afterwards.

You were right about this.

I removed tesstrain-extract-gt from the bin directories and reinstalled in the environment where ocrd is installed. It works now. All the tif and gt.txt were created for the Transkribus Devanagari file.
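A quick way to check which copy is actually picked up by the active environment, as a sketch using only the standard library. The helper name locate is hypothetical; the command and module names are the ones visible in the traceback earlier in this thread:

```python
import importlib.util
import shutil


def locate(command: str, module: str) -> dict:
    """Report which executable and module copy the current environment resolves."""
    spec = importlib.util.find_spec(module)
    return {
        'executable': shutil.which(command),      # None if not on PATH
        'module': spec.origin if spec else None,  # None if not importable
    }


if __name__ == '__main__':
    # Both paths should point into the same environment
    print(locate('tesstrain-extract-gt', 'generate_sets'))
```

If the two paths point into different environments, a stale script is most likely shadowing the fresh install.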

The ALTO 4.1 Persian file is also generating line images and text (I haven't checked the RTL issue yet).

This is great!! Thank you.

M3ssman commented 3 years ago

@Shreeshrii You're welcome!

Sorry for the confusion regarding RTL. It turned out that the -r flag aims at something different from real RTL handling, which can be done with py-bidi. If active, it only rearranges word tokens by their top-left corner in descending order, starting from the right margin. Therefore I renamed it to --reorder. It doesn't reverse characters. I had to deal with Arabic PAGE-XML exported from Transkribus, which had inconsistent reading orders and display artifacts and almost made me go crazy.

Since this relies on individual coordinates for each token, I'm afraid it will have no effect on test resources like the ones gathered from OpenITI which only have a single String@CONTENT element that represents a text line in total (or at least more than just one word). Reordering this way requires proper coordinates below text line level: We can't just chop the lines and reorder tokens, since the source order of elements of a plain text line is certainly not always reliable.
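As an illustration of what --reorder does as described above (a simplified sketch, not the PR's implementation; the Token type and its fields are made up for this example): tokens of one line are sorted by the x coordinate of their top-left corner in descending order, and the characters inside each token stay untouched:

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    x: int  # left edge of the token's bounding box
    y: int  # top edge of the token's bounding box


def reorder_tokens(tokens: List[Token]) -> List[Token]:
    """Sort the tokens of one text line from the right margin to the left."""
    return sorted(tokens, key=lambda t: t.x, reverse=True)
```

With a single String element per line there is only one token to sort, which is why the flag has no effect on such data.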

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Shreeshrii commented 3 years ago

This should not be closed. It needs review by someone familiar with RTL languages.

lgtm-com[bot] commented 3 years ago

This pull request introduces 4 alerts when merging f3e73e47ca18d09ee6ba2ed3a5ea16b3f3c33620 into fa57d619e239694b9d4073eaf5b9150d0b4fae68 - view on LGTM.com

new alerts:

3 for `__init__` method calls overridden method
1 for 'import *' may pollute namespace

M3ssman commented 3 years ago

I've been talking with a colleague, https://github.com/galdring, about this review, and he is trying to find someone for us.

stweil commented 3 years ago

@M3ssman, please check git config user.name. Your commits use that name for the author information.

M3ssman commented 3 years ago

There's also an already outdated branch with the same name (feat/generate-trainingsets) in this repository, which I guess @kba created to commit his extensions before I integrated them and they finally went to ulb-sachsen-anhalt/tesstrain/tree/feat/generate-trainingsets. I wonder if this causes any irritations?

lgtm-com[bot] commented 3 years ago

This pull request introduces 4 alerts when merging 21c718f6140ba366c68d9194509f93205717b705 into fa57d619e239694b9d4073eaf5b9150d0b4fae68 - view on LGTM.com

new alerts:

3 for `__init__` method calls overridden method
1 for 'import *' may pollute namespace

kba commented 3 years ago

I wonder if this causes any irritations?

I don't think so but I deleted the branch since it is outdated as you say.

lgtm-com[bot] commented 3 years ago

This pull request introduces 4 alerts when merging 23edc0685cd62c760849b6e288a58a7c9b991733 into fa57d619e239694b9d4073eaf5b9150d0b4fae68 - view on LGTM.com

new alerts:

3 for `__init__` method calls overridden method
1 for 'import *' may pollute namespace

lgtm-com[bot] commented 3 years ago

This pull request introduces 4 alerts when merging ea8464bc779986d9ca9dd9d28e59f2e392c9e3ea into 0d972f86f4aaf88fde77e3445ff607e68866c882 - view on LGTM.com

new alerts:

3 for `__init__` method calls overridden method
1 for 'import *' may pollute namespace

lgtm-com[bot] commented 3 years ago

This pull request introduces 4 alerts when merging 325d7942a516c3c980846459f2bcba2971aae59d into 0d972f86f4aaf88fde77e3445ff607e68866c882 - view on LGTM.com

new alerts:

3 for `__init__` method calls overridden method
1 for 'import *' may pollute namespace

lgtm-com[bot] commented 3 years ago

This pull request introduces 4 alerts when merging cf54dd9f73b94df92af177baa70a22307473fd70 into 0d972f86f4aaf88fde77e3445ff607e68866c882 - view on LGTM.com

new alerts:

3 for `__init__` method calls overridden method
1 for 'import *' may pollute namespace

lgtm-com[bot] commented 3 years ago

This pull request introduces 1 alert when merging 162148dbe00f1da30d584bd2539920f43087b253 into 40334e4e8abc6ea793e481b21bcd8e076fa7a8ba - view on LGTM.com

new alerts:

1 for Unused import

zdenop commented 1 year ago

@M3ssman: can you please update your PR to the current git code? (Python code is now in src; see "Migrate Python code to a dedicated package".)

M3ssman commented 1 year ago

@zdenop Sorry for the late reply.

What layout do you prefer: <project_root>/src/extract_sets, or integrating training_sets.py somehow into <project_root>/src as part of <project_root>/src/tesstrain?

stefan6419846 commented 1 year ago

If I understood @zdenop correctly, the final goal is to make everything available through the tesstrain Python package in the end. As you provide a dedicated entry point, src/tesstrain sounds like the appropriate package.

Nevertheless, I am not sure about the external dependencies. They should probably be made optional (extras_require).

M3ssman commented 1 year ago

@stefan6419846 Thanks for your reply! Do you suggest moving these dependencies into the extras_require argument of setuptools.setup?

stefan6419846 commented 1 year ago

@M3ssman If you are going to integrate the training set generator into the existing Python package, I would suggest yes. At least to me, they appear to be overkill for most users who just want to use the basic artificial training functionality.

bertsky commented 1 year ago

Nevertheless, I am not sure about the external dependencies. They should probably be made optional (extras_require).

At least to me, they appear to be overkill for most users who just want to use the basic artificial training functionality.

I disagree with that assessment. The pkg for synthetic training is as relevant as some way to import from the widely used file formats (ALTO, PAGE) for real GT training IMO. So if the trainingsets extension is adopted (at all), then its dependencies should not be moved to extras_require.
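For reference, the two variants under discussion differ only in where the extra dependencies are declared. A sketch under assumptions: the extra name extract and the helper build_setup_kwargs are hypothetical, and numpy stands in for the full dependency list, which is not given in this thread:

```python
# Placeholder list: numpy appears in the warnings earlier in this thread,
# the complete set of dependencies is not stated here
EXTRACT_DEPS = ['numpy']


def build_setup_kwargs(optional: bool) -> dict:
    """Build keyword arguments for setuptools.setup() for either variant."""
    kwargs = {'name': 'tesstrain', 'install_requires': []}
    if optional:
        # Variant A: only installed on request, via a (hypothetical) extra
        kwargs['extras_require'] = {'extract': list(EXTRACT_DEPS)}
    else:
        # Variant B: always installed together with the package
        kwargs['install_requires'] = list(EXTRACT_DEPS)
    return kwargs

# In setup.py one would then call: setup(**build_setup_kwargs(optional=True))
```

With the optional variant, users would opt in via something like pip install tesstrain[extract]; with the other, a plain install pulls everything in.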

tesseract-ocr / tesstrain

Feat/generate trainingsets #205