
qurator-spk / ocrd-galley

A Dockerized test environment for OCR-D processors 🚢

Apache License 2.0


Processors should work with the default models/resources #46

Open mikegerber opened 3 years ago

mikegerber commented 3 years ago

Calling ocrd-calamari-recognize without checkpoint currently yields:

15:06:34.301 ERROR ocrd.ocrd-calamari-recognize.resolve_resource - Could not find resource 'qurator-gt4histocr-1.0' for executable 'ocrd-calamari-recognize'. Try 'ocrd resmgr download ocrd-calamari-recognize qurator-gt4histocr-1.0' to download this resource.

[ ] ocrd_calamari
[ ] ocrd_tesserocr?
[ ] sbb_binarization
[ ] sbb_textline_detection?
[ ] eynollah?
[ ] Consider how OCR-D resources currently reflect our model versions

kba commented 3 years ago

> Calling ocrd-calamari-recognize without checkpoint currently yields:

We had on-demand downloading in the resource lookup mechanism, but we removed it because there are too many situations where you do not want that behavior. We considered it more reasonable to log an error that names the ocrd resmgr download call needed to remedy it.

mikegerber commented 3 years ago

> > Calling ocrd-calamari-recognize without checkpoint currently yields:

> We had on-demand downloading in the resource lookup mechanism, but we removed it because there are too many situations where you do not want that behavior. We considered it more reasonable to log an error that names the ocrd resmgr download call needed to remedy it.

Yeah, I wouldn't want that automatic downloading anyway. It's just that we already put the default model into the container image, so we could put it at the expected default location!

kba commented 3 years ago

> so we could put it at the expected default location

Understood. The easiest way in Docker is to put the resources in /usr/local/share/ocrd-resources/<name-of-processor>, or to set XDG_DATA_HOME to a location of your choice in the Dockerfile and put the models in $XDG_DATA_HOME/ocrd-resources/<name-of-processor>.
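The two locations described above can be sketched as a small shell helper, for example for use in a RUN step of a Dockerfile. This is only an illustration under stated assumptions: the function names `ocrd_resource_dir` and `install_resource` are hypothetical and not part of OCR-D or ocrd-galley, and falling back to the /usr/local/share location when XDG_DATA_HOME is unset is a choice made for this sketch.

```shell
# Hypothetical helper (not part of OCR-D): print the directory in which
# resources for a processor would be placed, following the two locations
# named in the comment above.
ocrd_resource_dir() {
    processor="$1"
    if [ -n "${XDG_DATA_HOME:-}" ]; then
        # XDG_DATA_HOME was set, e.g. via ENV in the Dockerfile
        printf '%s/ocrd-resources/%s\n' "$XDG_DATA_HOME" "$processor"
    else
        # Assumption for this sketch: otherwise use the system-wide location
        printf '/usr/local/share/ocrd-resources/%s\n' "$processor"
    fi
}

# Hypothetical helper: copy a model that is already present in the image
# into the processor's resource directory, keeping its name.
install_resource() {
    processor="$1"
    source_path="$2"
    target_dir="$(ocrd_resource_dir "$processor")"
    mkdir -p "$target_dir"
    cp -R "$source_path" "$target_dir/"
}
```

With these definitions, something like `install_resource ocrd-calamari-recognize /path/to/qurator-gt4histocr-1.0` (the source path is a placeholder) would put the model where the processor's default lookup is expected to find it.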

> Consider that OCR-D resources currently don't reflect our model versions

They should though, so if you point me to any new models not yet in the registry, I'll gladly add them.

mikegerber commented 3 years ago

> > Consider that OCR-D resources currently don't reflect our model versions

> They should though, so if you point me to any new models not yet in the registry, I'll gladly add them.

That's one of my famously badly phrased TODO items: I have to check it out first, and see

mikegerber commented 1 year ago

Minimal support is now in test/github-actions - I am going to merge this to master soon. Needs docs because it is currently not easy to use.

mikegerber commented 1 year ago

In test/github-actions, this now works:

ocrd resmgr download -a ocrd-sbb-textline-detector default
ocrd-sbb-textline-detector -I OCR-D-IMG-BIN-TEST-OLENA -O OCRD-D-SEG-LINE-TEST-SBB-TEXTLINE -P model ~/.local/share/ocrd-resources/ocrd-sbb-textline-detector/default

It could be improved, but for now it's OK. The first command (ocrd) does not run in the container image that contains ocrd-sbb-textline-detector, so it cannot find the processor and relies on the central resource list instead.

[x] The wrapper could - as a workaround - detect that we're calling ocrd resmgr and use the correct image. It does not seem like much of a hack because all the images are based on the same core image anyway. And as OCR-D plans to use thin containers in the future (making ocrd-galley obsolete/redundant), this is probably the smart solution (i.e. don't invest too much engineering in this).

@kba FYI, this was the problem I had discussed with you today.

mikegerber commented 1 year ago

> * [ ] The wrapper could - as a workaround - detect that we're calling `ocrd resmgr` and use the correct image. It does not seem like much of a hack because all the images are based on the same core image anyway. And as OCR-D plans to use thin containers in the future (making ocrd-galley obsolete/redundant), this is probably the smart solution (i.e. don't invest too much engineering in this).

We need this now, because we can't download resources for ocrd_tesserocr currently (no entries in the central list anymore).

mikegerber commented 1 year ago

We now run the correct image for ocrd resmgr download and list-available by looking at the arguments and matching against the list of executables.
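The argument matching described above could look roughly like the following sketch. It is not the actual ocrd-galley wrapper code: the function names, the executable-to-image table, and the image names (including `core` as the fallback) are all hypothetical and chosen only for illustration.

```shell
# Hypothetical table mapping a processor executable to an image name.
# Both the entries and the image names are assumptions for this sketch.
image_for_executable() {
    case "${1:-}" in
        ocrd-calamari-recognize)    echo "ocrd_calamari" ;;
        ocrd-tesserocr-*)           echo "ocrd_tesserocr" ;;
        ocrd-sbb-textline-detector) echo "sbb_textline_detector" ;;
        *)                          return 1 ;;
    esac
}

# Print the image to use for a full command line.
# For "ocrd resmgr download" and "ocrd resmgr list-available", scan the
# remaining arguments for a known executable; otherwise match the command
# itself. Anything unknown falls back to the (hypothetical) core image.
select_image() {
    if [ "${1:-}" = "ocrd" ] && [ "${2:-}" = "resmgr" ]; then
        case "${3:-}" in
            download|list-available)
                shift 3
                for arg in "$@"; do
                    if image_for_executable "$arg"; then
                        return 0
                    fi
                done
                ;;
        esac
        echo "core"
        return 0
    fi
    image_for_executable "${1:-}" || echo "core"
}
```

Called as `select_image ocrd resmgr download -a ocrd-sbb-textline-detector default`, the sketch skips `-a`, recognizes the executable name, and prints the image name from the table.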

mikegerber commented 1 year ago

ocrd_tesserocr seems to be special now, as it (and ocrd resmgr) uses Tesseract's /usr/local/tessdata by default. This is somewhat inconsistent with the other processors, but @bertsky had his reasons, and we will try to make the default work.

[ ] Support this

bertsky commented 1 year ago

> ocrd_tesserocr seems to be special now, as it (and ocrd resmgr) uses Tesseract's /usr/local/tessdata by default. This is somewhat inconsistent with the other processors, but @bertsky had his reasons, and we will try to make the default work.

Yes, we started supporting the module location (besides data, system and cwd) for processors that come with preinstalled models or configs (like preset files) – think distutils. Then we immediately switched ocrd-tesserocr over to this model, because previously we had this unfortunate situation that the standalone CLI uses a precompiled resource location (tessdata-dir), while our wrapper needed the OCR-D locations. So now, whatever you've used as PREFIX to install Tesseract (from OS it would be something like /usr/share/tesseract-ocr/4.00/tessdata, from ocrd_all it would be venv/share/tessdata) – that will be used for OCR-D, too.

So it's not at all inconsistent, just another possibility that we support. The typegroups classifier also uses its (Python) module location. And bashlib processors obviously need this, too. (For example, all of workflow-configuration's XSL scripts.)
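To make the four location types (cwd, data, system, module) concrete, here is a minimal sketch of a lookup across them. It is not the OCR-D implementation: the function names are hypothetical, the search order is an assumption made for illustration, and the module directory is passed in by hand as a third argument.

```shell
# Hypothetical helper: list candidate paths for a resource, one per line.
# The order (cwd, data, system, module) is an assumption for this sketch.
resource_candidates() {
    executable="$1"
    name="$2"
    module_dir="${3:-}"   # assumption: caller knows the module's install dir
    data_home="${XDG_DATA_HOME:-${HOME:-}/.local/share}"
    printf '%s/%s\n' "$PWD" "$name"                                         # cwd
    printf '%s/ocrd-resources/%s/%s\n' "$data_home" "$executable" "$name"   # data
    printf '/usr/local/share/ocrd-resources/%s/%s\n' "$executable" "$name"  # system
    if [ -n "$module_dir" ]; then
        printf '%s/%s\n' "$module_dir" "$name"                              # module
    fi
}

# Hypothetical helper: print the first candidate that exists.
# Returns non-zero if none of the candidates exists.
resolve_resource() {
    candidates="$(resource_candidates "$@")"
    while IFS= read -r candidate; do
        if [ -e "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done <<EOF
$candidates
EOF
    return 1
}
```

In this sketch, a processor that ships preinstalled models would pass its own install directory as the third argument, so those models are found without any download, while a copy in one of the other locations would be preferred under the assumed order.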

BTW, since this issue is also about model download in CI: have a look at my ocrd_detectron2 CI, where I use a GitHub Action to cache the ocrd resmgr download results, shared between branches and runs. (I initially tried to use GitHub Actions artifacts for this, but there you are only allowed to store 500 MB, which is ridiculous. The cache only has 10 GB, but that's enough in my case.)