Open mikegerber opened 3 years ago
Calling `ocrd-calamari-recognize` without `checkpoint` currently yields:
We had on-demand downloading in the resource lookup mechanism but we removed it because there are too many situations where you do not want that behavior and considered it more reasonable to log an error with an `ocrd resmgr download` call to remedy it.
Yeah, I wouldn't want that automatic downloading anyway. It's just that we put the default model into the container image already, so we could put it at the expected default location!
> so we could put it at the expected default location
Understood. The easiest way in Docker is to put the resources in `/usr/local/share/ocrd-resources/<name-of-processor>` or to define `XDG_DATA_HOME` to a place of your convenience in the Dockerfile and put the models in `$XDG_DATA_HOME/ocrd-resources/<name-of-processor>`.
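To make the two locations concrete, here is a minimal Python sketch that builds both directories for a given processor. It is not OCR-D's actual lookup code: the function name `candidate_resource_dirs`, the order of the returned list, and the fallback to `~/.local/share` when `XDG_DATA_HOME` is unset (the XDG default) are assumptions for illustration.

```python
import os
from pathlib import Path

# System-wide location mentioned above.
SYSTEM_DIR = Path("/usr/local/share/ocrd-resources")


def candidate_resource_dirs(processor, env=None):
    """Return the per-user and system-wide resource directories for a processor.

    Hypothetical helper; the order of the result is an assumption.
    """
    env = os.environ if env is None else env
    # Assumption: fall back to the XDG default when XDG_DATA_HOME is unset.
    xdg_data_home = env.get("XDG_DATA_HOME") or str(Path.home() / ".local" / "share")
    return [
        Path(xdg_data_home) / "ocrd-resources" / processor,
        SYSTEM_DIR / processor,
    ]


if __name__ == "__main__":
    for directory in candidate_resource_dirs("ocrd-calamari-recognize"):
        print(directory, "(exists)" if directory.is_dir() else "(missing)")
```

Running this inside the container image shows quickly whether the model was copied to a directory that matches one of the two patterns.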
> Consider that OCR-D resources currently don't reflect our model versions
They should though, so if you point me to any new models not yet in the registry, I'll gladly add them.
That's one of my famously badly phrased TODO items: I have to check it out first, and see
Minimal support is now in test/github-actions - I am going to merge this to master soon. Needs docs because it is currently not easy to use.
In test/github-actions, this now works:
ocrd resmgr download -a ocrd-sbb-textline-detector default
ocrd-sbb-textline-detector -I OCR-D-IMG-BIN-TEST-OLENA -O OCRD-D-SEG-LINE-TEST-SBB-TEXTLINE -P model ~/.local/share/ocrd-resources/ocrd-sbb-textline-detector/default
It could be improved but for now it's ok. `ocrd` (first) does not run with the correct container image to find `ocrd-sbb-textline-detector` and so relies on the central resource list.
* [ ] The wrapper could - as a workaround - detect that we're calling `ocrd resmgr` and use the correct image. It does not seem much of a hack because all the images are based on the same core image anyway. And as OCR-D plans to use thin containers in the future (making ocrd-galley obsolete/redundant) this is probably the smart solution (i.e. don't invest too much engineering in this.)

@kba FYI, this was the problem I had discussed with you today.
We need this now, because we can't download resources for ocrd_tesserocr currently (no entries in the central list anymore).
We now run the correct image for `ocrd resmgr` `download` and `list-available` by looking at the arguments and matching against the list of executables.
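The argument matching can be sketched roughly as follows. This is not the wrapper's actual code: the image names, the `IMAGE_EXECUTABLES` table, and the function `image_for` are hypothetical, and the executable lists are shortened for illustration.

```python
# Hypothetical table: container image name -> executables shipped in it.
IMAGE_EXECUTABLES = {
    "ocrd_tesserocr": ["ocrd-tesserocr-recognize", "ocrd-tesserocr-segment-region"],
    "sbb_textline_detector": ["ocrd-sbb-textline-detector"],
}

# Assumed name of the image that provides the plain `ocrd` command.
CORE_IMAGE = "core"

# Only these resmgr subcommands are redirected, as described above.
RESMGR_SUBCOMMANDS = {"download", "list-available"}


def image_for(argv):
    """Pick the image for a command line given as a list of arguments."""
    is_resmgr = (
        argv[:2] == ["ocrd", "resmgr"]
        and len(argv) > 2
        and argv[2] in RESMGR_SUBCOMMANDS
    )
    if is_resmgr:
        # Scan the remaining arguments for a known processor executable.
        for arg in argv[3:]:
            for image, executables in IMAGE_EXECUTABLES.items():
                if arg in executables:
                    return image
    # Everything else stays on the core image.
    return CORE_IMAGE


if __name__ == "__main__":
    print(image_for(["ocrd", "resmgr", "download", "-a",
                     "ocrd-sbb-textline-detector", "default"]))
```

Falling back to the core image keeps every other `ocrd` call unchanged, so the workaround only affects the two resmgr subcommands that need processor-specific resource lists.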
ocrd_tesserocr seems to be special now as it/`ocrd resmgr` uses Tesseract's `/usr/local/tessdata` by default. This is somewhat inconsistent with the other processors, but @bertsky had his reasons and we try to make the default work.
Yes, we started supporting the `module` location (besides `data`, `system` and `cwd`) for processors that come with preinstalled models or configs (like preset files) – think distutils. Then we immediately switched ocrd-tesserocr over to this model, because previously we had this unfortunate situation that the standalone CLI uses a precompiled resource location (tessdata-dir), while our wrapper needed the OCR-D locations. So now, whatever you've used as `PREFIX` to install Tesseract (from OS it would be something like /usr/share/tesseract-ocr/4.00/tessdata, from ocrd_all it would be venv/share/tessdata) – that will be used for OCR-D, too.
So it's not at all inconsistent, just another possibility that we support. The typegroups classifier also uses its (Python) module location. And bashlib processors obviously need this, too. (For example, all of workflow-configuration's XSL scripts.)
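As an illustration of what a Python module location means in practice, the sketch below resolves the directory in which an installed module lives, which is where files shipped alongside the code would sit. This is only an assumption about the general technique, not the code used by OCR-D core, ocrd_tesserocr, or the typegroups classifier; the helper name `module_location` is hypothetical.

```python
from importlib.util import find_spec
from pathlib import Path


def module_location(module_name):
    """Return the directory containing an importable top-level module or package.

    Hypothetical helper; raises LookupError if the module has no file on disk.
    """
    spec = find_spec(module_name)
    if spec is None or not spec.has_location or spec.origin is None:
        raise LookupError(f"no file location for module {module_name!r}")
    # For a package, origin points at its __init__.py, so the parent
    # directory is the package directory itself.
    return Path(spec.origin).parent


if __name__ == "__main__":
    # The standard-library package `json` is used only as a stand-in here.
    print(module_location("json"))
```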
BTW, since this issue is also about model download in CI: have a look at my ocrd_detectron2 CI, where I use a GitHub Action to cache the `ocrd resmgr download` results – shared between branches and runs. (I initially tried to use GitHub Action artifacts for this, but there you are only allowed to store 500 MB, which is ridiculous. The Cache only has 10 GB, but that's enough in my case.)