qurator-spk / sbb_binarization

Document Image Binarization
Apache License 2.0
67 stars 14 forks source link

update to GH release archive as model URL #59

Closed bertsky closed 1 year ago

bertsky commented 1 year ago

@cneud I also updated the second URL for the older, 4-fold model. Nearly tripped myself over the subdirectory problem (loader needs a subdirectory, and because the old 4-fold model already contains submodel directories, no problem, but the newer does not, so path_in_archive must be just .).

bertsky commented 1 year ago

Do you want me to add some CI BTW?

cneud commented 1 year ago

This was something I wanted to check with the repacked archive before merging, since I am running into a dir/subdir problem also with Eynollah and the repacked models. I only just now learned about "path_in_archive"!

I was surprised there was no CI here after wrestling with it over there, so yeah, anything in that regard is welcome.

cneud commented 1 year ago

We should have long gotten rid of that subrepo for the test...

bertsky commented 1 year ago

We should have long gotten rid of that subrepo for the test...

Not sure about that. I do find the subrepo pattern useful, but tests must not use the clone directly.

I initially did not see there's already a CircleCI so I set up a GH Action. Then I also updated the CircleCI config (renaming its worklows). Now it seems both succeed, only the old CircleCI config (identified by the old workflow name) – presumably on master – does not.

cneud commented 1 year ago

Much appreciate the thoroughness!

bertsky commented 1 year ago

Fixes #55 Fixes #50 Fixes #58

cneud commented 1 year ago

Many thanks! Will also test this out locally in a bit.

Btw I managed to get rid of the CircleCi tests getting killed for Eynollah by switching from the docker executor to a linux vm executor (which supposedly has more RAM).

cneud commented 1 year ago

Locally on Python 3.8 the test fails for me, I am investigating.

Btw I think only the newer 2021-03-09 model should be downloaded and installed by default, as this saves ~500MB.

(sbb_binarization-1): $ make test ocrd resmgr download ocrd-sbb-binarize "" 20:12:23.858 INFO ocrd.cli.resmgr - Downloading registered resource 'default-2021-03-09' (https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2021_03_09.zip) [------------------------------------] 0% 20:12:26.515 INFO ocrd.resource_manager._download_impl - Downloading https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2021_03_09.zip to download.tar.xx [####################################] 100%
20:12:37.520 INFO ocrd.resource_manager.download - Extracting application/zip archive to /tmp/tmpxr178zws/out 20:12:38.566 INFO ocrd.resource_manager.download - Copying '.' from archive to /home/cnd/.local/share/ocrd-resources/ocrd-sbb-binarize/default-2021-03-09 20:12:38.707 INFO ocrd.cli.resmgr - Installed resource https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2021_03_09.zip under /home/cnd/.local/share/ocrd-resources/ocrd-sbb-binarize/default-2021-03-09 20:12:38.707 INFO ocrd.cli.resmgr - Use in parameters as 'default-2021-03-09' 20:12:38.707 INFO ocrd.cli.resmgr - Downloading registered resource 'default' (https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2020_01_16.zip) [------------------------------------] 0% 20:12:38.708 INFO ocrd.resource_manager._download_impl - Downloading https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2020_01_16.zip to download.tar.xx [####################################] 100%
20:13:23.277 INFO ocrd.resource_manager.download - Extracting application/zip archive to /tmp/tmpfih8exg7/out 20:13:26.987 INFO ocrd.resource_manager.download - Copying 'saved_model_2020_01_16' from archive to /home/cnd/.local/share/ocrd-resources/ocrd-sbb-binarize/default 20:13:27.496 INFO ocrd.cli.resmgr - Installed resource https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2020_01_16.zip under /home/cnd/.local/share/ocrd-resources/ocrd-sbb-binarize/default 20:13:27.496 INFO ocrd.cli.resmgr - Use in parameters as 'default' ocrd-sbb-binarize -m test/assets/kant_aufklaerung_1784/data/mets.xml -I OCR-D-IMG -O BIN -P model default Traceback (most recent call last): File "/home/cnd/tmp/sbb_binarization-1/bin/ocrd-sbb-binarize", line 11, in load_entry_point('sbb-binarization', 'console_scripts', 'ocrd-sbb-binarize')() File "/home/cnd/tmp/sbb_binarization-1/lib/python3.8/site-packages/click/core.py", line 1130, in call return self.main(
args, kwargs) File "/home/cnd/tmp/sbb_binarization-1/lib/python3.8/site-packages/click/core.py", line 1055, in main rv = self.invoke(ctx) File "/home/cnd/tmp/sbb_binarization-1/lib/python3.8/site-packages/click/core.py", line 1404, in invoke return ctx.invoke(self.callback, ctx.params) File "/home/cnd/tmp/sbb_binarization-1/lib/python3.8/site-packages/click/core.py", line 760, in invoke return __callback(*args, kwargs) File "/home/cnd/git/sbb_binarization/sbb_binarize/ocrd_cli.py", line 162, in cli return ocrd_cli_wrap_processor(SbbBinarizeProcessor, *args, *kwargs) File "/home/cnd/tmp/sbb_binarization-1/lib/python3.8/site-packages/ocrd/decorators/init.py", line 68, in ocrd_cli_wrap_processor workspace = resolver.workspace_from_url(mets, working_dir) File "/home/cnd/tmp/sbb_binarization-1/lib/python3.8/site-packages/ocrd/resolver.py", line 166, in workspace_from_url self.download_to_directory(dst_dir, mets_url, basename=mets_basename, if_exists='overwrite' if clobber_mets else 'skip') File "/home/cnd/tmp/sbb_binarization-1/lib/python3.8/site-packages/ocrd/resolver.py", line 82, in download_to_directory raise FileNotFoundError("File path passed as 'url' to download_to_directory does not exist: %s" % url) FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: /home/cnd/git/sbb_binarization/test/assets/kant_aufklaerung_1784/data/mets.xml make: [Makefile:38: test] Error 1

bertsky commented 1 year ago

Locally on Python 3.8 the test fails for me, I am investigating.

judging by your log snippet I would say that what's missing is a git submodule update --init prior to the make test. I guess you already had a directory repo/assets but no checkout. (Same is true on the CI, hence the extra git commands there.)

Try:

make clean
git submodule update --init
make test

Btw I think only the newer 2021-03-09 model should be downloaded and installed by default, as this saves ~500MB.

Sure, I just thought the test should be thorough (testing both the simple and the multi-scale model). The models are cached on CI but that's no different than downloading the files from a fast server.

Feel free to change the tests accordingly, or let me know if I should do it.

cneud commented 1 year ago

OK, I was using only the Makefile and did

make install make models make repo/assets make test/assets make test

I was missing a git submodule update --init. Now all tests complete. Shouldn't the submodule init also be included in the Makefile targets?

bertsky commented 1 year ago

Ah, nice timing!

Shouldn't the submodule init also be included in the Makefile targets?

It is – but it only depends on repo/assets. If that exists, but is empty, then nothing happens.

cneud commented 1 year ago

Perfect timing ;)

I just thought the test should be thorough (testing both the simple and the multi-scale model)

Agreed but since a new better hybrid model is already here, we should also think about phasing out supporting older ones - at some point.

cneud commented 1 year ago

If that exists, but is empty, then nothing happens.

I started with a fresh venv, cloned the repo and checked out your change-model-url branch, then ran the steps above and would have assumed to have a working test. I only found out about the git submodule update --init from the CI.

Edit: I did run make repo/assets and make test/assets, so what was going on here?

cneud commented 1 year ago

Found this in my history, my fault

make test cp -r -t test/assets repo/assets/data/ cp: cannot stat 'repo/assets/data/': No such file or directory make: *** [Makefile:33: test/assets] Error 1

I did run test once before make repo/assets.

bertsky commented 1 year ago

I started with a fresh venv, cloned the repo and checked out your change-model-url branch, then ran the steps above and would have assumed to have a working test. I only found out about the git submodule update --init from the CI.

Edit: I did run make repo/assets and make test/assets, so what was going on here?

Agreed, it should have just worked. Turns out that a repo with submodules always yields empty subdirectories unless you clone recursively. I fixed it by making the rule depend on a subdirectory (part of assets). That also allows simplifying the CI rules.

bertsky commented 1 year ago

@cneud you might wanna do a new release (for the changes under unreleased) on Github (not the models, only the code) and PyPI.

cneud commented 1 year ago

do a new release

Yes, asap!

bertsky commented 1 year ago

Btw I managed to get rid of the CircleCi tests getting killed for Eynollah by switching from the docker executor to a linux vm executor (which supposedly has more RAM).

You mean this change? According to the docs this allows for 15 GB instead of just 8 GB RAM on the free tier. Should we give it a try here (on master)? (Will probably have to install Python first then.)

cneud commented 1 year ago

Yes, with this to set the Python version. Based on the docs for machine executors, the medium linux-vm has 7.5 GB, which was sufficient to make the tests for eynollah complete.

cneud commented 1 year ago

do a new release

https://github.com/qurator-spk/sbb_binarization/releases/tag/v0.1.0 with thanks to @kba!

bertsky commented 1 year ago

Yes, with this to set the Python version.

ah, understood!

Based on the docs for machine executors, the medium linux-vm has 7.5 GB, which was sufficient to make the tests for eynollah complete.

Ok, or we first just try to set resource_class: large on the docker executor – perhaps that's already enough?

cneud commented 1 year ago

Can we use large with a free account? I guess we can try...

bertsky commented 1 year ago

Yes, it says so.

cneud commented 1 year ago

That does require double the credits (20 credits/min) used compared to the Linux VM (10 credits/min) though.

bertsky commented 1 year ago

Indeed!