qurator-spk / sbb_binarization

Document Image Binarization
Apache License 2.0
67 stars 14 forks source link

broken model archive #55

Closed bertsky closed 1 year ago

bertsky commented 1 year ago

The provided model_2021_03_09 contains a bogus directory entry __MACOSX which causes (with the OCR-D wrapper):

Traceback (most recent call last):
  File "/local/ocr-d/ocrd_all/venv/bin/ocrd-sbb-binarize", line 8, in <module>
    sys.exit(cli())
  File "/local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/sbb_binarize/ocrd_cli.py", line 162, in cli
    return ocrd_cli_wrap_processor(SbbBinarizeProcessor, *args, **kwargs)
  File "/local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/ocrd/decorators/__init__.py", line 117, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/ocrd/processor/helpers.py", line 76, in run_processor
    processor = processorClass(
  File "/local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/sbb_binarize/ocrd_cli.py", line 49, in __init__
    self.setup()
  File "/local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/sbb_binarize/ocrd_cli.py", line 72, in setup
    self.binarizer = SbbBinarizer(model_dir=model_path, logger=LOG)
  File "/local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/sbb_binarize/sbb_binarize.py", line 44, in __init__
    self.models.append(self.load_model(model_file))
  File "/local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/sbb_binarize/sbb_binarize.py", line 59, in load_model
    model = load_model(model_name, compile=False)
  File "/local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/tensorflow/python/saved_model/loader_impl.py", line 115, in parse_saved_model
    raise IOError(
OSError: SavedModel file does not exist at: /local/ocr-d/ocrd_all/venv/share/ocrd-resources/ocrd-sbb-binarize/default-2021-03-09/__MACOSX//{saved_model.pbtxt|saved_model.pb}

I don't think the code responsible for path name resolution is wrong. The bogus directory simply should not exist there. So better fix the zip file, right?

cneud commented 1 year ago

The clean saved model can be obtained from here.

bertsky commented 1 year ago

The clean saved model can be obtained from here.

@cneud do you know of any way to extract a directory or archive URL from the HF repo browser? (We need that for ocrd resmgr integration...)

kba commented 1 year ago

The clean saved model can be obtained from here.

@cneud do you know of any way to extract a directory or archive URL from the HF repo browser? (We need that for ocrd resmgr integration...)

We could implement a git-lfs resource type similar to archive but instead of download/extract, it downloads the models via LFS. If we anticipate huggingface as the primary source for models in the QURATOR/MMK projects, this might be worth the effort.

But for now, can we fix the model in qurator-data.de until this is in place?

mikegerber commented 1 year ago

I did some brainstorming for "qurator-data.de 2.0" and among the stuff that I find problematic are those archives: It's hard to produce them on the fly. It would actually be better to serve them as Git-LFS or similar. So it would be great if ocrd resmgr could download such a repo.

We have some outage here and I would like to check our original data repo (that I can't easily access right now) first before uploading...

bertsky commented 1 year ago

Note: the Huggingface repo does not help us for the SavedModel format, because it does not provide any option to download entire directories / archives. See explanation here.

cneud commented 1 year ago

It would actually be better to serve them as Git-LFS

This is what HuggingFace does though

git lfs install
git clone git@hf.co:SBB/sbb_binarization
cneud commented 1 year ago

We now provide zipped archives both for HDF5 (qurator-data) and SavedModel (GH releases) format and git lfs both for HDF5 and SavedModel (HuggingFace).

mikegerber commented 1 year ago

We now provide zipped archives both for HDF5 (qurator-data) and SavedModel (GH releases) format and git lfs both for HDF5 and SavedModel (HuggingFace).

So we have 3 repositories all more or less different?

mikegerber commented 1 year ago

Note: the Huggingface repo does not help us for the SavedModel format, because it does not provide any option to download entire directories / archives. See explanation here.

Does it not allow for a checkout via Git-LFS? (Which I consider better than downloading an archive, tbh. Only downside is that ocrd resmgr does not support it yet)

cneud commented 1 year ago

So we have 3 repositories all more or less different?

We had to use the GitHub assets for OCR-D since the SavedModel is not available on qurator-data.de. Otherwise it would be two as usual, qurator-data.de and HuggingFace.

cneud commented 1 year ago

Does it not allow for a checkout via Git-LFS?

https://github.com/qurator-spk/sbb_binarization/issues/55#issuecomment-1507723976

OCR-D resmgr requires an archive though.

mikegerber commented 1 year ago

OCR-D resmgr requires an archive though.

I tried to explain this earlier in this thread: It would be nice if OCR-D resmgr would support it, we could then consider offering our downloads as Git LFS as well. This would nicely solve a few problems (like having to provide an archive)

bertsky commented 1 year ago

OCR-D resmgr requires an archive though.

I tried to explain this earlier in this thread: It would be nice if OCR-D resmgr would support it, we could then consider offering our downloads as Git LFS as well. This would nicely solve a few problems (like having to provide an archive)

The only way to get to the files besides archive download URL and manual browsing is via git checkout. We could start supporting git URLs in resmgr by cloning into a temporary location and extracting the files similar to the archive option (controlled by path_in_archive). But there still is the extra complication of LFS.

@kba, do we want to support this?

kba commented 1 year ago

@kba, do we want to support this?

If we find a reliable pythonic way (possibly https://pypi.org/project/git-lfs/) to implement fetching from Git LFS without users having to install LFS, then yes.

But I think that downloadable archives are a sensible addition in any case to make it easier for people to use the models without knowing Git LFS.

mikegerber commented 1 year ago

But I think that downloadable archives are a sensible addition in any case

The problem with it is that with these file sizes, they are hard to create on-the-fly without opening up to DoS problems. And people who are not able to use Git LFS can still download the individual files.

Why I'd want to do this? I'd like to have the same set of files everywhere to keep stuff in sync (which is a major between-the-keyboard-and-chair problem) and to easily compare. Not zips/tarfiles packed in various ways.

bertsky commented 1 year ago

But I think that downloadable archives are a sensible addition in any case to make it easier for people to use the models without knowing Git LFS.

You meant downloadable repositories, right?