Closed bertsky closed 1 year ago
The clean saved model can be obtained from here.
@cneud do you know of any way to extract a directory or archive URL from the HF repo browser? (We need that for ocrd resmgr integration...)
The clean saved model can be obtained from here.
@cneud do you know of any way to extract a directory or archive URL from the HF repo browser? (We need that for ocrd resmgr integration...)
We could implement a git-lfs
resource type similar to archive
but instead of download/extract, it downloads the models via LFS. If we anticipate huggingface as the primary source for models in the QURATOR/MMK projects, this might be worth the effort.
But for now, can we fix the model in qurator-data.de
until this is in place?
I did some brainstorming for "qurator-data.de 2.0" and among the stuff that I find problematic are those archives: It's hard to produce them on the fly. It would actually be better to serve them as Git-LFS or similar. So it would be great if ocrd resmgr
could download such a repo.
We have some outage here and I would like to check our original data repo (that I can't easily access right now) first before uploading...
Note: the Huggingface repo does not help us for the SavedModel format, because it does not provide any option to download entire directories / archives. See explanation here.
It would actually be better to serve them as Git-LFS
This is what HuggingFace does though
git lfs install
git clone git@hf.co:SBB/sbb_binarization
We now provide zipped archives both for HDF5 (qurator-data) and SavedModel (GH releases) format and git lfs both for HDF5 and SavedModel (HuggingFace).
We now provide zipped archives both for HDF5 (qurator-data) and SavedModel (GH releases) format and git lfs both for HDF5 and SavedModel (HuggingFace).
So we have 3 repositories all more or less different?
Note: the Huggingface repo does not help us for the SavedModel format, because it does not provide any option to download entire directories / archives. See explanation here.
Does it not allow for a checkout via Git-LFS? (Which I consider better than downloading an archive, tbh. Only downside is that ocrd resmgr does not support it yet)
So we have 3 repositories all more or less different?
We had to use the GitHub assets for OCR-D since the SavedModel is not available on qurator-data.de. Otherwise it would be two as usual, qurator-data.de and HuggingFace.
Does it not allow for a checkout via Git-LFS?
https://github.com/qurator-spk/sbb_binarization/issues/55#issuecomment-1507723976
OCR-D resmgr requires an archive though.
OCR-D resmgr requires an archive though.
I tried to explain this earlier in this thread: It would be nice if OCR-D resmgr would support it, we could then consider offering our downloads as Git LFS as well. This would nicely solve a few problems (like having to provide an archive)
OCR-D resmgr requires an archive though.
I tried to explain this earlier in this thread: It would be nice if OCR-D resmgr would support it, we could then consider offering our downloads as Git LFS as well. This would nicely solve a few problems (like having to provide an archive)
The only way to get to the files besides archive download URL and manual browsing is via git checkout. We could start supporting git URLs in resmgr by cloning into a temporary location and extracting the files similar to the archive option (controlled by path_in_archive
). But there still is the extra complication of LFS.
@kba, do we want to support this?
@kba, do we want to support this?
If we find a reliable pythonic way (possibly https://pypi.org/project/git-lfs/) to implement fetching from Git LFS without users having to install LFS, then yes.
But I think that downloadable archives are a sensible addition in any case to make it easier for people to use the models without knowing Git LFS.
But I think that downloadable archives are a sensible addition in any case
The problem with it is that with these file sizes, they are hard to create on-the-fly without opening up to DoS problems. And people who are not able to use Git LFS can still download the individual files.
Why I'd want to do this? I'd like to have the same set of files everywhere to keep stuff in sync (which is a major between-the-keyboard-and-chair problem) and to easily compare. Not zips/tarfiles packed in various ways.
But I think that downloadable archives are a sensible addition in any case to make it easier for people to use the models without knowing Git LFS.
You meant downloadable repositories, right?
The provided model_2021_03_09 contains a bogus directory entry
__MACOSX
which causes (with the OCR-D wrapper):I don't think the code responsible for path name resolution is wrong. The bogus directory simply should not exist there. So better fix the zip file, right?