huridocs / pdf-document-layout-analysis

A Docker-powered service for PDF document layout analysis. It segments and classifies the different parts of PDF pages, identifying elements such as text, titles, pictures, and tables.
Apache License 2.0

Hugging face hosted models #96

Open littlebuddha16 opened 6 days ago

littlebuddha16 commented 6 days ago

Access to Hugging Face is restricted on our remote servers. Is there a workaround?

ali6parmak commented 5 days ago

Hi, maybe you can try these steps:

https://github.com/huridocs/pdf-document-layout-analysis

Then, you can

make start

And a new image will be created for you.

Best

littlebuddha16 commented 1 day ago

As I understand it, I should clone the repository, run src/download_models.py on a computer that has access to Hugging Face, then copy the repo to the remote server and run make start. Is that right? If so, where do I need to run the make start command, and when should I run the following docker command?

docker run --rm --name pdf-document-layout-analysis --gpus '"device=0"' -p 5060:5060 --entrypoint ./start.sh huridocs/pdf-document-layout-analysis:v0.0.19

ali6parmak commented 1 day ago

This command runs the already existing docker image from dockerhub:

docker run --rm --name pdf-document-layout-analysis --gpus '"device=0"' -p 5060:5060 --entrypoint ./start.sh huridocs/pdf-document-layout-analysis:v0.0.19

But since your remote servers do not have access to huggingface, when you run this command, the models cannot be downloaded automatically. So, you should not use the above command.
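To confirm that a server really cannot reach Hugging Face, a plain TCP check is often enough. The sketch below is only an illustration and is not part of the project; the function name `can_reach` is hypothetical. It tests a direct connection only, so a server that reaches the internet through an HTTP proxy would need a different check.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers DNS failures, refused connections, and timeouts.
        return False
```

Calling `can_reach("huggingface.co")` on the remote server and getting `False` would be consistent with the restriction described above.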

Instead, you should run src/download_models.py from inside the pdf-document-layout-analysis project directory; this will download the models. After running this script, while you are still in pdf-document-layout-analysis, run make start. This will create a new docker image for you, with the models you have already downloaded.

So, if you download the models, copy the models folder (it is created automatically when you run src/download_models.py) to your remote server, and run make start there, a new image will be created on your remote server too.

You can also refer to building from source, which is only three lines: https://github.com/huridocs/pdf-document-layout-analysis?tab=readme-ov-file#build-from-source
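Before copying the models folder to the remote server, it may help to list what was actually downloaded, so the two machines can be compared. This is a minimal sketch using only the standard library; the helper names are hypothetical, and it makes no assumption about which file names the project expects. It only reports what is present and flags zero-byte files, which may indicate an interrupted download.

```python
from pathlib import Path


def list_model_files(models_dir: str) -> dict[str, int]:
    """Map each file under models_dir (path relative to it) to its size in bytes."""
    root = Path(models_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"models folder not found: {root}")
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_empty_files(models_dir: str) -> list[str]:
    """Return the relative paths of all zero-byte files under models_dir."""
    return [name for name, size in list_model_files(models_dir).items() if size == 0]
```

Running `list_model_files("models")` on both machines and comparing the results is one way to check that the copy is complete.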

littlebuddha16 commented 1 day ago

@ali6parmak Thank you! I'm still getting the following error. Could you help me out?

[+] Building 190.4s (24/24) FINISHED                                                                                               docker:default
 => [pdf-document-layout-analysis-gpu internal] load build definition from Dockerfile                                                        0.0s
 => => transferring dockerfile: 1.04kB                                                                                                       0.0s
 => [pdf-document-layout-analysis-gpu internal] load metadata for docker.io/pytorch/pytorch:2.4.0-cuda11.8-cudnn9-runtime                    0.9s
 => [pdf-document-layout-analysis-gpu internal] load .dockerignore                                                                           0.0s
 => => transferring context: 81B                                                                                                             0.0s
 => [pdf-document-layout-analysis-gpu  1/18] FROM docker.io/pytorch/pytorch:2.4.0-cuda11.8-cudnn9-runtime@sha256:58a28ab734f23561aa146fbaf7  0.0s
 => => resolve docker.io/pytorch/pytorch:2.4.0-cuda11.8-cudnn9-runtime@sha256:58a28ab734f23561aa146fbaf777fb319a953ca1e188832863ed57d510c9f  0.0s
 => => sha256:58a28ab734f23561aa146fbaf777fb319a953ca1e188832863ed57d510c9f197 1.37kB / 1.37kB                                               0.0s
 => => sha256:76e5e98ec29501e94739cafb6daa580774619fa92b6c4d71efade219a23b4b22 4.67kB / 4.67kB                                               0.0s
 => [pdf-document-layout-analysis-gpu internal] load build context                                                                           6.8s
 => => transferring context: 2.55GB                                                                                                          6.8s
 => [pdf-document-layout-analysis-gpu  2/18] RUN apt-get update                                                                              7.3s
 => [pdf-document-layout-analysis-gpu  3/18] RUN apt-get install --fix-missing -y -q --no-install-recommends libgomp1 ffmpeg libsm6 libxex  21.9s 
 => [pdf-document-layout-analysis-gpu  4/18] RUN mkdir -p /app/src                                                                           0.1s 
 => [pdf-document-layout-analysis-gpu  5/18] RUN mkdir -p /app/models                                                                        0.2s 
 => [pdf-document-layout-analysis-gpu  6/18] RUN addgroup --system python && adduser --system --group python                                 0.2s 
 => [pdf-document-layout-analysis-gpu  7/18] RUN chown -R python:python /app                                                                 0.2s 
 => [pdf-document-layout-analysis-gpu  8/18] RUN python -m venv /app/.venv                                                                   2.6s 
 => [pdf-document-layout-analysis-gpu  9/18] COPY requirements.txt requirements.txt                                                          0.0s 
 => [pdf-document-layout-analysis-gpu 10/18] RUN pip install --upgrade pip                                                                   1.6s
 => [pdf-document-layout-analysis-gpu 11/18] RUN pip --default-timeout=1000 install -r requirements.txt                                     89.1s 
 => [pdf-document-layout-analysis-gpu 12/18] WORKDIR /app                                                                                    0.0s 
 => [pdf-document-layout-analysis-gpu 13/18] RUN cd src; git clone https://github.com/facebookresearch/detectron2;                           1.1s 
 => [pdf-document-layout-analysis-gpu 14/18] RUN cd src/detectron2; git checkout 70f454304e1a38378200459dd2dbca0f0f4a5ab4; python setup.py  43.9s 
 => [pdf-document-layout-analysis-gpu 15/18] COPY ./start.sh ./start.sh                                                                      0.0s 
 => [pdf-document-layout-analysis-gpu 16/18] COPY ./src/. ./src                                                                              0.0s 
 => [pdf-document-layout-analysis-gpu 17/18] COPY ./models/. ./models/                                                                       3.3s 
 => [pdf-document-layout-analysis-gpu 18/18] RUN python src/download_models.py                                                               0.4s 
 => [pdf-document-layout-analysis-gpu] exporting to image                                                                                   17.5s 
 => => exporting layers                                                                                                                     17.5s 
 => => writing image sha256:e875c632b2e151974a11b3e1f9a78eadea0ccc2c402545c984ed3bfc75bc7e8c                                                 0.0s
 => => naming to docker.io/library/pdf-document-layout-analysis-pdf-document-layout-analysis-gpu                                             0.0s
 => [pdf-document-layout-analysis-gpu] resolving provenance for metadata file                                                                0.0s
[+] Running 2/1
 ✔ Network pdf-document-layout-analysis_default  Created                                                                                     0.1s 
 ✔ Container pdf-document-layout-analysis        Created                                                                                     0.0s 
Attaching to pdf-document-layout-analysis
pdf-document-layout-analysis  | [2024-11-24 18:24:03 +0000] [7] [INFO] Starting gunicorn 22.0.0
pdf-document-layout-analysis  | [2024-11-24 18:24:03 +0000] [7] [INFO] Listening at: http://0.0.0.0:5060 (7)
pdf-document-layout-analysis  | [2024-11-24 18:24:03 +0000] [7] [INFO] Using worker: uvicorn.workers.UvicornWorker
pdf-document-layout-analysis  | [2024-11-24 18:24:03 +0000] [8] [INFO] Booting worker with pid: 8
pdf-document-layout-analysis  | /app/.venv/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
pdf-document-layout-analysis  |   warnings.warn(
pdf-document-layout-analysis  | [2024-11-24 18:25:06 +0000] [8] [ERROR] Exception in worker process
pdf-document-layout-analysis  | Traceback (most recent call last):
pdf-document-layout-analysis  |   File "/app/.venv/lib/python3.11/site-packages/gunicorn/arbiter.py", line 609, in spawn_worker
pdf-document-layout-analysis  |     worker.init_process()
pdf-document-layout-analysis  |   File "/app/.venv/lib/python3.11/site-packages/uvicorn/workers.py", line 75, in init_process
pdf-document-layout-analysis  |     super().init_process()
pdf-document-layout-analysis  |   File "/app/.venv/lib/python3.11/site-packages/gunicorn/workers/base.py", line 134, in init_process
pdf-document-layout-analysis  |     self.load_wsgi()
pdf-document-layout-analysis  |   File "/app/.venv/lib/python3.11/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi
pdf-document-layout-analysis  |     self.wsgi = self.app.wsgi()
pdf-document-layout-analysis  |                 ^^^^^^^^^^^^^^^
pdf-document-layout-analysis  |   File "/app/.venv/lib/python3.11/site-packages/gunicorn/app/base.py", line 67, in wsgi
pdf-document-layout-analysis  |     self.callable = self.load()
pdf-document-layout-analysis  |                     ^^^^^^^^^^^
pdf-document-layout-analysis  |   File "/app/.venv/lib/python3.11/site-packages/gunicorn/app/wsgiapp.py", line 58, in load
pdf-document-layout-analysis  |     return self.load_wsgiapp()
pdf-document-layout-analysis  |            ^^^^^^^^^^^^^^^^^^^
pdf-document-layout-analysis  |   File "/app/.venv/lib/python3.11/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
pdf-document-layout-analysis  |     return util.import_app(self.app_uri)
pdf-document-layout-analysis  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pdf-document-layout-analysis  |   File "/app/.venv/lib/python3.11/site-packages/gunicorn/util.py", line 371, in import_app
pdf-document-layout-analysis  |     mod = importlib.import_module(module)
pdf-document-layout-analysis  |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pdf-document-layout-analysis  |   File "/opt/conda/lib/python3.11/importlib/__init__.py", line 126, in import_module
pdf-document-layout-analysis  |     return _bootstrap._gcd_import(name[level:], package, level)
pdf-document-layout-analysis  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pdf-document-layout-analysis  |   File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
pdf-document-layout-analysis  |   File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
pdf-document-layout-analysis  |   File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
pdf-document-layout-analysis  |   File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
pdf-document-layout-analysis  |   File "<frozen importlib._bootstrap_external>", line 940, in exec_module
pdf-document-layout-analysis  |   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
pdf-document-layout-analysis  |   File "/app/src/app.py", line 9, in <module>
pdf-document-layout-analysis  |     from pdf_layout_analysis.run_pdf_layout_analysis import analyze_pdf
pdf-document-layout-analysis  |   File "/app/src/pdf_layout_analysis/run_pdf_layout_analysis.py", line 16, in <module>
pdf-document-layout-analysis  |     from vgt.create_word_grid import create_word_grid, remove_word_grids
pdf-document-layout-analysis  |   File "/app/src/vgt/create_word_grid.py", line 14, in <module>
pdf-document-layout-analysis  |     tokenizer = BrosTokenizer.from_pretrained("naver-clova-ocr/bros-base-uncased")
pdf-document-layout-analysis  |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pdf-document-layout-analysis  |   File "/app/.venv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2073, in from_pretrained
pdf-document-layout-analysis  |     raise EnvironmentError(
pdf-document-layout-analysis  | OSError: Can't load tokenizer for 'naver-clova-ocr/bros-base-uncased'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'naver-clova-ocr/bros-base-uncased' is the correct path to a directory containing all relevant files for a BrosTokenizer tokenizer.
pdf-document-layout-analysis  | [2024-11-24 18:25:06 +0000] [8] [INFO] Worker exiting (pid: 8)
pdf-document-layout-analysis  | [2024-11-24 18:25:07 +0000] [7] [ERROR] Worker (pid:8) exited with code 3
pdf-document-layout-analysis  | [2024-11-24 18:25:07 +0000] [7] [ERROR] Shutting down: Master
pdf-document-layout-analysis  | [2024-11-24 18:25:07 +0000] [7] [ERROR] Reason: Worker failed to boot.

ali6parmak commented 1 day ago

[image: expected contents of the models folder]

It looks like there is a problem with the models downloaded from Hugging Face. Try running download_models.py again and make sure your models folder looks like the one in the image.

littlebuddha16 commented 4 hours ago

Same error message @ali6parmak :(

It looks like it's unable to load the tokenizer below, naver-clova-ocr/bros-base-uncased:

  from pdf_layout_analysis.run_pdf_layout_analysis import analyze_pdf
pdf-document-layout-analysis  |   File "/app/src/pdf_layout_analysis/run_pdf_layout_analysis.py", line 16, in <module>
pdf-document-layout-analysis  |     from vgt.create_word_grid import create_word_grid, remove_word_grids
pdf-document-layout-analysis  |   File "/app/src/vgt/create_word_grid.py", line 14, in <module>
pdf-document-layout-analysis  |     tokenizer = BrosTokenizer.from_pretrained("naver-clova-ocr/bros-base-uncased")
pdf-document-layout-analysis  |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pdf-document-layout-analysis  |   File "/app/.venv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2073, in from_pretrained
pdf-document-layout-analysis  |     raise EnvironmentError(
pdf-document-layout-analysis  | OSError: Can't load tokenizer for 'naver-clova-ocr/bros-base-uncased'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'naver-clova-ocr/bros-base-uncased' is the correct path to a directory containing all relevant files for a BrosTokenizer tokenizer.
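
The traceback points at the tokenizer rather than at the files in the models folder: create_word_grid.py loads it by its Hugging Face ID at import time. One possible direction, sketched below under stated assumptions, is to keep a local copy of the tokenizer files and pass that directory to `from_pretrained`, which the error message itself says may be "a directory containing all relevant files". The helper `resolve_pretrained_source` and the `models/<id>` layout are hypothetical and not part of the project, and whether the project can be changed this way has not been confirmed in this thread.

```python
from pathlib import Path


def resolve_pretrained_source(model_id: str, local_root: str) -> str:
    """Return a local directory for model_id if one exists under local_root;
    otherwise return model_id unchanged so the caller falls back to the hub."""
    candidate = Path(local_root) / model_id
    # A usable directory must exist and contain at least one file.
    if candidate.is_dir() and any(p.is_file() for p in candidate.iterdir()):
        return str(candidate)
    return model_id


# Assumption: on a machine WITH access, the files could be fetched first, e.g.
#   from huggingface_hub import snapshot_download
#   snapshot_download(repo_id="naver-clova-ocr/bros-base-uncased",
#                     local_dir="models/naver-clova-ocr/bros-base-uncased")
# and the resulting folder copied to the remote server with the other models.
```

With such a folder in place, the call in create_word_grid.py could in principle become `BrosTokenizer.from_pretrained(resolve_pretrained_source("naver-clova-ocr/bros-base-uncased", "models"))`, though that edit is untested here.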