Open RaphSte opened 3 months ago
this seems to be the same error as in #24
One comment suggests that the Tika server is not running. How can I verify that?
v0.1.8 and v0.1.7 have this problem for me. v0.1.6 works fine.
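One way to check whether a Tika server is reachable is to test its TCP port. This is only a minimal sketch: it assumes Tika is listening on its default port 9998 and that the check runs in the same network namespace as the server (for example, inside the container). The port nlm-ingestor actually uses is not confirmed in this thread, and `is_port_open` is a hypothetical helper name.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example (assumption: Tika on its default port 9998 on the same host):
# print(is_port_open("localhost", 9998))
```

An open port only shows that something is listening there; it does not prove that the process is Tika or that it is healthy.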
Hey @RaphSte, did the latest version work for you? If so, can you tell me how?
Same issue here, latest version has an issue.
Hey @shshnk158, no, it didn't work for me. I'll use v0.1.6 for now.
Yes, v0.1.6 is working fine, but it comes with tika-server-standard-nlm-modified-2.4.1_v6.jar. I wanted to try the latest jar file, [tika-server-standard-nlm-modified-2.9.2_v1.jar](https://github.com/nlmatics/nlm-ingestor/blob/main/jars/tika-server-standard-nlm-modified-2.9.2_v1.jar). Any suggestions, @ansukla?
yes, facing this issue on v0.1.7 and v0.1.8
The issue is because paragraphs are missing metadata. PR #70 solves this issue.
Until it is merged, you can use it locally with `git fetch origin pull/70/head:PR70` and `git switch PR70`.
I'm facing the same issue, trying now to build the container with PR70.
Container build fails with "Failed to build pandas"
@rednag PR #73 is related, but in my case I just updated requirements.txt to have pandas>=1.2.4
@vitorhirota Changing that to `>=` fixes the pandas error.
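For anyone who wants to script the same change, here is a rough sketch that relaxes an exact pin to a minimum version. It assumes the original line in requirements.txt was an exact pin such as `pandas==1.2.4` (the version mentioned later in this thread). `relax_pin` is a hypothetical helper, and the numpy line in the example is invented for illustration.

```python
import re


def relax_pin(requirements: str, package: str) -> str:
    """Turn 'package==X' into 'package>=X', leaving every other line untouched."""
    pattern = re.compile(
        rf"^(\s*{re.escape(package)}\s*)==",
        flags=re.MULTILINE | re.IGNORECASE,
    )
    return pattern.sub(r"\1>=", requirements)


# Illustrative input only; not the real contents of requirements.txt.
example = "numpy==1.24.4\npandas==1.2.4\n"
print(relax_pin(example, "pandas"))
```

Editing the file by hand works just as well; the point is only that the operator changes and the version number stays.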
Now I'm hitting a problem with `python -m nltk.downloader punkt`.
➜ nlm-ingestor git:(PR70) ✗ docker build --platform=linux/x86_64 -t ohalo-nlm-ingestor .
[+] Building 2.6s (22/24) docker:desktop-linux
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 1.54kB 0.0s
=> resolve image config for docker.io/docker/dockerfile:experimental 0.4s
=> CACHED docker-image://docker.io/docker/dockerfile:experimental@sha256:600e5c62eedff338b3f7a0850beb7c05866e0ef27b2d2e8c02aa468e78496ff5 0.0s
=> [internal] load build definition from Dockerfile 0.0s
=> [internal] load .dockerignore 0.0s
=> [internal] load metadata for docker.io/library/python:3.11-bookworm 0.4s
=> [ 1/16] FROM docker.io/library/python:3.11-bookworm@sha256:4eee56938c2f48480acb90db616162cfa361f5987dd43e1371e5288ed3e5e95e 0.0s
=> => resolve docker.io/library/python:3.11-bookworm@sha256:4eee56938c2f48480acb90db616162cfa361f5987dd43e1371e5288ed3e5e95e 0.0s
=> [internal] load build context 0.1s
=> => transferring context: 186.95kB 0.1s
=> CACHED [ 2/16] RUN apt-get update && apt-get -y --no-install-recommends install libgomp1 0.0s
=> CACHED [ 3/16] RUN mkdir -p /usr/share/man/man1 && apt-get update -y && apt-get install -y openjdk-17-jre-headless 0.0s
=> CACHED [ 4/16] RUN apt-get install -y libxml2-dev libxslt-dev build-essential libmagic-dev 0.0s
=> CACHED [ 5/16] RUN apt-get install -y tesseract-ocr lsb-release && echo "deb https://notesalexp.org/tesseract-ocr5/$(lsb_release -cs)/ $(lsb_release -cs) main" | tee /e 0.0s
=> CACHED [ 6/16] RUN apt-get install unzip -y && apt-get install git -y && apt-get autoremove -y 0.0s
=> CACHED [ 7/16] WORKDIR /app 0.0s
=> CACHED [ 8/16] COPY . ./ 0.0s
=> CACHED [ 9/16] RUN pip install --upgrade pip setuptools 0.0s
=> CACHED [10/16] RUN apt-get install -y libmagic1 0.0s
=> CACHED [11/16] RUN mkdir -p -m 0600 ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts 0.0s
=> CACHED [12/16] RUN pip install -r requirements.txt 0.0s
=> CACHED [13/16] RUN python -m nltk.downloader stopwords 0.0s
=> ERROR [14/16] RUN python -m nltk.downloader punkt 1.6s
------
> [14/16] RUN python -m nltk.downloader punkt:
0.505 <frozen runpy>:128: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
0.874 [nltk_data] Downloading package punkt to /root/nltk_data...
1.525 [nltk_data] Unzipping tokenizers/punkt.zip.
1.526 [nltk_data] Error with downloaded zip file
1.526 Error installing package. Retry? [n/y/e]
1.529 Traceback (most recent call last):
1.529 File "<frozen runpy>", line 198, in _run_module_as_main
1.530 File "<frozen runpy>", line 88, in _run_code
1.530 File "/usr/local/lib/python3.11/site-packages/nltk/downloader.py", line 2537, in <module>
1.532 rv = downloader.download(
1.532 ^^^^^^^^^^^^^^^^^^^^
1.532 File "/usr/local/lib/python3.11/site-packages/nltk/downloader.py", line 790, in download
1.533 choice = input().strip()
1.533 ^^^^^^^
1.534 EOFError: EOF when reading a line
------
Dockerfile:34
--------------------
32 | RUN pip install -r requirements.txt
33 | RUN python -m nltk.downloader stopwords
34 | >>> RUN python -m nltk.downloader punkt
35 | RUN python -c "import tiktoken; tiktoken.get_encoding(\"cl100k_base\")"
36 | RUN chmod +x run.sh
--------------------
ERROR: failed to solve: process "/bin/sh -c python -m nltk.downloader punkt" did not complete successfully: exit code: 1
EDIT: Nevermind, turning on my VPN resolved this issue. I really need to switch ISPs... :)
I can confirm that building the Docker image from PR #70 with pandas>=1.2.4 works, and the container does not show the KeyError. Thanks!
Sorry, this one's on me. First PR was massive code refresh on top of the latest Tika and I missed some key elements that my tests didn't cover. Second PR with jar v2 should resolve it, but waiting on @ansukla or someone to merge here.
Here's the docker image I'm using with everything baked in:
jamesmtc/nlm-ingestor
v0.1.8 and v0.1.7
Are you talking about the nlm-ingestor version? I can't see v0.1.6 there.
@ddose-inferyx yes, this is about the nlm ingestor version. You can either pull the image directly (see ) or build it yourself selecting the tag
I didn't try building it myself, though. I just pulled the image directly and it worked for me.
Thanks, @RaphSte! It started working for me once I used http://localhost:5010/api/parseDocument?renderFormat=all&applyOcr=yes&useNewIndentParser=yes. Adding useNewIndentParser=yes also works with the latest version.
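The URL above can also be assembled from its query parameters instead of being typed by hand. This is just a sketch: the base URL and parameter values are the ones from the comment above, and `build_parse_url` is a hypothetical helper, not part of llmsherpa or nlm-ingestor.

```python
from urllib.parse import urlencode


def build_parse_url(base: str, **params: str) -> str:
    """Append query parameters to the parseDocument endpoint URL."""
    return f"{base}?{urlencode(params)}"


url = build_parse_url(
    "http://localhost:5010/api/parseDocument",
    renderFormat="all",
    applyOcr="yes",
    useNewIndentParser="yes",
)
print(url)
```

The resulting string is what gets passed as the API URL to `LayoutPDFReader`, as in the reproduction steps of this issue.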
Here's the docker image I'm using with everything baked in:
jamesmtc/nlm-ingestor
docker pull ghcr.io/jamesmtc/nlm-ingestor
Error response from daemon: Head "https://ghcr.io/v2/jamesmtc/nlm-ingestor/manifests/latest": denied
Thank you so much. Everything is working fine with this.
-Dipak
I was getting the same "denied" error from docker pull, and I just needed to reset my authentication info for ghcr. I removed any preset ghcr configs, then followed the setup instructions here. After that, running `docker pull jamesmtc/nlm-ingestor:latest` worked fine.
Merging changes from @jamesvillarrubia. Apologies for the delay. Thanks James for putting together the fix. Feel free to send me a note on LinkedIn if something needs attention.
When trying to run a pdf file through it I get the KeyError: 'style', with the following stacktrace:
Steps to reproduce:
(tested on linux server)
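Both methods, local and online, produce the same error:
from llmsherpa.readers import LayoutPDFReader
# llmsherpa_api_url must already be set to the parseDocument endpoint
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"  # online
pdf_url = "./arxiv.org/pdf/1910.13461.pdf"  # or a local copy; use one of the two
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)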