nlmatics / nlm-ingestor

This repo provides the server-side code that the llmsherpa API connects to. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0
1.1k stars, 158 forks

Lost pages #55

Open sailxjx opened 7 months ago

sailxjx commented 7 months ago

pythonlearn.pdf

I used a local Docker server to parse the document above, which has 239 pages. However, the ingestor parsed only 158 pages and discarded the remaining content. Is this a bug?

Here are the logs:

processing page: 140
Number of p_tags.... 178
processing page: 141
Number of p_tags.... 4
processing page: 142
Number of p_tags.... 251
processing page: 143
Number of p_tags.... 303
processing page: 144
Number of p_tags.... 322
processing page: 145
Number of p_tags.... 287
processing page: 146
Number of p_tags.... 330
processing page: 147
Number of p_tags.... 308
processing page: 148
Number of p_tags.... 265
processing page: 149
Number of p_tags.... 312
processing page: 150
Number of p_tags.... 298
processing page: 151
Number of p_tags.... 346
processing page: 152
Number of p_tags.... 412
processing page: 153
Number of p_tags.... 287
processing page: 154
Number of p_tags.... 193
processing page: 155
Number of p_tags.... 5
processing page: 156
192.168.65.1 - - [18/Apr/2024 14:24:54] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
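
To pin down where parsing stops, it may help to pull the page indices out of the server log and compare the last one against the PDF's page count. The following is a minimal sketch, assuming the log contains entries of the form `processing page: N` as in the excerpt above. `summarize_pages` is a hypothetical helper written for this issue, not part of nlm-ingestor or llmsherpa.

```python
import re

# Matches the "processing page: N" entries seen in the log excerpt.
# Assumption: every processed page emits exactly this message.
PAGE_RE = re.compile(r"processing page:\s*(\d+)")


def summarize_pages(log_text):
    """Return (first_index, last_index, gaps) for the page indices in a log.

    `gaps` lists indices between first and last that never appear.
    Returns (None, None, []) when no page entries are found.
    """
    pages = sorted({int(n) for n in PAGE_RE.findall(log_text)})
    if not pages:
        return None, None, []
    seen = set(pages)
    gaps = [i for i in range(pages[0], pages[-1] + 1) if i not in seen]
    return pages[0], pages[-1], gaps


if __name__ == "__main__":
    # Short made-up sample in the same shape as the real log.
    sample = (
        "processing page: 140 Number of p_tags.... 178 "
        "processing page: 141 Number of p_tags.... 4 "
        "processing page: 143 Number of p_tags.... 303"
    )
    first, last, gaps = summarize_pages(sample)
    print(f"first={first} last={last} gaps={gaps}")
```

Run against the full server log, a last index well below the document's 239 pages would suggest the ingestor stopped early rather than skipping pages in the middle. Whether the logged index is zero-based or one-based is not stated in the log, so treat an off-by-one difference with caution.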