jrmuizel / pdf-extract

A Rust library for extracting content from PDFs

unexpected smask type 168 0 R #74

Closed nbittich closed 10 months ago

nbittich commented 10 months ago

I got this today with a PDF that I unfortunately can't share here:

thread 'tokio-runtime-worker' panicked at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.7.2/src/lib.rs:1230:24:
unexpected smask type 168 0 R

jrmuizel commented 10 months ago

I just landed https://github.com/jrmuizel/pdf-extract/commit/379647e688ad3f693d34fe9c24a43623d0b5ab52. Does that fix it?
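
For readers following the panic message: `168 0 R` is PDF syntax for an indirect reference (object number 168, generation 0), so the value had to be looked up in the document's object table before its type could be checked. The sketch below illustrates that general idea with a toy object model. The `Object`, `SMaskKind`, `resolve`, and `classify_smask` names are hypothetical and are not taken from pdf-extract or the linked commit, whose actual change may differ.

```rust
use std::collections::HashMap;

/// Toy PDF object model (hypothetical; not pdf-extract's real types).
#[derive(Debug, Clone, PartialEq)]
enum Object {
    Name(String),
    Dictionary(HashMap<String, Object>),
    /// An indirect reference such as `168 0 R`: (object number, generation).
    Reference(u32, u16),
}

/// What the soft-mask entry turned out to be after resolution.
#[derive(Debug, PartialEq)]
enum SMaskKind {
    NoMask,
    Dict,
    Unsupported,
}

/// Follow indirect references until a direct object is reached.
/// Returns `None` for a dangling reference or an overly long (possibly cyclic) chain.
fn resolve<'a>(
    objects: &'a HashMap<(u32, u16), Object>,
    mut obj: &'a Object,
) -> Option<&'a Object> {
    for _ in 0..32 {
        match obj {
            Object::Reference(number, generation) => {
                obj = objects.get(&(*number, *generation))?;
            }
            direct => return Some(direct),
        }
    }
    None
}

/// Classify the value only after resolving it, and report unknown
/// shapes as `Unsupported` instead of panicking.
fn classify_smask(objects: &HashMap<(u32, u16), Object>, smask: &Object) -> SMaskKind {
    match resolve(objects, smask) {
        Some(Object::Name(name)) if name == "None" => SMaskKind::NoMask,
        Some(Object::Dictionary(_)) => SMaskKind::Dict,
        _ => SMaskKind::Unsupported,
    }
}

fn main() {
    let mut objects = HashMap::new();
    objects.insert((168, 0), Object::Dictionary(HashMap::new()));
    let smask = Object::Reference(168, 0);
    println!("{:?}", classify_smask(&objects, &smask));
}
```

Returning an `Unsupported` value rather than panicking is a design choice of this sketch: it lets a caller such as a long-running worker skip the unsupported feature and keep extracting text.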

nbittich commented 10 months ago

Thanks!! It does not panic anymore. Unfortunately, it outputs some weird results (missing spaces), e.g.:

§sometextendingwithacomma,sometext~10%sometext.§Afullsentencewithnospace,makingup~sometext.§someothertextwithnospaces.01020304050607080\nA\n g\n r\n i\n c\n u\n l\n t\n u\n r\n e\n B\n i\n o\n s\n c\n i\n e\n n\n c\n e\n\nC\n o\n m\n m\n u\n n\n i\n c\n a\n t\n i\n o\n n\n D\n e\n v\n e\n l\n o\n p\n m\n e\n n\n t\n F\n o\n o\n d\n H\n e\n a\n l\n t\n h\n L\n a\n w\n M\n e\n d\n i\n c\n i\n n\n e\n S\n o\n c\n i\n a\n l\n A\n c\n t\n i\n v\n i\n t\n y\n T\n e\n c\n h\n n\n o\n l\n o\n g\n y\n T\n r\n a\n d\n e\n\nT\n r\n a\n n\n s\n p\n o\n r\n t\n a\n t\n i\n o\n

Not panicking is already great for me, especially since you fixed it so quickly. If you'd like me to try something else, please let me know.

jrmuizel commented 10 months ago

Great.

Unfortunately, to fix the spacing errors, I'll probably need an example PDF.
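
As background on why spacing is hard to debug without the file: PDFs often position glyphs individually and contain no explicit space characters, so text extractors commonly infer spaces from the horizontal gaps between glyphs. The sketch below shows one such gap-threshold heuristic. It is an assumption for illustration only, not a description of how pdf-extract decides; the `Glyph` type, the `join_with_spaces` function, and the `threshold` parameter are all hypothetical.

```rust
/// A positioned glyph on a single line (hypothetical simplified type).
struct Glyph {
    ch: char,
    /// Left edge of the glyph.
    x: f64,
    /// Advance width of the glyph.
    width: f64,
}

/// Join glyphs in order, inserting a space whenever the gap between the
/// end of the previous glyph and the start of the next exceeds `threshold`.
fn join_with_spaces(glyphs: &[Glyph], threshold: f64) -> String {
    let mut out = String::new();
    let mut prev_end: Option<f64> = None;
    for g in glyphs {
        if let Some(end) = prev_end {
            if g.x - end > threshold {
                out.push(' ');
            }
        }
        out.push(g.ch);
        prev_end = Some(g.x + g.width);
    }
    out
}

fn main() {
    let glyphs = [
        Glyph { ch: 'a', x: 0.0, width: 5.0 },
        Glyph { ch: 'b', x: 5.0, width: 5.0 },
        Glyph { ch: 'c', x: 13.0, width: 5.0 },
    ];
    // The gap before 'c' is 3.0, which exceeds the 1.5 threshold.
    println!("{}", join_with_spaces(&glyphs, 1.5)); // prints "ab c"
}
```

With a heuristic like this, a threshold that is too large for a given font size or glyph-width table would swallow every space, which is one way output like the run-together text above could arise. Confirming the actual cause requires the PDF's real glyph positions and widths.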