jrmuizel / pdf-extract

A Rust library for extracting content from PDFs
396 stars, 78 forks

Text result is split by spacing #79

Closed Implocell closed 8 months ago

Implocell commented 8 months ago

When I run a PDF through the library, it returns the text, but with odd spacing between the characters. Is this intended, or am I doing something wrong? This is just a random PDF from the Wikipedia PDF API.

Code:

pub fn read_pdf() {
    // Read the whole PDF into memory.
    let bytes = std::fs::read("2005_BBC_strike.pdf").unwrap();
    // Extract the text from the in-memory bytes.
    let out = pdf_extract::extract_text_from_mem(&bytes).unwrap();

    println!("{}", out);
}

Result:

2 0 0 5  B B C  s t r ik e

T he   2005  B B C   s tr ik e   w a s   a   s trike   of   m or e   tha n  1 1,000  B B C   w or ke rs ,  ove r  a   pr opos a l  to  c ut  4,000  jobs ,

a nd  to  pr iva tis e   pa rts   of   the   B B C   unde r  the   m a na ge m e nt   of   M a rk  T hom ps on . [ 1 ] [ 2 ]   M uc h  of   B B C 's   re gula r

pr ogr a m m ing  w a s   a ffe c te d,  w ith  m a ny   pr ogr a m s   be ing  re pl a c e d  w ith  pr e -re c or dings ,  a nd  s om e   be ing

jrmuizel commented 8 months ago

This isn't intended. Can you provide a link to the pdf?

Implocell commented 8 months ago

Here is the file in question. I've tried other files that do not come from Wikipedia, with the same result. 2005_BBC_strike.pdf

Implocell commented 8 months ago

Some files also return the following error: "unexpected smask type 933 0 R". Do you want an example of one of these as well? If there is anything I can do, please let me know.

jrmuizel commented 8 months ago

Yes, a link for the smask problem would be great too.

Implocell commented 8 months ago

This one was created by me in Word (I think) and exported. It contains just a simple line of text. TOP HEMMELIG.pdf

jrmuizel commented 8 months ago

Do you see the smask problem in TOP.HEMMELIG.pdf? It works for me.

jrmuizel commented 8 months ago

It looks like there's a problem with the text transforms that might be causing this.
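
For readers wondering how a transform problem could produce output like "s t r ik e", here is a minimal, self-contained sketch. It is not pdf-extract's actual code, and the thread does not say exactly what was wrong. The `Glyph` type and the `layout` and `join_glyphs` functions are hypothetical names invented for this illustration. The sketch assumes an extractor that inserts a space whenever the gap between two consecutive glyphs exceeds a threshold. Under that assumption, if glyph positions are scaled by a different factor than glyph widths, every gap looks large and a space appears between most letters.

```rust
/// A glyph as a hypothetical extractor might see it after applying
/// the text transforms (illustrative type, not from pdf-extract).
#[derive(Debug, Clone, Copy)]
struct Glyph {
    ch: char,
    x: f64,       // left edge of the glyph
    advance: f64, // width of the glyph, in the same units as `x`
}

/// Lays out `text` left to right. Each glyph is `advance` wide, and
/// positions are multiplied by `position_scale`. A scale of 1.0 means
/// positions and widths agree; any other value simulates a transform
/// applied to positions but not to widths.
fn layout(text: &str, advance: f64, position_scale: f64) -> Vec<Glyph> {
    text.chars()
        .enumerate()
        .map(|(i, ch)| Glyph {
            ch,
            x: i as f64 * advance * position_scale,
            advance,
        })
        .collect()
}

/// Joins glyphs into a string, inserting a space whenever the gap
/// between the end of one glyph and the start of the next exceeds
/// `threshold`.
fn join_glyphs(glyphs: &[Glyph], threshold: f64) -> String {
    let mut out = String::new();
    let mut prev_end: Option<f64> = None;
    for g in glyphs {
        if let Some(end) = prev_end {
            if g.x - end > threshold {
                out.push(' ');
            }
        }
        out.push(g.ch);
        prev_end = Some(g.x + g.advance);
    }
    out
}

fn main() {
    // Positions and widths agree: glyphs touch, so no spaces are added.
    let consistent = layout("strike", 5.0, 1.0);
    println!("{}", join_glyphs(&consistent, 1.0));

    // Positions are scaled 2x but widths are not: every gap is 5 units,
    // which exceeds the threshold, so a space lands between each letter.
    let mis_scaled = layout("strike", 5.0, 2.0);
    println!("{}", join_glyphs(&mis_scaled, 1.0));
}
```

In this toy model the first call prints the word intact and the second prints it with a space between every letter, which resembles the reported output. The real extractor's spacing heuristic may well differ.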

jrmuizel commented 8 months ago

Fixed by bf92c9b59c0bcc1d4a2ac1761dfd3fd913987987