jrmuizel / pdf-extract

A rust library for extracting content from pdfs

add example document where characters of extracted text are poorly spaced #69

Closed. sftse closed 9 months ago

sftse commented 9 months ago

When this page is extracted, ",isdefinedas" appears in the output.

Surprisingly, although the underlying content stream contains no spacing information, pdfium correctly identifies where to insert spaces and extracts the correct text: ", is defined as".

This is what the content stream looks like: "[(,i)334(sd)333.1(e)331.3('fi' ligature)334.9(n)333.1(e)331.3(da)331.3(s)]TJ"
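To make the problem concrete, here is a minimal sketch in Rust that transcribes the TJ array above by hand and runs a naive, threshold-based space heuristic over it. The names `TjItem`, `issue_array`, `extract`, and `adjustment_range` are hypothetical and are not part of pdf-extract or pdfium, and writing the ligature as plain "fi" is an assumption about how the glyph maps to Unicode. It only illustrates why the adjustments alone cannot separate words here, not how either library actually works.

```rust
/// One element of a PDF `TJ` array: a shown string or a numeric
/// position adjustment between strings.
#[derive(Debug, Clone, PartialEq)]
enum TjItem {
    Text(&'static str),
    Adjust(f64),
}

/// The array from this issue, transcribed by hand.
/// Assumption: the 'fi' ligature glyph is written as plain "fi".
fn issue_array() -> Vec<TjItem> {
    use TjItem::*;
    vec![
        Text(",i"), Adjust(334.0),
        Text("sd"), Adjust(333.1),
        Text("e"), Adjust(331.3),
        Text("fi"), Adjust(334.9),
        Text("n"), Adjust(333.1),
        Text("e"), Adjust(331.3),
        Text("da"), Adjust(331.3),
        Text("s"),
    ]
}

/// Hypothetical heuristic: concatenate the strings and insert a space
/// wherever the magnitude of an adjustment exceeds `threshold`.
fn extract(items: &[TjItem], threshold: f64) -> String {
    let mut out = String::new();
    for item in items {
        match item {
            TjItem::Text(s) => out.push_str(s),
            TjItem::Adjust(a) if a.abs() > threshold => out.push(' '),
            TjItem::Adjust(_) => {}
        }
    }
    out
}

/// Smallest and largest adjustment in the array.
fn adjustment_range(items: &[TjItem]) -> (f64, f64) {
    items.iter().fold((f64::INFINITY, f64::NEG_INFINITY), |(lo, hi), item| {
        match item {
            TjItem::Adjust(a) => (lo.min(*a), hi.max(*a)),
            TjItem::Text(_) => (lo, hi),
        }
    })
}
```

All seven adjustments fall between 331.3 and 334.9, so no single threshold helps: `extract(&issue_array(), 400.0)` yields ",isdefinedas", while `extract(&issue_array(), 300.0)` yields ",i sd e fi n e da s". Recovering ", is defined as" therefore needs information beyond the numbers in this array.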

sftse commented 9 months ago

I'm closing this until I can find a better way of minimizing the example.