ropensci / tabulapdf

Bindings for Tabula PDF Table Extractor Library
https://docs.ropensci.org/tabulapdf/
Apache License 2.0

Fatal R error when attempting to extract text from a PDF that includes a particular mathematical symbol #166

Open tomsutch opened 1 month ago

tomsutch commented 1 month ago

Description

Fatal R error when attempting to use extract_text on a PDF that includes $\bar{x}$. There's no error message, R just terminates.

Reproducible example

I have constructed a simple example PDF, attached xbar.pdf, that gives the error. (I made this using Microsoft Word, inserting the $x$ and $\bar{x}$ using the equation editor, then saving to PDF.)

As this crashes R I can't use the reprex package for this, as far as I know...

library(tabulapdf)

# First try getting the text up to but not including the x-bar
out1 <- extract_text("xbar.pdf", area = list(c(0,0,200,193)))
# This works

# Get the whole text
out2 <- extract_text("xbar.pdf")
# This gives a fatal error

# Get the text for just the x-bar area
out3 <- extract_text("xbar.pdf", area = list(c(0,193,200,210)))
# This gives a fatal error
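
Because a fatal error takes the whole session down, one way to probe calls like the ones above is to run each in a child R process and inspect its exit status. The following is only a sketch using base R; `run_isolated` is a hypothetical helper name, not part of tabulapdf, and the usage lines assume xbar.pdf is in the working directory.

```r
# Run a snippet of R code in a separate Rscript process so that a fatal
# error in the child cannot terminate the calling session.
# `run_isolated` is a hypothetical helper, not a tabulapdf function.
run_isolated <- function(code) {
  # Write the code to a temporary script to avoid shell-quoting problems
  script <- tempfile(fileext = ".R")
  on.exit(unlink(script))
  writeLines(code, script)

  rscript <- file.path(R.home("bin"), "Rscript")
  out <- suppressWarnings(
    system2(rscript, shQuote(script), stdout = TRUE, stderr = TRUE)
  )

  # system2() attaches a "status" attribute only when the exit code is non-zero
  status <- attr(out, "status")
  list(
    status = if (is.null(status)) 0L else as.integer(status),
    output = as.character(out)
  )
}

# Usage (assumes tabulapdf is installed and xbar.pdf is present):
# res <- run_isolated('library(tabulapdf); cat(extract_text("xbar.pdf"))')
# res$status   # expected to be non-zero if the child process crashed
# res$output   # whatever the child printed before exiting
```

The parent session survives whatever happens in the child, so the three calls can be compared side by side.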

Note that if I call the tabula.jar bundled with the R package directly from the command line like this

java -jar C:\Users\<username>\AppData\Local\R\win-library\4.4\tabulapdf\java\tabula.jar xbar.pdf

I get the following output (which is fine for my purposes - I am not particularly concerned about the $\bar{x}$ rendering properly, I just don't want the R session to crash):

Aug 06, 2024 10:03:59 AM org.apache.fontbox.ttf.CmapSubtable processSubtype14
WARNING: Format 14 cmap table is not supported and will be ignored
The mean of x  is denoted ???
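
Since the bundled jar runs without crashing from the command line, a possible stopgap is to invoke it from R as an external process instead of through rJava. This is a sketch under assumptions: `extract_text_cli` is a hypothetical helper, `java` is assumed to be on the PATH, and the jar is assumed to sit in the package's `java` directory as in the path shown above.

```r
# Locate the tabula.jar shipped inside the installed tabulapdf package.
# system.file() returns "" if the package or file cannot be found.
jar <- system.file("java", "tabula.jar", package = "tabulapdf")

# Hypothetical helper: run the jar on a PDF through the `java` executable
# found on the PATH, mirroring the command-line call shown above.
extract_text_cli <- function(pdf, jar_path = jar) {
  stopifnot(nzchar(jar_path), file.exists(jar_path), file.exists(pdf))

  out <- suppressWarnings(
    system2(
      "java",
      c("-jar", shQuote(jar_path), shQuote(pdf)),
      stdout = TRUE,   # capture the extracted output
      stderr = FALSE   # discard log messages such as the fontbox warning
    )
  )

  # A "status" attribute is present only when java exits with a non-zero code
  status <- attr(out, "status")
  list(
    status = if (is.null(status)) 0L else as.integer(status),
    text = as.character(out)
  )
}

# Usage (assumes xbar.pdf is in the working directory):
# res <- extract_text_cli("xbar.pdf")
# res$text
```

Running Java as a separate process keeps any JVM-level failure out of the R session, at the cost of starting a new JVM on every call.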

Expected result

No fatal error: I would expect any issues with reading/rendering the $\bar{x}$ to result in a fallback like putting in '??' or similar.

Session info

R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8   
[3] LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.utf8    

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tabulapdf_1.0.5-3

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        R6_2.5.1          tzdb_0.4.0        magrittr_2.0.3    glue_1.7.0        tibble_3.2.1     
 [7] pkgconfig_2.0.3   png_0.1-8         rJava_1.0-11      lifecycle_1.0.4   readr_2.1.5       cli_3.6.2        
[13] fansi_1.0.6       vctrs_0.6.5       compiler_4.4.0    rstudioapi_0.16.0 tools_4.4.0       hms_1.1.3        
[19] pillar_1.9.0      rlang_1.1.3      

pachadotdev commented 1 month ago

@tomsutch thanks for reporting this. I can fix it next week.

pachadotdev commented 2 weeks ago

@tomsutch it took me longer than expected, but I think I was able to solve it.

pachadotdev commented 2 days ago

Hi @tomsutch, just following up: did the last commit solve the issue?

tomsutch commented 1 day ago

Hi, thanks for looking into this! I can't see a new commit here - please could you point me to it?

pachadotdev commented 1 day ago

> Hi, thanks for looking into this! I can't see a new commit here - please could you point me to it?

Sorry, I realize I never pushed the commit.

I have pushed it now, in dev/.

However, I realize that it fails on Ubuntu, although it worked on Windows when I set UTF-8.
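
When behaviour differs between Ubuntu and Windows like this, it can help to record the locale and encoding settings of each session before comparing results. The snippet below is a base-R sketch only; `encoding_report` is a hypothetical helper, and it does not reproduce whatever "set UTF-8" involved, since that is not specified here.

```r
# Collect the locale and encoding settings of the current R session.
# `encoding_report` is a hypothetical helper for comparing platforms.
encoding_report <- function() {
  info <- l10n_info()
  list(
    platform = R.version$platform,
    lc_ctype = Sys.getlocale("LC_CTYPE"),
    utf8     = isTRUE(info[["UTF-8"]]),   # TRUE in a UTF-8 locale
    codepage = info[["codepage"]]         # only reported on Windows, else NULL
  )
}

# Print a compact summary; run on each machine and compare the output
str(encoding_report())
```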

pachadotdev commented 1 day ago

Hi @jazzido,

@tomsutch found this very interesting case that I can't solve "universally".

Do you have any clues?

I added my test to reproduce the error here: https://github.com/ropensci/tabulapdf/blob/main/dev/test-special_characters.R

and the file here: https://github.com/ropensci/tabulapdf/blob/main/inst/examples/xbar.pdf