ropensci / tabulapdf

Bindings for Tabula PDF Table Extractor Library
https://docs.ropensci.org/tabulapdf/
Apache License 2.0
548 stars 71 forks

extracting tables with varying grouping marks (locale issue) #167

Open AndySibov opened 2 months ago

AndySibov commented 2 months ago

I didn't think there would be a package out there for this, thanks!

I was importing a table where the grouping mark is a dot, with values around 10.000. As a result, the extract_table function returns a value such as 1.950 (i.e. 1950) as the double 1.95.

image

Ideally there would be an import option to set a locale() for grouping marks and such.
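As a rough sketch of what such an option might do, assuming the cell text is available as character strings before any numeric conversion happens. `parse_grouped_number` is a hypothetical helper written in base R for illustration, not a tabulapdf function:

```r
# Hypothetical helper (not part of tabulapdf): parse numbers written with a
# custom grouping mark and decimal mark, starting from character input.
parse_grouped_number <- function(x, grouping_mark = ".", decimal_mark = ",") {
  # Drop every grouping mark; fixed = TRUE treats "." literally, not as a regex
  no_groups <- gsub(grouping_mark, "", x, fixed = TRUE)
  # Swap the decimal mark for the "." that as.numeric() expects
  as.numeric(sub(decimal_mark, ".", no_groups, fixed = TRUE))
}

# Values: 1950, 100000 and 12345.6
parsed <- parse_grouped_number(c("1.950", "100.000", "12.345,6"))
```

Working on the text keeps the trailing zeros, so "100.000" is read as 100000 rather than 100.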

Below is a function to recover these imported doubles, but it doesn't work for doubles whose decimals are all zeros (e.g. input 100.00, from original value 100.000, will result in 100).

library(stringr)  # provides str_count()

recover_double_grouping_mark <- function(value, grouping_mark = '.', interval = 1000) {

  dbl_as_char <- as.character(value)

  # Determine the interval: number of digits per group (log10(1000) = 3)
  interval <- log10(interval)

  # Vectorized counting of grouping marks for each element in the vector
  dot_count <- str_count(dbl_as_char, pattern = paste0('\\', grouping_mark))

  # Vectorized finding of the position of the first grouping mark and counting digits before it
  # (grouping_mark is used as a regex here, so the default '.' matches any character and int_count is 0)
  int_count <- sapply(gregexpr(grouping_mark, dbl_as_char), function(x) min(x) - 1)

  # Calculate the difference between expected and actual number of digits for each element
  dif_expected_nchar <- ifelse(dot_count > 0, abs(int_count - (dot_count * interval)), 0)

  # Vectorized adjustment of values where there's a mismatch in character length
  adjusted_values <- ifelse(dif_expected_nchar > 0, value * 10^dif_expected_nchar, value)

  return(adjusted_values)
}
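The all-zeros case probably fails because the trailing zeros are already gone once the value is stored as a double, so there is no dot left to detect. A quick check in base R (the variable names are only for illustration):

```r
# Values as they arrive after being parsed as doubles
parsed <- c(1.950, 100.000)

# A double does not keep trailing zeros, so 100.000 comes back as "100"
as_text <- as.character(parsed)

# Only the first element still contains a dot that a recovery step could use
has_mark <- grepl(".", as_text, fixed = TRUE)
```

Here `as_text` holds "1.95" and "100", so a parsed 100.000 cannot be told apart from a genuine 100.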

pachadotdev commented 1 month ago

to fix your trouble check this solution click maybe this will solve your problem.

LOL no

I opened this in a container and it shows this

image

image

reported and blocked

pachadotdev commented 1 month ago

hi @AndySibov

sorry for the late reply, do you have a real link to the PDF?

if there are no links, my email is in my description

sorry about the idiot that included a phishing link as an answer