tidyverse / readr

Read flat files (csv, tsv, fwf) into R
https://readr.tidyverse.org

Parse number cannot recognize number #1507

Open kjayhan opened 11 months ago

kjayhan commented 11 months ago

I extracted some data from a Chinese pdf file.

The numbers in the columns are extracted as follows (for example): －１２２, ２９４５８, ９.

I copy-pasted the output of some cells. However, these characters are not the same as -122, 29458, 9, respectively.

Hence parse_number() produces NA in all of these cases.

Any suggestions regarding what I should do?

This is the pdf file in question: http://images.mofcom.gov.cn/fec/202211/20221118091910924.pdf

I extracted the data from page 49 (53rd page of the pdf file), using the following code:

library(tidyverse)
library(pdftools)

file <- tempfile()

url <- "http://images.mofcom.gov.cn/fec/202211/20221118091910924.pdf"

download.file(url, file, headers = c("User-Agent" = "My Custom User Agent"))

pdf_data <- pdf_text(file)

replace_spaces_and_commas <- function(x) {
  str_replace_all(x, "[ ,]", "")
}

pdf <- pdf_data[53:71]

tab_pdf <- str_split(pdf, "\n")

for (i in 1:19) {
  assign(paste0("tab_pdf_", i), tab_pdf[[i]])
}

the_names <- c("country", "year_2013", "year_2014", "year_2015", "year_2016", "year_2017", "year_2018", "year_2019", "year_2020", "year_2021")

view(tab_pdf_1)

pdf_clean1 <- tab_pdf_1[14:60] %>%
  str_trim() %>%
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>%
  mutate(across(everything(), replace_spaces_and_commas)) %>%
  filter(country != "")

I tried both as.numeric(pdf_clean1$year_2013) and parse_number(pdf_clean1$year_2013).

Both produced NAs, because "９" == "9", "－１２２" == "-122", and "２９４５８" == "29458" all evaluate to FALSE.
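One way to see why such comparisons fail is to look at the code points behind the characters. The following is only a minimal sketch: the sample string is hypothetical (written with \u escapes rather than copied from the PDF) and assumes the cells contain fullwidth Unicode forms of the digits and the minus sign.

```r
# Hypothetical sample value: fullwidth "-122" written with \u escapes
# (assumption: the PDF cells hold fullwidth forms like these)
x <- "\uFF0D\uFF11\uFF12\uFF12"

# Compare with the ASCII string that looks the same on screen
same_as_ascii <- x == "-122"

# Code points of the extracted value and of the ASCII version
codes_extracted <- utf8ToInt(x)
codes_ascii <- utf8ToInt("-122")

# as.numeric() only understands ASCII digits, so this yields NA
as_number <- suppressWarnings(as.numeric(x))

print(same_as_ascii)
print(codes_extracted)
print(codes_ascii)
print(as_number)
```

If the code points fall in the range 65296 to 65305 (or 65293 for the minus sign), the values are fullwidth characters rather than ASCII ones, which would explain the NAs.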

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4.1

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

other attached packages:
[1] countrycode_1.5.0 magrittr_2.0.3 pdftools_3.3.3
[4] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
[7] dplyr_1.1.2 purrr_1.0.1 readr_2.1.4
[10] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2
[13] tidyverse_2.0.0

loaded via a namespace (and not attached):
[1] gtable_0.3.3 compiler_4.3.1 qpdf_1.3.2
[4] tidyselect_1.2.0 Rcpp_1.0.11 scales_1.2.1
[7] R6_2.5.1 generics_0.1.3 knitr_1.42
[10] munsell_0.5.0 pillar_1.9.0 tzdb_0.4.0
[13] rlang_1.1.1 utf8_1.2.3 stringi_1.7.12
[16] xfun_0.39 timechange_0.2.0 cli_3.6.1
[19] withr_2.5.0 grid_4.3.1 rstudioapi_0.15.0
[22] hms_1.1.3 askpass_1.1 lifecycle_1.0.3
[25] vctrs_0.6.3 glue_1.6.2 fansi_1.0.4
[28] colorspace_2.1-0 tools_4.3.1 pkgconfig_2.0.3

kjayhan commented 11 months ago

With the help of a Stack Overflow user and ChatGPT, I found a solution. Posting it here in case someone else has the same problem:

convert_fullwidth_to_numeric <- function(input_str) {
  utf8_codes <- utf8ToInt(input_str)

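  # Map the fullwidth hyphen-minus (－, U+FF0D = 65293) to the ASCII minus sign (45)
  utf8_codes <- ifelse(utf8_codes == 65293, 45, utf8_codes)

  # Shift fullwidth digits ０ to ９ (65296 to 65305) down by 65248 to ASCII 0 to 9 (48 to 57)
  converted_utf8_codes <- ifelse(utf8_codes >= 65296 & utf8_codes <= 65305, utf8_codes - 65248, utf8_codes)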
  converted_chars <- intToUtf8(converted_utf8_codes)
  converted_numeric <- as.numeric(converted_chars)
  return(converted_numeric)
}

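# Apply the function to columns 2 to 10 (adjust the indices as needed).
# Whole columns are replaced at once so that they end up numeric; assigning
# cell by cell into a character column would coerce the values back to strings.
columns_to_transform <- 2:10

pdf_clean1[columns_to_transform] <- lapply(
  pdf_clean1[columns_to_transform],
  function(column) vapply(column, convert_fullwidth_to_numeric, numeric(1), USE.NAMES = FALSE)
)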

https://stackoverflow.com/questions/76895064/number-as-character-cannot-be-converted-to-numeric-in-r
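As an alternative to calling the function on one string at a time, the same mapping could probably be done on whole vectors with base R's chartr(). This is only a sketch under assumptions: the helper name fullwidth_to_ascii is made up for illustration, the sample values are hypothetical, and it only covers fullwidth digits, the fullwidth hyphen-minus (U+FF0D), and the fullwidth full stop (U+FF0E).

```r
# Hypothetical helper (not part of readr or the original solution):
# translate fullwidth digits, minus sign and full stop to their ASCII forms
fullwidth_to_ascii <- function(x) {
  chartr(
    "\uFF10\uFF11\uFF12\uFF13\uFF14\uFF15\uFF16\uFF17\uFF18\uFF19\uFF0D\uFF0E",
    "0123456789-.",
    x
  )
}

# Hypothetical sample values: fullwidth "-122", "29458", "9", plus a missing value
x <- c("\uFF0D\uFF11\uFF12\uFF12", "\uFF12\uFF19\uFF14\uFF15\uFF18", "\uFF19", NA)

# chartr() is vectorised, so the whole vector is converted in one call
converted <- as.numeric(fullwidth_to_ascii(x))
print(converted)

# On the data frame from this thread, this could be applied column-wise, e.g.:
# pdf_clean1[2:10] <- lapply(pdf_clean1[2:10],
#                            function(column) as.numeric(fullwidth_to_ascii(column)))
```

Strings that are already ASCII pass through unchanged, so the helper should be safe to run on mixed columns.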