Open kjayhan opened 11 months ago
I found a solution with the help of a Stack Overflow user and ChatGPT. Posting it here in case someone else has the same problem:
convert_fullwidth_to_numeric <- function(input_str) {
  # Unicode code points of the characters in a single input string
  utf8_codes <- utf8ToInt(input_str)
  # Handle the fullwidth minus sign (U+FF0D = 65293) separately: map it to ASCII "-" (45)
  utf8_codes <- ifelse(utf8_codes == 65293, 45, utf8_codes)
  # Fullwidth digits are U+FF10..U+FF19 (65296..65305); subtracting 65248 gives ASCII "0".."9" (48..57)
  converted_utf8_codes <- ifelse(utf8_codes >= 65296 & utf8_codes <= 65305, utf8_codes - 65248, utf8_codes)
  converted_chars <- intToUtf8(converted_utf8_codes)
  converted_numeric <- as.numeric(converted_chars)
  return(converted_numeric)
}
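# Apply the function to specified columns (columns 2 to 10)
columns_to_transform <- 2:10 # Adjust column indices as needed
for (col in columns_to_transform) {
  # Convert each cell, then replace the whole column so that it ends up numeric
  # (assigning cell by cell into a character column coerces the numbers back to character)
  pdf_clean1[[col]] <- vapply(
    as.character(pdf_clean1[[col]]),
    convert_fullwidth_to_numeric,
    numeric(1),
    USE.NAMES = FALSE
  )
}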
https://stackoverflow.com/questions/76895064/number-as-character-cannot-be-converted-to-numeric-in-r
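The function above converts one string at a time. As a possible alternative, here is a minimal base R sketch that converts a whole character vector in one call with chartr(). The helper name fullwidth_to_numeric is hypothetical (it is not from the Stack Overflow answer), and the sketch additionally assumes that a fullwidth full stop (U+FF0E) should be read as a decimal point. The strings are written with \u escapes so the code does not depend on the file encoding.

```r
# Hypothetical helper, not part of the original solution.
# Translates fullwidth digits, minus sign and full stop to ASCII, then parses.
fullwidth_to_numeric <- function(x) {
  # U+FF0E (fullwidth full stop), U+FF10..U+FF19 (fullwidth 0-9), U+FF0D (fullwidth minus)
  old <- "\uff0e\uff10\uff11\uff12\uff13\uff14\uff15\uff16\uff17\uff18\uff19\uff0d"
  # "-" is placed last so that chartr() does not read it as a range such as "a-z"
  new <- ".0123456789-"
  as.numeric(chartr(old, new, as.character(x)))
}

# The three example values from the question, written as fullwidth characters
example_cells <- c("\uff0d\uff11\uff12\uff12", "\uff12\uff19\uff14\uff15\uff18", "\uff19")
fullwidth_to_numeric(example_cells)
# [1]  -122 29458     9
```

With a data frame like pdf_clean1, this could be applied per column, e.g. pdf_clean1[2:10] <- lapply(pdf_clean1[2:10], fullwidth_to_numeric), assuming columns 2 to 10 are the ones holding the numbers.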
I extracted some data from a Chinese pdf file.
The numbers in the columns are extracted as follows (for example): －１２２, ２９４５８, ９.
I copied and pasted the output of some cells. However, these characters are not the same as the ASCII strings -122, 29458, and 9, respectively.
Hence parse_number() produces NA in all of these cases.
Any suggestions regarding what I should do?
This is the pdf file in question: http://images.mofcom.gov.cn/fec/202211/20221118091910924.pdf
I extracted the data from page 49 (53rd page of the pdf file), using the following code:
I tried both as.numeric() and parse_number(), e.g.,
as.numeric(pdf_clean1$year_2013)
and
parse_number(pdf_clean$year_2013)
Both produced NAs, because all of
"９" == "9"
"－１２２" == "-122"
"２９４５８" == "29458"
evaluate to FALSE.
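For anyone trying to diagnose the same thing: a quick way to see why strings that look alike compare as unequal is to inspect their Unicode code points with utf8ToInt(). The sketch below assumes the extracted cells contain fullwidth forms (U+FF10..U+FF19 for the digits, U+FF0D for the minus sign), which is what the conversion function in this thread targets. The variable names are only for illustration.

```r
# Fullwidth versions of the example values, written with \u escapes
fullwidth <- c("\uff19", "\uff0d\uff11\uff12\uff12", "\uff12\uff19\uff14\uff15\uff18")
ascii     <- c("9", "-122", "29458")

# Visually similar, but not equal
comparison <- fullwidth == ascii   # FALSE FALSE FALSE

# Code points differ: fullwidth nine is 65305, ASCII "9" is 57
utf8ToInt(fullwidth[1])
utf8ToInt(ascii[1])

# Every fullwidth character here sits exactly 65248 above its ASCII counterpart
offsets <- mapply(function(f, a) utf8ToInt(f) - utf8ToInt(a),
                  fullwidth, ascii, SIMPLIFY = FALSE)
unique(unlist(offsets))            # 65248
```

That constant offset is why subtracting 65248 from the code points recovers the ASCII digits and minus sign.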