ropensci / bib2df

Parse a BibTeX file to a tibble
https://docs.ropensci.org/bib2df

UTF-8 encoding #44

Open Mihiretukebede opened 4 years ago

Mihiretukebede commented 4 years ago

I have difficulties reading some of my reference files. R Markdown reports an error in the UTF-8 encoding of my BibTeX files. Please add support for UTF-8 encoding. This would be great!

mpr1255 commented 3 years ago

Second this! It would make it really really great. I work with a lot of Chinese-language bibliographical data, and df2bib always breaks the characters... I just spent 40 minutes trying to tinker with the code, randomly throwing some fields <- map(fields, ~as_utf8(.x)) just to see if it would do anything... No luck. I would love for this to work. Unfortunately I don't yet know what I don't know in order to try to help.

Note that when using the locale "chs", characters come out like this: ÄϾ©¹ÄÂ¥Ò½Ôº¸ÎÔàÒÆÖ²ÖÐÐÄ

When I use the "C" locale, they come out like this: <U+5357><U+4EAC><U+9F13><U+697C><U+533B> (and I can't really figure out how to turn them back into characters easily.)
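One possible way to turn `<U+XXXX>` escapes back into characters, as a minimal base-R sketch. The helper name `unescape_unicode` is hypothetical and not part of bib2df; whether the result then prints correctly still depends on the console and locale.

```r
# Hypothetical helper: convert "<U+5357>"-style escapes back to characters.
unescape_unicode <- function(x) {
  # Locate every <U+XXXX> token (4 to 6 hex digits) in each string
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,6}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    # Strip the "<U+" and ">" wrappers, then read the rest as a hex code point
    codes <- strtoi(gsub("<U\\+|>", "", tokens), base = 16L)
    # Turn each code point into a one-character UTF-8 string
    vapply(codes, intToUtf8, character(1))
  })
  x
}

fixed <- unescape_unicode("<U+5357><U+4EAC><U+9F13><U+697C><U+533B>")
nchar(fixed)  # number of characters after conversion
```

Strings without any escapes pass through unchanged, so the helper can be applied to a whole column.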

GilmourR commented 3 years ago

My tests some time ago suggested the problem was in humaniformat rather than in bib2df itself.

mpr1255 commented 3 years ago

I see. Any idea how to fix? Current workarounds are getting clunky, if they work at all...

GilmourR commented 3 years ago

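I clean the file like this:

    library(stringr)
    library(bib2df)

    # Read the raw lines and mark them as UTF-8
    bsf.df <- readLines(filename.df, encoding = "UTF-8")
    # Replace every non-printing character (tabs, control characters) with a space
    bsf.df <- str_replace_all(bsf.df, "[^[:graph:]]", " ")
    # Transliterate to ASCII; this is lossy and the result is platform-dependent
    bsf.df <- iconv(bsf.df, from = "UTF-8", to = "ASCII//TRANSLIT")
    outfile <- "bsfdf.bib"
    writeLines(bsf.df, con = outfile)

Then:

    # Parse the cleaned copy rather than the original file
    bib2df(outfile, separate_names = TRUE)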

harkanatta commented 3 years ago

What worked for me was to make small changes in two functions of the package: bib2df_read and bib2df_tidy.

In the former function I set the encoding argument to UTF-8:

    readLines(file, encoding = "UTF-8")

In the latter function there are two lapply calls, and I added enc2native() %>% to both of them, like so:

    bib$AUTHOR <- lapply(bib$AUTHOR, function(x) x %>%
      enc2native() %>%
      format_reverse() %>%
      format_period() %>%
      parse_names())

To do this you have to download the code and look for the functions in a folder named R. You will also need the function text_between_curly_brackets:

    text_between_curly_brackets <- function(string) {
      # Position of the first "{" and the last "}" in the line
      min <- min(gregexpr("\\{", string)[[1]])
      max <- max(gregexpr("\\}", string)[[1]])
      # Keep only what lies between them
      content <- substring(string, min + 1, max - 1)
      return(content)
    }

Hope this helps.
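To see what the `encoding = "UTF-8"` argument of `readLines()` does, here is a small self-contained sketch. The temporary file and the sample entry are made up for illustration. The argument only marks the strings as UTF-8; it does not convert the bytes, so the file itself must already be UTF-8.

```r
# Write a tiny UTF-8 .bib file to a temporary location (sample data, invented)
path <- tempfile(fileext = ".bib")
entry <- c(
  "@book{nanjing2020,",
  "  author = {\u5357\u4eac \u9f13\u697c},",
  "  year = {2020}",
  "}"
)
# useBytes = TRUE writes the UTF-8 bytes as they are, whatever the locale
writeLines(enc2utf8(entry), con = path, useBytes = TRUE)

# Read it back, declaring the encoding of the input
bib <- readLines(path, encoding = "UTF-8")

Encoding(bib)   # marked encoding of each line (pure-ASCII lines stay "unknown")
validUTF8(bib)  # TRUE for every line whose bytes are valid UTF-8
```

If `validUTF8()` returns `FALSE` for some lines, the file is probably in another encoding and needs converting (for example with `iconv()`) before parsing.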

harkanatta commented 3 years ago

    library(plyr)          # loaded before dplyr so that dplyr's verbs are not masked
    library(dplyr)
    library(ggplot2)
    library(tidyr)
    library(humaniformat)
    library(stringr)

    file <- "A:/path/to/file.bib"

    bib2df_read <- function(file) {
      # Mark the input as UTF-8 and replace non-printing characters with spaces
      bib <- readLines(file, encoding = "UTF-8")
      bib <- str_replace_all(bib, "[^[:graph:]]", " ")
      return(bib)
    }

    # Needs text_between_curly_brackets() from the previous comment
    bib2df_gather <- function(bib) {
      # An entry starts on each line whose first visible character is "@"
      from <- which(str_extract(bib, "[:graph:]") == "@")
      to <- c(from[-1] - 1, length(bib))
      if (!length(from)) {
        return(empty)
      }
      itemslist <- mapply(
        function(x, y) return(bib[x:y]),
        x = from,
        y = to - 1,
        SIMPLIFY = FALSE
      )
      # Citation key: everything between "{" and the first comma
      keys <- lapply(itemslist, function(x) {
        str_extract(x[1], "(?<=\\{)[^,]+")
      })
      # Entry type: everything between "@" and "{"
      fields <- lapply(itemslist, function(x) {
        str_extract(x[1], "(?<=@)[^\\{]+")
      })
      fields <- lapply(fields, toupper)

      # Field names: the first word on each line
      categories <- lapply(itemslist, function(x) {
        str_extract(x, "[[:alnum:]_-]+")
      })

      dupl <- sum(
        unlist(
          lapply(categories, function(x) sum(duplicated(x[!is.na(x)])))
        )
      )

      if (dupl > 0) {
        message("Some BibTeX entries may have been dropped. The result could be malformed. Review the .bib file and make sure every single entry starts with a '@'.")
      }

      # Field values: everything after "=", then the text inside the braces
      values <- lapply(itemslist, function(x) {
        str_extract(x, "(?<==).*")
      })

      values <- lapply(values, function(x) {
        sapply(x, text_between_curly_brackets, simplify = TRUE, USE.NAMES = FALSE)
      })

      values <- lapply(values, trimws)
      items <- mapply(cbind, categories, values, SIMPLIFY = FALSE)
      items <- lapply(items, function(x) {
        x <- cbind(toupper(x[, 1]), x[, 2])
      })
      items <- lapply(items, function(x) {
        x[complete.cases(x), ]
      })
      items <- mapply(function(x, y) {
        rbind(x, c("CATEGORY", y))
      }, x = items, y = fields, SIMPLIFY = FALSE)

      # Turn each entry into a one-row data frame, then stack them
      items <- lapply(items, t)
      items <- lapply(items, function(x) {
        colnames(x) <- x[1, ]
        x <- x[-1, ]
        return(x)
      })
      items <- lapply(items, function(x) {
        x <- t(x)
        x <- data.frame(x, stringsAsFactors = FALSE)
        return(x)
      })
      dat <- bind_rows(c(list(empty), items))
      dat <- as_tibble(dat)
      dat$BIBTEXKEY <- unlist(keys)
      dat
    }

    # Zero-row template that fixes the standard BibTeX columns
    empty <- data.frame(
      CATEGORY = character(0L),
      BIBTEXKEY = character(0L),
      ADDRESS = character(0L),
      ANNOTE = character(0L),
      AUTHOR = character(0L),
      BOOKTITLE = character(0L),
      CHAPTER = character(0L),
      CROSSREF = character(0L),
      EDITION = character(0L),
      EDITOR = character(0L),
      HOWPUBLISHED = character(0L),
      INSTITUTION = character(0L),
      JOURNAL = character(0L),
      KEY = character(0L),
      MONTH = character(0L),
      NOTE = character(0L),
      NUMBER = character(0L),
      ORGANIZATION = character(0L),
      PAGES = character(0L),
      PUBLISHER = character(0L),
      SCHOOL = character(0L),
      SERIES = character(0L),
      TITLE = character(0L),
      TYPE = character(0L),
      VOLUME = character(0L),
      YEAR = character(0L),
      stringsAsFactors = FALSE
    )

    bib2df_tidy <- function(bib, separate_names = FALSE) {

      if (dim(bib)[1] == 0) {
        return(bib)
      }

      AUTHOR <- EDITOR <- YEAR <- CATEGORY <- NULL
      if ("AUTHOR" %in% colnames(bib)) {
        bib <- bib %>%
          mutate(AUTHOR = strsplit(AUTHOR, " and ", fixed = TRUE))
        if (separate_names) {
          # enc2native() added before the names are handed to humaniformat
          bib$AUTHOR <- lapply(bib$AUTHOR, function(x) x %>%
            enc2native() %>%
            format_reverse() %>%
            format_period() %>%
            parse_names())
        }
      }
      if ("EDITOR" %in% colnames(bib)) {
        bib <- bib %>%
          mutate(EDITOR = strsplit(EDITOR, " and ", fixed = TRUE))
        if (separate_names) {
          bib$EDITOR <- lapply(bib$EDITOR, function(x) x %>%
            enc2native() %>%
            format_reverse() %>%
            format_period() %>%
            parse_names())
        }
      }
      if ("YEAR" %in% colnames(bib)) {
        if (sum(is.na(as.numeric(bib$YEAR))) == 0) {
          bib <- bib %>%
            mutate(YEAR = as.numeric(YEAR))
        } else {
          message("Column YEAR contains character strings. No coercion to numeric applied.")
        }
      }
      bib <- bib %>%
        select(CATEGORY, dplyr::everything())
      return(bib)
    }

    # Run the three steps on the file
    bib <- bib2df_read(file)
    bib <- bib2df_gather(bib)
    bib <- bib2df_tidy(bib, separate_names = TRUE)

    # Example use: one point per author and publication year
    bib %>%
      select(YEAR, AUTHOR) %>%
      unnest(cols = c(AUTHOR)) %>%
      ggplot() +
      aes(x = YEAR, y = reorder(full_name, desc(YEAR))) +
      geom_point()

HedvigS commented 7 months ago

I think that this isn't something that can be addressed in the bib2df package itself. I've made a PR with a warning message if the file isn't ASCII, UTF-8 or UTF-16 which should help users address this on their own.
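For readers who want a similar check in their own scripts, here is a rough base-R sketch. It is not the code from the PR: the function name `warn_if_not_utf8` is hypothetical, and it only tests for valid UTF-8 (which includes plain ASCII), not UTF-16.

```r
# Hypothetical helper: warn when a file contains bytes that are not valid UTF-8.
warn_if_not_utf8 <- function(path) {
  lines <- readLines(path, warn = FALSE)
  ok <- all(validUTF8(lines))
  if (!ok) {
    warning("File does not look like UTF-8; consider converting it before parsing.")
  }
  invisible(ok)
}

# A UTF-8 file passes the check
good <- tempfile(fileext = ".bib")
writeLines(enc2utf8("author = {\u5357\u4eac}"), con = good, useBytes = TRUE)
good_ok <- warn_if_not_utf8(good)

# A Latin-1 file ("caf" followed by byte 0xE9) triggers the warning
bad <- tempfile(fileext = ".bib")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), bad)
bad_ok <- suppressWarnings(warn_if_not_utf8(bad))
```

A file that fails the check can usually be converted with `iconv()` once its real encoding is known.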