Open Mihiretukebede opened 4 years ago
Second this! It would make it really, really great. I work with a lot of Chinese-language bibliographical data, and df2bib always breaks the characters... I just spent 40 minutes tinkering with the code, randomly throwing in things like fields <- map(fields, ~as_utf8(.x)) just to see if it would do anything... No luck. I would love for this to work. Unfortunately I don't yet know what I don't know in order to try to help.
Note that when using the locale "chs", characters come out like this: ÄϾ©¹ÄÂ¥Ò½Ôº¸ÎÔàÒÆÖ²ÖÐÐÄ
When I use the "C" locale, they come out like this: <U+5357><U+4EAC><U+9F13><U+697C><U+533B> (and I can't really figure out how to turn them back into characters easily.)
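One way to turn those <U+XXXX> escapes back into characters is to parse the hex code points and rebuild the string. Below is a minimal base R sketch; unescape_unicode is a hypothetical helper name (not part of bib2df or humaniformat), and it assumes the escapes have exactly the <U+hex> form shown above.

```r
# Hypothetical helper: replace every "<U+XXXX>" escape with the character
# it stands for. Uses base R only.
unescape_unicode <- function(x) {
  matches <- gregexpr("<U\\+[0-9A-Fa-f]{4,6}>", x)
  regmatches(x, matches) <- lapply(regmatches(x, matches), function(tokens) {
    vapply(tokens, function(tok) {
      # Drop the leading "<U+" and the trailing ">" to keep only the hex digits.
      hex <- substr(tok, 4, nchar(tok) - 1)
      intToUtf8(strtoi(hex, base = 16L))
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

restored <- unescape_unicode("<U+5357><U+4EAC><U+9F13><U+697C><U+533B>")
nchar(restored)  # counts characters, not escapes
```

Whether the restored string then prints correctly still depends on the console and locale, so this only undoes the escaping; it does not fix the underlying encoding problem.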
My tests some time ago suggested the problem was in humaniformat rather than in bib2df itself.
I see. Any idea how to fix? Current workarounds are getting clunky, if they work at all...
I clean the file like this:
bsf.df <- readLines(filename.df, encoding = "UTF-8")
bsf.df <- str_replace_all(bsf.df, "[^[:graph:]]", " ")
bsf.df <- iconv(bsf.df, from = "UTF-8", to = "ASCII//TRANSLIT")
outfile <- "bsfdf.bib"
writeLines(bsf.df, con = outfile)
Then:
bib2df(outfile, separate_names = TRUE)
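Keep in mind that transliterating to ASCII is lossy for scripts such as Chinese that have no ASCII equivalent, and what //TRANSLIT substitutes is platform-dependent. Before cleaning, it may help to see which lines would be affected. A minimal base R sketch, assuming the lines are UTF-8; non_ascii_lines is a hypothetical helper name:

```r
# Hypothetical helper: flag lines that contain anything outside ASCII.
non_ascii_lines <- function(lines) {
  # Without a substitution argument, iconv() returns NA for any string it
  # cannot convert, so NA marks the lines that hold non-ASCII text.
  is.na(iconv(lines, from = "UTF-8", to = "ASCII"))
}

lines <- c("author = {Smith, John}", "author = {\u5357\u4eac}")
which(non_ascii_lines(lines))  # indices of lines the cleaning step would alter
```

If most author fields are flagged, the transliteration workaround will probably discard the very names you want to keep.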
What worked for me was to make small changes to two functions in the package:
bib2df_read
and bib2df_tidy
In the former, I set the encoding argument to UTF-8: readLines(file, encoding = "UTF-8")
In the latter, there are two lapply calls, and I added enc2native() %>%
to both of them, like so:
bib$AUTHOR <- lapply(bib$AUTHOR, function(x) x %>% enc2native() %>% format_reverse() %>% format_period() %>% parse_names())
To do this you have to download the code and look for the functions in a folder named R. You will also need the function text_between_curly_brackets:
text_between_curly_brackets <- function(string) {
  min <- min(gregexpr("\\{", string)[[1]])
  max <- max(gregexpr("\\}", string)[[1]])
  content <- substring(string, min + 1, max - 1)
  return(content)
}
Hope this helps.
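For context on why enc2native() can matter here: R strings carry an encoding mark, and enc2native() converts a string to the session's native encoding. The sketch below (base R only) checks whether a name survives that round trip in the current locale. survives_native is a hypothetical helper name, and the result for non-ASCII input depends on your locale, so no output is shown for it.

```r
# Hypothetical helper: TRUE if converting to the native encoding and back
# to UTF-8 leaves the string unchanged in this session.
survives_native <- function(x) {
  identical(enc2utf8(enc2native(x)), enc2utf8(x))
}

name <- "\u5357\u4eac"   # two CJK characters written as escapes
Encoding(name)           # strings built from \u escapes are marked "UTF-8"
survives_native(name)    # locale-dependent
survives_native("Smith") # plain ASCII is representable in every locale
```

If this returns FALSE for your author names, the native encoding cannot represent them, and the enc2native() change on its own is unlikely to be enough on that machine.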
library(dplyr)
library(ggplot2)
library(tidyr)
library(humaniformat)
library(plyr)
library(stringr)
file <- "A:/path/to/file.bib"
bib2df_read <- function(file) {
  # Read the file as UTF-8 and replace non-printing characters with spaces
  bib <- readLines(file, encoding = "UTF-8")
  bib <- str_replace_all(bib, "[^[:graph:]]", " ")
  return(bib)
}

bib2df_gather <- function(bib) {
  # Each entry starts on a line whose first printable character is "@"
  from <- which(str_extract(bib, "[:graph:]") == "@")
  to <- c(from[-1] - 1, length(bib))
  if (!length(from)) {
    return(empty)
  }
  itemslist <- mapply(
    function(x, y) return(bib[x:y]),
    x = from,
    y = to - 1,
    SIMPLIFY = FALSE
  )
  # Entry key: text after the first "{" up to the first comma
  keys <- lapply(itemslist, function(x) {
    str_extract(x[1], "(?<=\\{)[^,]+")
  })
  # Entry type: text between "@" and the first "{"
  fields <- lapply(itemslist, function(x) {
    str_extract(x[1], "(?<=@)[^\\{]+")
  })
  fields <- lapply(fields, toupper)
  categories <- lapply(itemslist, function(x) {
    str_extract(x, "[[:alnum:]-]+")
  })
  dupl <- sum(
    unlist(
      lapply(categories, function(x) sum(duplicated(x[!is.na(x)])))
    )
  )
  if (dupl > 0) {
    message("Some BibTeX entries may have been dropped. The result could be malformed. Review the .bib file and make sure every single entry starts with a '@'.")
  }
  values <- lapply(itemslist, function(x) {
    str_extract(x, "(?<==).*")
  })
  values <- lapply(values, function(x) {
    sapply(x, text_between_curly_brackets, simplify = TRUE, USE.NAMES = FALSE)
  })
  values <- lapply(values, trimws)
  items <- mapply(cbind, categories, values, SIMPLIFY = FALSE)
  items <- lapply(items, function(x) {
    x <- cbind(toupper(x[, 1]), x[, 2])
  })
  items <- lapply(items, function(x) {
    x[complete.cases(x), ]
  })
  items <- mapply(function(x, y) {
    rbind(x, c("CATEGORY", y))
  }, x = items, y = fields, SIMPLIFY = FALSE)
  items <- lapply(items, t)
  items <- lapply(items, function(x) {
    colnames(x) <- x[1, ]
    x <- x[-1, ]
    return(x)
  })
  items <- lapply(items, function(x) {
    x <- t(x)
    x <- data.frame(x, stringsAsFactors = FALSE)
    return(x)
  })
  dat <- bind_rows(c(list(empty), items))
  dat <- as_tibble(dat)
  dat$BIBTEXKEY <- unlist(keys)
  dat
}
empty <- data.frame(
  CATEGORY = character(0L),
  BIBTEXKEY = character(0L),
  ADDRESS = character(0L),
  ANNOTE = character(0L),
  AUTHOR = character(0L),
  BOOKTITLE = character(0L),
  CHAPTER = character(0L),
  CROSSREF = character(0L),
  EDITION = character(0L),
  EDITOR = character(0L),
  HOWPUBLISHED = character(0L),
  INSTITUTION = character(0L),
  JOURNAL = character(0L),
  KEY = character(0L),
  MONTH = character(0L),
  NOTE = character(0L),
  NUMBER = character(0L),
  ORGANIZATION = character(0L),
  PAGES = character(0L),
  PUBLISHER = character(0L),
  SCHOOL = character(0L),
  SERIES = character(0L),
  TITLE = character(0L),
  TYPE = character(0L),
  VOLUME = character(0L),
  YEAR = character(0L),
  stringsAsFactors = FALSE
)
bib2df_tidy <- function(bib, separate_names = FALSE) {
  if (dim(bib)[1] == 0) {
    return(bib)
  }
  AUTHOR <- EDITOR <- YEAR <- CATEGORY <- NULL
  if ("AUTHOR" %in% colnames(bib)) {
    bib <- bib %>%
      mutate(AUTHOR = strsplit(AUTHOR, " and ", fixed = TRUE))
    if (separate_names) {
      bib$AUTHOR <- lapply(bib$AUTHOR, function(x) x %>%
        enc2native() %>%
        format_reverse() %>%
        format_period() %>%
        parse_names())
    }
  }
  if ("EDITOR" %in% colnames(bib)) {
    bib <- bib %>%
      mutate(EDITOR = strsplit(EDITOR, " and ", fixed = TRUE))
    if (separate_names) {
      bib$EDITOR <- lapply(bib$EDITOR, function(x) x %>%
        enc2native() %>%
        format_reverse() %>%
        format_period() %>%
        parse_names())
    }
  }
  if ("YEAR" %in% colnames(bib)) {
    if (sum(is.na(as.numeric(bib$YEAR))) == 0) {
      bib <- bib %>%
        mutate(YEAR = as.numeric(YEAR))
    } else {
      message("Column YEAR contains character strings. No coercion to numeric applied.")
    }
  }
  bib <- bib %>%
    select(CATEGORY, dplyr::everything())
  return(bib)
}
bib <- bib2df_read(file)
bib <- bib2df_gather(bib)
bib <- bib2df_tidy(bib, separate_names = TRUE)
bib %>%
  select(YEAR, AUTHOR) %>%
  unnest(cols = c(AUTHOR)) %>%
  ggplot() +
  aes(x = YEAR, y = reorder(full_name, desc(YEAR))) +
  geom_point()
I think that this isn't something that can be addressed in the bib2df package itself. I've made a PR with a warning message if the file isn't ASCII, UTF-8 or UTF-16 which should help users address this on their own.
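For anyone who wants a similar check locally before calling bib2df, here is a rough base R sketch. It is not the code from the PR; warn_if_not_utf8 is a hypothetical helper name, and it only covers the ASCII/UTF-8 case (a UTF-16 file would need separate handling).

```r
# Hypothetical helper: warn when a .bib file contains lines that are not
# valid UTF-8. Plain ASCII is a subset of UTF-8, so it passes as well.
warn_if_not_utf8 <- function(path) {
  # Read the bytes as-is; no re-encoding happens at this point.
  lines <- readLines(path, warn = FALSE)
  bad <- which(!validUTF8(lines))
  if (length(bad) > 0) {
    warning(
      sprintf("%s: %d line(s) are not valid UTF-8 (first at line %d).",
              path, length(bad), bad[1]),
      call. = FALSE
    )
  }
  invisible(bad)
}

# Small demonstration with a temporary file that contains a Latin-1 byte.
demo <- tempfile(fileext = ".bib")
writeBin(c(charToRaw("title = {caf"), as.raw(0xE9), charToRaw("}\n")), demo)
flagged <- suppressWarnings(warn_if_not_utf8(demo))
```

Lines that get flagged usually mean the file was saved in another encoding, so re-saving it as UTF-8 in a text editor or reference manager is the first thing to try.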
I have difficulties reading some of my reference files. R Markdown reports an error in the UTF-8 encoding of my BibTeX files. Please add UTF-8 encoding support. This would be great!