ropensci / RefManageR

R package RefManageR
https://docs.ropensci.org/RefManageR

toBiblatex can't handle CJK author names #106

Open kijinosu opened 2 months ago

kijinosu commented 2 months ago

I am trying to use RefManageR for biblatex bibliographies that include CJK text. While most of the bibliography is handled splendidly, toBiblatex replaces author names with question marks.

My understanding is that this is caused by using the old utils::person object.

Are there any workarounds or other ways to avoid this?

library(rlang)
library(RefManageR)
library(stringi)

b <- new_environment()
ls(b)

b$bib <- BibEntry(bibtype = "article", 
        key = "shiotsuki2011kasai", 
        title = "葛西賢太著,『現代瞑想論-変性意識がひらく世界-』",
        author = "塩月亮子", 
        journal = "宗教と社会",
        volume = 17,
        pages = "67--69",
        year = 2011, 
        publisher = "「宗教と社会」学会")

b$bib

toBiblatex(b$bib)

## @Article{shiotsuki2011kasai,
##   title = {葛西賢太著,『現代瞑想論-変性意識がひらく世界-』},
##   author = {{????}},
##   journal = {宗教と社会},
##   volume = {17},
##   pages = {67--69},
##   year = {2011},
##   publisher = {「宗教と社会」学会},
## }
kijinosu commented 2 months ago

Partial workaround that uses R package stringi:

b <- new_environment()
ls(b)

b$bib <- c(BibEntry(bibtype = "article", 
        key = "shiotsuki2011kasai", 
        title = "葛西賢太著,『現代瞑想論-変性意識がひらく世界-』",
        author = "塩,亮子 and 葛西,賢太", 
        journal = "宗教と社会",
        volume = 17,
        pages = "67--69",
        year = 2011, 
        publisher = "「宗教と社会」学会"),
        BibEntry(bibtype = "article", 
        key = "hiromitsu2022altered", 
        title = "意識状態の変容と脳内ネットワーク",
        author = "弘光健太郎 and ヒロミツケンタロウ", 
        journal = "鶴見大学仏教文化研究所紀要",
        volume = 27,
        pages = "53--66",
        year = 2022, 
        publisher = "鶴見大学")
        )
b$bib

b$biblatex <- toBiblatex(b$bib, escape=TRUE)
writeLines(b$biblatex)

## @Article{shiotsuki2011kasai,
##   title = {葛西賢太著,『現代瞑想論-変性意識がひらく世界-』},
##   author = {?? ? and ?? ??},
##   journal = {宗教と社会},
##   volume = {17},
##   pages = {67--69},
##   year = {2011},
##   publisher = {「宗教と社会」学会},
## }
## 
## @Article{hiromitsu2022altered,
##   title = {意識状態の変容と脳内ネットワーク},
##   author = {{?????} and {?????????}},
##   journal = {鶴見大学仏教文化研究所紀要},
##   volume = {27},
##   pages = {53--66},
##   year = {2022},
##   publisher = {鶴見大学},
## }

lapply(b$bib, function(v) {
    austr <- unlist(stri_split_boundaries(stri_flatten(unlist(v$author), collapse=""), type='character') )
    biblatex <- toBiblatex(v, escape=TRUE)
    auform <- as.character(biblatex['author'] )
    places <- stri_locate_all_regex(auform,"(?=\\?)", get_length=TRUE)[[1]][,1]
    replaced <- stri_sub_replace_all(auform,places,places,replacement=austr)
    biblatex['author'] <- replaced 
    writeLines(biblatex)
})

## @Article{shiotsuki2011kasai,
##   title = {葛西賢太著,『現代瞑想論-変性意識がひらく世界-』},
##   author = {亮子 塩 and 賢太 葛西},
##   journal = {宗教と社会},
##   volume = {17},
##   pages = {67--69},
##   year = {2011},
##   publisher = {「宗教と社会」学会},
## }
## @Article{hiromitsu2022altered,
##   title = {意識状態の変容と脳内ネットワーク},
##   author = {{弘光健太郎} and {ヒロミツケンタロウ}},
##   journal = {鶴見大学仏教文化研究所紀要},
##   volume = {27},
##   pages = {53--66},
##   year = {2022},
##   publisher = {鶴見大学},
## }
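The placeholder-replacement idea behind this workaround can be shown without RefManageR. The following is a minimal sketch, assuming each "?" in the mangled field stands for exactly one character of the original name, in the same order. restore_placeholders is a hypothetical helper name, not part of RefManageR or stringi, and it uses a fixed-string search instead of the lookahead regex used above.

```r
library(stringi)

# Hypothetical helper: put the original characters back where "?" appears.
# Assumes one "?" per original character, in the same order.
restore_placeholders <- function(mangled, original) {
  # Split the original name into user-perceived characters
  chars <- stri_split_boundaries(original, type = "character")[[1]]
  # Start positions of every literal "?" (NA when there is none)
  places <- stri_locate_all_fixed(mangled, "?")[[1]][, 1]
  # Leave the field untouched when there is nothing to restore
  # or when the counts do not line up
  if (anyNA(places) || length(places) != length(chars)) {
    return(mangled)
  }
  stri_sub_replace_all(mangled, places, places, replacement = chars)
}

restored <- restore_placeholders("{?????}", "弘光健太郎")
unchanged <- restore_placeholders("{??}", "弘光健太郎")
print(restored)
print(unchanged)
```

Returning the field unchanged on a count mismatch is a design choice for this sketch; a name that legitimately contains "?" would also defeat the one-to-one assumption.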
kijinosu commented 2 months ago

A more complete workaround:

write_bib <- function(bib, file=stdout(), overwrite=FALSE){
    # Write a BibEntry as biblatex, restoring name characters that toBiblatex() turned into "?".
    # 'overwrite' is kept from the original signature but is not used.
    if (!length(bib))
        return(invisible(NULL))
    if (!inherits(bib, 'BibEntry')){
        message("bib object is not a BibEntry object")
        return(invisible(NULL))
    }
    pAFields <- c("author","editor","translator")
    # 'file' may be an open connection such as stdout() or a path to open for writing.
    if (inherits(file, "connection")) {
        zz <- file
    } else {
        zz <- base::file(file, "w")
        on.exit(close(zz), add=TRUE)
    }

    invisible(lapply(bib, function(v) {
        biblatex <- unlist(toBiblatex(v, escape=TRUE))
        flds <- names(biblatex)
        for(pf in pAFields){
            if(pf %in% flds) {
                # Characters of the names stored in this field (not always the author field).
                austr <- unlist(stringi::stri_split_boundaries(stringi::stri_flatten(unlist(do.call("$", list(v, pf))), collapse=""), type='character') )
                auform <- as.character(biblatex[pf] )
                # Positions of the "?" placeholders; -1 means none were found.
                places <- stringi::stri_locate_all_regex(auform,"(?=\\?)", get_length=TRUE)[[1]][,1]
                if(places[1] > 0){
                    replaced <- tryCatch(
                        {
                            stringi::stri_sub_replace_all(auform,places,places,replacement=austr)
                        },
                        warning = function(cond) {
                            # On a length mismatch, log the details and leave the field unchanged.
                            writeLines(conditionMessage(cond),con=zz)
                            writeLines(paste("auform: ", auform),con=zz)
                            writeLines(paste("places: ", places),con=zz)
                            writeLines(paste("austr: ", austr),con=zz)
                            NULL
                        }
                    )
                    if(!is.null(replaced) && length(replaced) > 0) biblatex[pf] <- replaced
                }
            }
        }
        writeLines(biblatex,con=zz)
    }))
}
mwmclean commented 2 months ago

@kijinosu thanks for your report. I'm not able to reproduce the behaviour you describe on my machine/locale; I get an error just creating your BibEntry objects with CJK characters for journal and title. Can you share your sessionInfo() please?

My understanding is that this is caused by using the old utils::person object.

Can you elaborate on this?

Are you able to submit a pull request?

mwmclean commented 2 months ago

@kijinosu it turned out to be an issue with my IDE. I've opened #107; could you please install and test it and/or review it? 🙏

kijinosu commented 2 months ago

This worked partially:

b <- new_environment()
ls(b)

b$bib <- c(BibEntry(bibtype = "article", 
        key = "shiotsuki2011kasai", 
        title = "葛西賢太著,『現代瞑想論-変性意識がひらく世界-』",
        author = "塩,亮子 and 葛西,賢太", 
        journal = "宗教と社会",
        volume = 17,
        pages = "67--69",
        year = 2011, 
        publisher = "「宗教と社会」学会"),
        BibEntry(bibtype = "article", 
        key = "hiromitsu2022altered", 
        title = "意識状態の変容と脳内ネットワーク",
        author = "弘光健太郎 and ヒロミツケンタロウ", 
        journal = "鶴見大学仏教文化研究所紀要",
        volume = 27,
        pages = "53--66",
        year = 2022, 
        publisher = "鶴見大学")
        )

b$biblatex <- toBiblatex(b$bib)

writeLines(b$biblatex)
## @Article{shiotsuki2011kasai,
##   title = {葛西賢太著,『現代瞑想論-変性意識がひらく世界-』},
##   author = {亮子 塩 and 賢太 葛西},
##   journal = {宗教と社会},
##   volume = {17},
##   pages = {67--69},
##   year = {2011},
##   publisher = {「宗教と社会」学会},
## }
## 
## @Article{hiromitsu2022altered,
##   title = {意識状態の変容と脳内ネットワーク},
##   author = {{?????} and {?????????}},
##   journal = {鶴見大学仏教文化研究所紀要},
##   volume = {27},
##   pages = {53--66},
##   year = {2022},
##   publisher = {鶴見大学},
## }
mwmclean commented 2 months ago

Did you install the branch I mentioned with e.g. remotes::install_github("ROpenSci/RefManageR#107")?

kijinosu commented 2 months ago

@kijinosu thanks for your report. I'm not able to reproduce the behaviour you describe on my machine/locale; I get an error just creating your BibEntry objects with CJK characters for journal and title. Can you share your sessionInfo() please?

sessionInfo()
R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=Japanese_Japan.utf8 LC_CTYPE=Japanese_Japan.utf8 LC_MONETARY=Japanese_Japan.utf8 LC_NUMERIC=C LC_TIME=Japanese_Japan.utf8

time zone: Asia/Tokyo
tzcode source: internal

My understanding is that this is caused by using the old utils::person object.

Can you elaborate on this?

I seem to be mistaken about utils::person.

Are you able to submit a pull request?

Sorry, I am not familiar enough with GitHub.
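
Beyond sessionInfo(), a few base R calls can show how the session itself represents a CJK string, which may help when a problem appears on one machine and not another. This is a minimal sketch using only base R; the variable name diagnostics is chosen for illustration and nothing here is specific to RefManageR.

```r
# Inspect how the current R session represents a CJK string.
x <- "塩月亮子"
x_utf8 <- enc2utf8(x)

diagnostics <- list(
  ctype_locale = Sys.getlocale("LC_CTYPE"),       # locale used for character handling
  session_utf8 = isTRUE(l10n_info()[["UTF-8"]]),  # whether the session locale is UTF-8
  declared     = Encoding(x),                     # "UTF-8", "latin1", "bytes" or "unknown"
  code_points  = utf8ToInt(x_utf8),               # one integer per code point
  n_chars      = nchar(x_utf8, type = "chars")    # number of characters
)
str(diagnostics)
```

If session_utf8 is FALSE or the code points look wrong, the string was probably altered before any bibliography code touched it.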

mwmclean commented 2 months ago

Did you see my previous message? https://github.com/ropensci/RefManageR/issues/106#issuecomment-2370230829 Are you able to install R packages from GitHub?

kijinosu commented 2 months ago

Did you install the branch I mentioned with e.g. remotes::install_github("ROpenSci/RefManageR#107")?

Yes

mwmclean commented 2 months ago

I'm no longer able to produce output with ???? for the author names with that branch. This is tested with the unit tests here, and the tests pass on r-universe CI on macOS, Windows, and Ubuntu.
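
A check of this kind can also be run locally on the output of toBiblatex. The sketch below is only an illustration and is not taken from the package's unit tests: has_mangled_names is a hypothetical helper, and the sample lines are copied from the outputs earlier in this thread. It assumes that a "?" inside a name field signals a lost character, so a name that legitimately contains "?" would be flagged too.

```r
# Hypothetical check: flag name fields whose value still contains "?".
has_mangled_names <- function(biblatex_lines,
                              fields = c("author", "editor", "translator")) {
  # Match lines such as '  author = {...?...},'
  pattern <- paste0("^\\s*(", paste(fields, collapse = "|"), ")\\s*=.*\\?")
  any(grepl(pattern, biblatex_lines, perl = TRUE))
}

# Sample lines copied from the outputs shown earlier in this thread
before <- c("@Article{hiromitsu2022altered,",
            "  author = {{?????} and {?????????}},",
            "}")
after <- c("@Article{hiromitsu2022altered,",
           "  author = {{弘光健太郎} and {ヒロミツケンタロウ}},",
           "}")

mangled_before <- has_mangled_names(before)
mangled_after <- has_mangled_names(after)
print(c(before = mangled_before, after = mangled_after))
```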

kijinosu commented 2 months ago

How about replacing tools::encoded_text_to_latex with dplR::latexify?

mwmclean commented 2 months ago

How about replacing tools::encoded_text_to_latex with dplR::latexify?

What are the benefits?

kijinosu commented 2 months ago

It uses stringi, which is a wrapper for the International Components for Unicode, and handles CJK properly.

mwmclean commented 2 months ago

By handles CJK properly, you mean leaves it as is? That's what the PR currently does without adding an extra package as a dependency.

mwmclean commented 2 months ago

@kijinosu Looks like latexify fixes #102. You can test it out by installing #109. Thanks for the suggestion.

kijinosu commented 2 months ago

By handles CJK properly, you mean leaves it as is? That's what the PR currently does without adding an extra package as a dependency.

I mean that it uses stringi, which is a wrapper for ICU4C, the C/C++ implementation of International Components for Unicode maintained by the Unicode Consortium: https://icu.unicode.org/.
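
As a rough illustration of what Unicode-aware handling means here, the following sketch calls a few stringi functions on one of the author names from this issue. It assumes a UTF-8 session and shows stringi only; it does not show how dplR::latexify or RefManageR use it internally.

```r
library(stringi)

x <- "塩月亮子"

n_points <- stri_length(x)                                  # number of code points
chars <- stri_split_boundaries(x, type = "character")[[1]]  # one element per character
has_ideo <- stri_detect_regex(x, "\\p{Ideographic}")        # Unicode property match
icu_version <- stri_info()$ICU.version                      # ICU version in use by stringi

print(list(n_points = n_points, chars = chars,
           has_ideo = has_ideo, icu_version = icu_version))
```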