gsub error 'unable to translate...to a wide string'

wilkox commented 1 year ago

Running read_bibliography() on a UTF-8 encoded file produces an error (see example file Cochrane.txt):

library(revtools)

system2("file", c("Cochrane.txt", "-I"), stdout = TRUE)
#> [1] "Cochrane.txt: text/plain; charset=utf-8"
read_bibliography("Cochrane.txt")
#> Warning in gsub("<[[:alnum:]]{2}>", "", z): unable to translate 'AB - Three
#> hundred healthy adults, permanently residing and contacting (a contact subject)
#> with a household patient with confirmed COVID^aEUR<90>19 (primary patient), or
#> who stayed in close long protected contact with a person who consequently
#> become...' to a wide string
#> Error in gsub("<[[:alnum:]]{2}>", "", z): input string 14 is invalid

Created on 2023-10-10 with reprex v2.0.2

This seems to arise from this line. I think the cause is that the encoding for z is set to 'latin1' even though the file is UTF-8, and since R 4.3.0 'Regular expression functions now check more thoroughly whether their inputs are valid strings (in their encoding, e.g. in UTF-8)'.
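To check whether a given file is affected, it can help to look at the bytes and the declared encoding of each line before calling read_bibliography(). The following is a minimal sketch only: inspect_lines() is a hypothetical helper, not part of revtools, and the sample strings are made up for illustration. Whether gsub() actually errors on a mislabelled string depends on the platform and R version, so the sketch only reports the encoding state and does not try to reproduce the error.

```r
# Hypothetical helper (not part of revtools): summarise the encoding state
# of each element of a character vector.
inspect_lines <- function(lines) {
  # TRUE where a line contains at least one byte above 0x7F
  has_non_ascii <- vapply(
    lines,
    function(l) any(as.integer(charToRaw(l)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
  data.frame(
    line = seq_along(lines),
    non_ascii = has_non_ascii,
    valid_utf8 = validUTF8(lines),  # are the bytes well-formed UTF-8?
    declared = Encoding(lines),     # what R has been told the encoding is
    stringsAsFactors = FALSE
  )
}

# Made-up sample: the second string contains U+2010 (HYPHEN)
x <- c("plain ASCII", "COVID\u201019")

# Same bytes, but declared as latin1 (the situation described above)
mislabelled <- x
Encoding(mislabelled) <- "latin1"

inspect_lines(x)
inspect_lines(mislabelled)
```

A line that is non-ASCII, valid UTF-8, and declared as latin1 is a likely candidate for the error reported here.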

A workaround is to convert the file into latin1 encoding first:

library(revtools)

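# Re-encode a UTF-8 file as latin1 so that read_bibliography() can parse it
utf8tolatin1 <- function(infile, outfile) {
  content <- readLines(infile, encoding = "UTF-8")         # mark input as UTF-8
  latin1 <- iconv(content, from = "UTF-8", to = "latin1")  # convert to latin1
  writeLines(latin1, outfile)
}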

utf8tolatin1("Cochrane.txt", "Cochrane-latin1.txt")

system2("file", c("Cochrane-latin1.txt", "-I"), stdout = TRUE)
#> [1] "Cochrane-latin1.txt: text/plain; charset=us-ascii"
read_bibliography("Cochrane-latin1.txt")
#>                   label type   accession       author
#> 1 NCT04907877_2021_http JOUR CN-02278011 NCT04907877,
#>                                                                title
#> 1 Bifido- and Lactobacilli in Symptomatic Adult COVID-19 Outpatients
#>                                       journal year                     keywords
#> 1 https://clinicaltrials.gov/show/NCT04907877 2021 Respiratory Tract Infections
#>   abstract
#> 1 Three hundred healthy adults, permanently residing and contacting (a contact subject) with a household patient with confirmed COVID-19 (primary patient), or who stayed in close long protected contact with a person who consequently become SARS-CoV-2 positive, will be screened for the study. When the contact subject meets enrollment criteria, he/she will be randomized to take an investigational product (probiotic, test dietary supplement, TDS), a mixture of lactobacilli and bifidobacteria or placebo 1 time a day before breakfast. During screening period, he/she will also keep Screening and Compliance Diary for screening of COVID-19 symptoms and confirming TDS intake. Duration of the screening period (Days 0-X) will depend on the health status of a contact person. If the contact remains asymptomatic, duration of probiotic intake will be 30 days. After this period, subject will be excluded from the study . If the contact develops symptoms, he/she will call family physician, request a referral, and visit a local center to make PCR test of the nasal swab for SARS-CoV-2. While result of PCR test are being available (Days 0-2), the patient will continue taking TDS and start keeping Respiratory Illness Diary. If the result of the PCR test is negative the patient will be withdrawn from the study. If the result is positive, he/she will continue participation and be visited by the nurse (Nurse Visit 1, Days 3-5), who supplies the patient with TDS in amount enough to complete 28-day intake period and takes blood for anti-SARS-CoV-2 IgG. 
During 28-day period of TDS intake, the patient will keep Respiratory Illness Diary (the Diary is designed for evaluation of the COVID-19 course and assessment of TDS reduces clinical manifestation of COVID-19), the investigator/family physician updated with health status, and the physician will make weekly phone calls to assess patient health status, indications for hospitalization, treatment, checking TDS intake and Respiratory Illness Diary. In the case of patient hospitalization, patient is withdrawn from the study, and will be requested to provide a reference from Medical Records after hospital discharge. During Nurse Visit 2 (Days 28-35), after finishing TDS intake, the nurse will collect Respiratory Illness Diary, empty vials with TDS, take blood for anti-SARS-CoV-2 IgG test. The test is necessary to evaluate if TDS intake improves post-COVID-19 immunity on short-term perspective. At the end of the 2nd visit, the nurse will give enveloped Post-COVID-19 Questionnaire to be completed in 3 months. In 3 months, investigator/family physician will call to the patient and remind to return a completed Post-COVID-19 Questionnaire. Post-COVID-19 Questionnaire will help to see if active TDS reduces presentation of Post-COVID-19 syndrome. In 6 months, the study nurse will perform Nurse Visit 3 and draw blood for the anti-SARS-CoV-2 IgG. The test is necessary to evaluate if TDS intake improves post-COVID-19 immunity on long term perspective.
#>                                                                            url
#> 1 https://www.cochranelibrary.com/central/doi/10.1002/central/CN-02278011/full
#>                  c3                    m3
#> 1 CTgov NCT04907877 Trial registry record

Created on 2023-10-10 with reprex v2.0.2
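One caveat with this kind of conversion: with the default sub = NA, iconv() returns NA for any element it cannot convert, and some characters (for example U+2010 HYPHEN) have no latin1 equivalent. Some iconv implementations transliterate such characters instead, so the result can differ between platforms. The sketch below is a hypothetical variant, not something tested against revtools: the name utf8tolatin1_checked() and the "?" substitute are assumptions chosen for illustration. It reports which lines would be lossy and substitutes the unconvertible bytes so that no line is written out as NA.

```r
# Hypothetical variant of the workaround (names are illustrative only)
utf8tolatin1_checked <- function(infile, outfile, sub = "?") {
  content <- readLines(infile, encoding = "UTF-8", warn = FALSE)

  # Strict pass: elements that cannot be converted come back as NA
  strict <- iconv(content, from = "UTF-8", to = "latin1")
  lossy <- which(is.na(strict))
  if (length(lossy) > 0) {
    warning(length(lossy),
            " line(s) contain characters with no latin1 equivalent",
            call. = FALSE)
  }

  # Lenient pass: replace unconvertible bytes with `sub` instead of NA
  latin1 <- iconv(content, from = "UTF-8", to = "latin1", sub = sub)
  writeLines(latin1, outfile, useBytes = TRUE)
  invisible(lossy)  # indices of the lines that lost information
}

# Small demonstration on a temporary file with made-up content
infile <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(c("AB  - caf\u00e9", "TI  - COVID\u201019"), infile, useBytes = TRUE)
lossy <- suppressWarnings(utf8tolatin1_checked(infile, outfile))
```

Replacing bytes loses information, so it is worth checking the returned indices before relying on the converted file.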

sy-olesya commented 10 months ago

David! Thanks a lot for your help! Unfortunately, this doesn't work for files from PubMed (.nbib). Could you help with those as well? Example file: vk.txt

wilkox commented 10 months ago

@sy-olesya Adding useBytes = TRUE to writeLines() seems to fix this particular problem. However, there is then another, apparently unrelated error (I had to truncate the input file as it couldn't fit the whole thing in memory):

library(revtools)

system2("file", c("~/tmp/vk.txt", "-I"), stdout = TRUE)
#> [1] "/Users/wilkox/tmp/vk.txt: text/plain; charset=utf-8"

utf8tolatin1 <- function(infile, outfile) {
  content <- readLines(infile, encoding = "UTF-8")
  latin1 <- iconv(content, from = "UTF-8", to = "latin1")
  writeLines(latin1, outfile, useBytes = TRUE)
}

utf8tolatin1("~/tmp/vk.txt", "~/tmp/vk-latin1.txt")

system2("file", c("~/tmp/vk-latin1.txt", "-I"), stdout = TRUE)
#> [1] "/Users/wilkox/tmp/vk-latin1.txt: text/plain; charset=iso-8859-1"
bib <- read_bibliography("~/tmp/vk-latin1.txt")
#> Error in names(x_final) <- unlist(lapply(x_final, function(a) {: 'names' attribute [254] must be the same length as the vector [43]

Created on 2024-01-31 with reprex v2.1.0

I had a poke around and I think it's not parsing the nbib file correctly. You might want to open a separate issue about this if you are still having trouble.

vivekrmk commented 4 months ago

I was getting the error: Error in gsub("[[:space:]]+", " ", x) : input string 12 is invalid. Setting the encoding option in readLines() to "latin1" fixed the issue for me; I did not receive the error again.
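For anyone trying this, note that the encoding argument of readLines() only marks the strings as being in that encoding; it does not re-encode the bytes. A minimal sketch with a made-up temporary file (the content and variable names are assumptions for illustration) shows that the bytes read are the same either way and only the declared encoding differs:

```r
# Write a made-up UTF-8 line (contains U+2010 HYPHEN) to a temporary file
path <- tempfile(fileext = ".txt")
writeLines("TI  - COVID\u201019", path, useBytes = TRUE)

# Read the same file twice with different declared encodings
as_utf8 <- readLines(path, encoding = "UTF-8")
as_latin1 <- readLines(path, encoding = "latin1")

Encoding(as_utf8)    # declared encoding of the first read
Encoding(as_latin1)  # declared encoding of the second read

# The underlying bytes are unchanged by the `encoding` argument
same_bytes <- identical(charToRaw(as_utf8), charToRaw(as_latin1))
same_bytes
```

So this approach avoids the error by changing how R labels the text; non-ASCII characters in a UTF-8 file may then display incorrectly.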

mjwestgate / revtools

gsub error 'unable to translate...to a wide string' #42