ropensci / bibtex

bibtex parser for R
https://docs.ropensci.org/bibtex/

fatal flex scanner internal error--end of buffer missed #16

Closed narayanibarve closed 1 year ago

narayanibarve commented 7 years ago

I get this error when I read a .bib file. At first I thought it happened because the file is huge, with around 5000 citations, so I exported only 4 citations from this set to a separate .bib file. But even this 4-citation file does not work; I get the same error.

crsh commented 7 years ago

In some of the .bib files I have encountered, the error was caused by a single long field containing more than 10000 characters. Also see #14.
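A quick way to check a file for this is to look for unusually long lines before handing it to the parser. The following is only a minimal sketch: `find_long_lines` is a hypothetical helper (not part of bibtex), the 10000 limit is taken from the comment above, and a field that is wrapped over several lines would not be caught.

```r
# Hypothetical helper: report lines whose size exceeds `limit` bytes.
find_long_lines <- function(lines, limit = 10000) {
  len <- nchar(lines, type = "bytes")
  hit <- which(len > limit)
  data.frame(line = hit, bytes = len[hit])
}

# Toy input standing in for readLines("your_file.bib")
bib <- c(
  "@article{toy2017,",
  "  title = {A short title},",
  paste0("  abstract = {", strrep("x", 12000), "},"),
  "}"
)

long <- find_long_lines(bib)
print(long)
```

On a real file, replace the toy vector with `readLines()` on the .bib path and inspect the reported line numbers.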

rkrug commented 7 years ago

Anything happening here? I have the error as well and would really like to read the references into R.

Or are there any alternatives? I can use scan to read the file in, x <- scan(file=bibfile, multi.line = TRUE, sep = "\n", what = "character") followed by x <- trimws(x), but what then? How could I parse this object?
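One possible answer to "how could I parse this object", as a rough base-R sketch only. `parse_simple_bib` is a hypothetical function, not part of any package, and it is far from a full BibTeX grammar: it assumes each entry starts on a line beginning with `@type{key,` and that every field sits on a single line as `name = {value}` or `name = "value"`. Bare values such as `year = 2001` and multi-line fields are silently skipped.

```r
# Hypothetical minimal parser for a character vector of .bib lines.
parse_simple_bib <- function(x) {
  x <- trimws(x)
  starts <- grep("^@[A-Za-z]+\\{", x)     # lines that open an entry
  ends <- c(starts[-1] - 1L, length(x))   # each entry runs up to the next start
  lapply(seq_along(starts), function(i) {
    block <- x[starts[i]:ends[i]]
    # entry header: @type{key,
    head <- regmatches(
      block[1],
      regexec("^@([A-Za-z]+)\\{([^,]*),?", block[1])
    )[[1]]
    # single-line fields: name = {value}, or name = "value",
    m <- regmatches(
      block,
      regexec("^([A-Za-z]+)[[:space:]]*=[[:space:]]*[{\"](.*)[}\"],?$", block)
    )
    m <- m[lengths(m) == 3]
    fields <- vapply(m, function(f) f[3], character(1))
    names(fields) <- tolower(vapply(m, function(f) f[2], character(1)))
    list(type = tolower(head[2]), key = head[3], fields = fields)
  })
}

# Toy input standing in for the trimmed output of scan()
bib <- c(
  "@Article{toy2001,",
  "  author = {Doe, J.},",
  "  title = {A toy entry},",
  "  journal = \"Toy Journal\",",
  "}"
)

entries <- parse_simple_bib(bib)
str(entries[[1]])
```

The result is a plain list per entry, which can then be converted to a data frame or passed field by field to `utils::bibentry()`.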

romainfrancois commented 7 years ago

Can you prepare a reprex ?

rkrug commented 7 years ago

I am using Python for the task now. I had to adapt the workflow a bit, but now it works; and I am learning some python in parallel.

romainfrancois commented 7 years ago

@narayanibarve do you still have this problem? If so, can you prepare a reproducible example using the reprex package?

crsh commented 7 years ago

Here's a reprex for a case of a long field causing flex to break:

bibtex::read.bib("long_field.txt")
#> Error: lex fatal error:
#> input buffer overflow, can't enlarge buffer because scanner uses REJECT

long_field.txt

I used the current development version of bibtex from this repository.

crsh commented 7 years ago

Similarly, some reference managers (in this case Zotero) add a jabref comment to the bottom of the file, which causes the same error.

bibtex::read.bib("jabref_comment.txt")
#> Error: lex fatal error:
#> input buffer overflow, can't enlarge buffer because scanner uses REJECT

jabref_comment.txt

romainfrancois commented 7 years ago

Thanks. I'll have a look for the next version

swood-ecology commented 6 years ago

Just wanted to add to this that I'm having a similar problem reading in the attached .bib file from WoS.

soil.health_healthy.soil_1to500.bib.zip

Matherion commented 6 years ago

This cleans the BibTex comments, for anybody else dealing with this:

### First read file to remove the JabRef comment
cleanFile <- readLines(file.path(queryHitsPath, queryHitsFiles));

### Paste all strings together
cleanFile <- paste(cleanFile, collapse="\n");

### Remove jabref comments
cleanFile <- gsub("(?s)@[Cc]omment\\{jabref-meta:[^\\}]*\\}", "", cleanFile, perl=TRUE);

### Write clean file to disk
writeLines(cleanFile, con=file.path(queryHitsPath, "tmp-clean-file.bib"));

### Import references
queryHits[['1and2']] <- ReadBib(file.path(queryHitsPath, "tmp-clean-file.bib"));

However, for some reason it still fails to import, despite no field having even close to 10K characters in it. So there seem to be other errors, as well. Perhaps simply allowing one to specify a string to parse, and thereby letting people import the files on their own, can be a simple, relatively quick fix? Plus, would add functionality that can more generically be useful, so it wouldn't even be lost functionality once this bug (if it is once :-)) has been resolved :-)

Matherion commented 6 years ago

I'm no closer to solving this, but I remembered I'd actually written 'my own' function to import BibTex files, for a package I'm working on ('metabefor'). It's at https://github.com/Matherion/metabefor/blob/master/R/importBibtex.r, in case anybody's struggling with the same.

crsh commented 6 years ago

Any news on this?

kguidonimartins commented 6 years ago

Anything new on this? I had the same error using both the bibtex and RefManageR packages, and using the citr addin.

My try:

download.file(url = "https://gist.githubusercontent.com/kguidonimartins/6ca03106109cef5a891c67748b895e6a/raw/32c0e203de7875a1d13db6705aa9b507914a9fd9/library.bib", 
              destfile = "library.bib")

bibtex::read.bib(file = "library.bib")

RefManageR::ReadBib(file = "library.bib")

My session info:

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=pt_BR.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=pt_BR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] shiny_1.0.5.9000     Cite_0.1.0           rcrossref_0.8.1.9429
 [4] wordcountaddin_0.2.0 citr_0.2.0.9055      pacman_0.4.6        
 [7] knitr_1.20           picante_1.6-2        nlme_3.1-131        
[10] brranching_0.2.0     phytools_0.6-44      maps_3.2.0          
[13] data.table_1.10.4-3  flora_0.3.0          readxl_1.0.0        
[16] ape_5.0              betapart_1.5.0       forcats_0.3.0       
[19] stringr_1.3.0        dplyr_0.7.4          purrr_0.2.4         
[22] readr_1.1.1          tidyr_0.8.0          tibble_1.4.2        
[25] ggplot2_2.2.1        tidyverse_1.2.1      vegan_2.4-6         
[28] lattice_0.20-35      permute_0.9-4        bibtex_0.4.2        

loaded via a namespace (and not attached):
 [1] colorspace_1.3-2        rprojroot_1.3-2         rstudioapi_0.7         
 [4] urltools_1.7.0          DT_0.4                  mvtnorm_1.0-7          
 [7] lubridate_1.7.3         RefManageR_0.14.20      xml2_1.2.0             
[10] codetools_0.2-15        splines_3.4.3           mnormt_1.5-5           
[13] bold_0.5.0              jsonlite_1.5            broom_0.4.3            
[16] cluster_2.0.6           compiler_3.4.3          httr_1.3.1             
[19] backports_1.1.2         assertthat_0.2.0        Matrix_1.2-12          
[22] lazyeval_0.2.1          cli_1.0.0               later_0.7.1            
[25] htmltools_0.3.6         tools_3.4.3             bindrcpp_0.2           
[28] igraph_1.1.2            coda_0.19-1             gtable_0.2.0           
[31] glue_1.2.0              taxize_0.9.3            reshape2_1.4.3         
[34] clusterGeneration_1.3.4 fastmatch_1.1-0         Rcpp_0.12.16           
[37] msm_1.6.6               cellranger_1.1.0        crul_0.5.2             
[40] debugme_1.1.0           iterators_1.0.9         psych_1.7.8            
[43] rvest_0.3.2             mime_0.5                miniUI_0.1.1           
[46] phangorn_2.4.0          devtools_1.13.5         stringdist_0.9.4.7     
[49] MASS_7.3-49             zoo_1.8-1               scales_0.5.0           
[52] rcdd_1.2                hms_0.4.2               promises_1.0           
[55] parallel_3.4.3          expm_0.999-2            animation_2.5          
[58] yaml_2.1.18             curl_3.2                memoise_1.1.0          
[61] triebeard_0.3.0         reshape_0.8.7           stringi_1.1.7          
[64] foreach_1.4.4           plotrix_3.7             geometry_0.3-6         
[67] rlang_0.2.0             pkgconfig_2.0.1         evaluate_0.10.1        
[70] bindr_0.1.1             htmlwidgets_1.0         plyr_1.8.4             
[73] magrittr_1.5            R6_2.2.2                combinat_0.0-8         
[76] whisker_0.3-2           pillar_1.2.1            haven_1.1.1            
[79] foreign_0.8-69          withr_2.1.2             mgcv_1.8-23            
[82] survival_2.41-3         scatterplot3d_0.3-41    abind_1.4-5            
[85] modelr_0.1.1            crayon_1.3.4            rmarkdown_1.9          
[88] koRpus_0.10-2           grid_3.4.3              callr_2.0.2            
[91] reprex_0.1.2            digest_0.6.15           xtable_1.8-2           
[94] httpuv_1.3.6.9007       numDeriv_2016.8-1       munsell_0.4.3          
[97] shinyjs_1.0             magic_1.5-8             quadprog_1.5-5

kguidonimartins commented 6 years ago

The funny thing is that the code works using the reprex addin.

download.file(url = "https://gist.githubusercontent.com/kguidonimartins/6ca03106109cef5a891c67748b895e6a/raw/32c0e203de7875a1d13db6705aa9b507914a9fd9/library.bib", 
              destfile = "library.bib")
bibtex::read.bib(file = "library.bib")
#> Vellend M (2001). "Do commonly used indices of $\beta$ -diversity
#> measure species turnover ?" _Journal of Vegetation Science_, *12*,
#> pp. 545-552.
#> 
#> López-Mart\'inez JO, Sanaphre-Villanueva L, Dupuy JM,
#> Hernández-Stefanoni JL, Meave JA and Gallardo-Cruz JA (2013).
#> "$\beta$-Diversity of functional groups of woody plants in a
#> tropical dry forest in Yucatan." _PloS one_, *8*(9), pp. e73660.
#> ISSN 1932-6203, doi: 10.1371/journal.pone.0073660 (URL:
#> http://doi.org/10.1371/journal.pone.0073660), <URL:
#> http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3769343{\&}tool=pmcentrez{\&}rendertype=abstract>.
#> 
#> Swenson NG, Stegen JC, Davies SJ, Erickson DL, Forero-Montaña J,
#> Hurlbert AH, Kress WJ, Thompson J, Uriarte M, Wright SJ and
#> Zimmerman JK (2012). "Temporal turnover in the composition of
#> tropical tree communities: functional determinism and phylogenetic
#> stochasticity." _Ecology_, *93*(3), pp. 490-499. ISSN 0012-9658,
#> doi: 10.1890/11-1180.1 (URL: http://doi.org/10.1890/11-1180.1),
#> <URL: http://doi.wiley.com/10.1890/11-1180.1>.

RefManageR::ReadBib(file = "library.bib")
#> Warning in parse_Rd(Rd, encoding = encoding, fragment = fragment, ...):
#> <connection>:3: unknown macro '\beta'
#> Warning in parse_Rd(Rd, encoding = encoding, fragment = fragment, ...):
#> <connection>:3: unknown macro '\beta'
#> [1] J. O. López-Mart\'inez, L. Sanaphre-Villanueva, J. M. Dupuy,
#> et al. "$\beta$-Diversity of functional groups of woody plants in
#> a tropical dry forest in Yucatan.". In: _PloS one_ 8.9 (Jan.
#> 2013), p. e73660. ISSN: 1932-6203. DOI:
#> 10.1371/journal.pone.0073660. <URL:
#> http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3769343{\&}tool=pmcentrez{\&}rendertype=abstract>.
#> 
#> [2] N. G. Swenson, J. C. Stegen, S. J. Davies, et al. "Temporal
#> turnover in the composition of tropical tree communities:
#> functional determinism and phylogenetic stochasticity". In:
#> _Ecology_ 93.3 (Mar. 2012), pp. 490-499. ISSN: 0012-9658. DOI:
#> 10.1890/11-1180.1. <URL: http://doi.wiley.com/10.1890/11-1180.1>.
#> 
#> [3] M. Vellend. "Do commonly used indices of $\beta$ -diversity
#> measure species turnover ?". In: _Journal of Vegetation Science_
#> 12 (2001), pp. 545-552.

swood-ecology commented 6 years ago

I've been reading bib files with readFiles in the bibliometrix package.

mohdkarim commented 4 years ago

Hi,

I am using citr and R Markdown with Zotero. I partially got around this problem with crsh's suggestion of omitting the abstract, but some BibTeX entries have 500 to 1000+ author names, which reproduces the problem.

Any suggestions? Has anyone come up with a solution to this?
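For the "omit the abstract" workaround mentioned above, a field can be stripped from the raw text before parsing. This is a sketch under stated assumptions: `drop_field` is a hypothetical helper, it only handles brace-delimited values (`abstract = {...}`), and it counts braces naively, so escaped braces such as `\{` inside the value would confuse it. It does not address the long author lists.

```r
# Hypothetical helper: remove every `field = {...}` from BibTeX text,
# matching braces so that nested {...} inside the value are handled.
drop_field <- function(text, field = "abstract") {
  pattern <- paste0("(?i)\\b", field, "\\s*=\\s*\\{")
  repeat {
    m <- regexpr(pattern, text, perl = TRUE)
    if (m == -1L) break
    start <- as.integer(m)
    pos <- start + attr(m, "match.length")  # first character after the opening brace
    chars <- strsplit(substring(text, pos), "")[[1]]
    depth <- cumsum((chars == "{") - (chars == "}"))
    close <- which(depth == -1)[1]          # index of the matching closing brace
    if (is.na(close)) break                 # unbalanced braces: leave the text alone
    end <- pos + close - 1L
    # also swallow the comma and whitespace that follow the field
    rest <- sub("^\\s*,?\\s*", "", substring(text, end + 1L), perl = TRUE)
    text <- paste0(substring(text, 1L, start - 1L), rest)
  }
  text
}

# Toy entry; a real file would be read with
# paste(readLines("your_file.bib"), collapse = "\n")
bib <- paste(
  "@article{toy,",
  "  author = {Doe, J.},",
  "  abstract = {Long {nested} text},",
  "  year = {2020}",
  "}",
  sep = "\n"
)

clean <- drop_field(bib)
cat(clean, "\n")
```

The cleaned text can be written to a temporary file with `writeLines()` and then read with the usual import function.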

AmiZya commented 4 years ago

I have the same problem with R Markdown and citr. Is there any suggested solution for this, please?

NeutralKaon commented 4 years ago

I am having this issue when parsing a long list of authors too. Any progress?

dieghernan commented 2 years ago

Hi, I think this issue may be closed after #47

I parsed all your example files with the upcoming version of bibtex, where the C code is replaced by R code, and the described issue is no longer observed. The files are read accordingly:

# PR 47 https://github.com/ropensci/bibtex/pull/47

library(bibtex)

# File 1 ----

f1 <- tempfile("file1", fileext = ".txt")

download.file(
  "https://github.com/romainfrancois/bibtex/files/1120203/long_field.txt",
  f1
)

ex1 <- read.bib(f1)
ex1
#> Batzill M (2012). "The Surface Science of Graphene: Metal Interfaces,
#> CVD Synthesis, Nanoribbons, Chemical Modifications, and Defects."
#> _SURFACE SCIENCE REPORTS_, *67*(3-4), 83-115. ISSN 0167-5729, doi:
#> 10.1016/j.surfrep.2011.12.001 (URL:
#> https://doi.org/10.1016/j.surfrep.2011.12.001).

# File 2 ----
f2 <- tempfile("file2", fileext = ".txt")

download.file(
  "https://github.com/romainfrancois/bibtex/files/1120229/jabref_comment.txt",
  f2
)

ex2 <- read.bib(f2)
ex2
#> Gómez RL (2002). "Variability and Detection of Invariant Structure."
#> _Psychological Science_, *13*(5), 431-436. ISSN 0956-7976, 1467-9280,
#> doi: 10.1111/1467-9280.00476 (URL:
#> https://doi.org/10.1111/1467-9280.00476), <URL: 2015-01-20>.

# File 3 -----
f3 <- tempfile("file3", fileext = ".zip")
download.file(
  "https://github.com/romainfrancois/bibtex/files/1229495/soil.health_healthy.soil_1to500.bib.zip",
  f3
)

unzip(f3, junkpaths = TRUE, exdir = tempdir())
ex3 <- read.bib(
  file.path(
    tempdir(),
    "soil.health_healthy.soil_1to500.bib"
  )
)
#> ignoring entry 'ISI:000268383100002' (line 34779) because :
#>  A bibentry of bibtype 'InCollection' has to specify the field: author
#> ignoring entry 'ISI:000268383100003' (line 34853) because :
#>  A bibentry of bibtype 'InCollection' has to specify the field: author
#> ignoring entry 'ISI:000268383100004' (line 34928) because :
#>  A bibentry of bibtype 'InCollection' has to specify the field: author
#> ignoring entry 'ISI:000268383100005' (line 34999) because :
#>  A bibentry of bibtype 'InCollection' has to specify the field: author
#> ignoring entry 'ISI:000268383100006' (line 35080) because :
#>  A bibentry of bibtype 'InCollection' has to specify the field: author
#> ignoring entry 'ISI:000268383100008' (line 35134) because :
#>  A bibentry of bibtype 'InCollection' has to specify the field: author
#> ignoring entry 'ISI:000268383100010' (line 35192) because :
#>  A bibentry of bibtype 'InCollection' has to specify the field: author

length(ex3)
#> [1] 493

# Small sample of entries, since the file has 500 (493 read)

ex3[1:5]
#> FORMAN J (1951). "SOIL, HEALTH, AND THE DENTAL PROFESSION." _JOURNAL OF
#> PROSTHETIC DENTISTRY_, *1*(5), 508-522. ISSN 0022-3913, doi:
#> 10.1016/0022-3913(51)90037-6 (URL:
#> https://doi.org/10.1016/0022-3913(51)90037-6).
#> 
#> SHARMA N, MADAN M (1983). "EARTHWORMS FOR SOIL HEALTH AND
#> POLLUTION-CONTROL." _JOURNAL OF SCIENTIFIC \& INDUSTRIAL RESEARCH_,
#> *42*(10), 575-583. ISSN 0022-4456.
#> 
#> HABERERN J (1992). "A SOIL HEALTH INDEX." _JOURNAL OF SOIL AND WATER
#> CONSERVATION_, *47*(1), 6. ISSN 0022-4561.
#> 
#> [Anonymous] (1993). "THE BREAD CORNER - NO BREAD WITHOUT HEALTHY SOIL."
#> _ALIMENTA_, *32*(3), 45. ISSN 0002-5402.
#> 
#> Watts M (1994). "Pesticide residues in food: The views of the Soil \&
#> Health Association of New Zealand." In Savage, GP (ed.), _PROCEEDINGS
#> OF THE NUTRITION SOCIETY OF NEW ZEALAND, VOL 19_, volume 19 number 0
#> series PROCEEDINGS OF THE NUTRITION SOCIETY OF NEW ZEALAND, 58-63. Nutr
#> Soc New Zealand, ANIMAL \& VETERINARY SCI GROUP, LINCOLN UNIVERSITY, PO
#> BOX 84, CANTERBURY, NEW ZEALAND. 29th Annual Conference of the
#> Nutrition-Society-of-New-Zealand, CHRISTCHURCH, NEW ZEALAND, AUG, 1994.

# From gist ----
gist <- tempfile(fileext = ".bib")

download.file(
  url = "https://gist.githubusercontent.com/kguidonimartins/6ca03106109cef5a891c67748b895e6a/raw/32c0e203de7875a1d13db6705aa9b507914a9fd9/library.bib",
  destfile = gist
)

bibtex::read.bib(file = gist)
#> Vellend M (2001). "Do commonly used indices of $\beta$ -diversity
#> measure species turnover ?" _Journal of Vegetation Science_, *12*,
#> 545-552.
#> 
#> López-Mart\'inez JO, Sanaphre-Villanueva L, Dupuy JM,
#> Hernández-Stefanoni JL, Meave JA, Gallardo-Cruz JA (2013).
#> "$\beta$-Diversity of functional groups of woody plants in a tropical
#> dry forest in Yucatan." _PloS one_, *8*(9), e73660. ISSN 1932-6203,
#> doi: 10.1371/journal.pone.0073660 (URL:
#> https://doi.org/10.1371/journal.pone.0073660), <URL:
#> http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3769343{\&}tool=pmcentrez{\&}rendertype=abstract>.
#> 
#> Swenson NG, Stegen JC, Davies SJ, Erickson DL, Forero-Montaña J,
#> Hurlbert AH, Kress WJ, Thompson J, Uriarte M, Wright SJ, Zimmerman JK
#> (2012). "Temporal turnover in the composition of tropical tree
#> communities: functional determinism and phylogenetic stochasticity."
#> _Ecology_, *93*(3), 490-499. ISSN 0012-9658, doi: 10.1890/11-1180.1
#> (URL: https://doi.org/10.1890/11-1180.1), <URL:
#> http://doi.wiley.com/10.1890/11-1180.1>.

Created on 2022-01-17 by the reprex package (v2.0.1)

Session info

``` r
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.2 (2021-11-01)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Spanish_Spain.1252
#>  ctype    Spanish_Spain.1252
#>  tz       Europe/Paris
#>  date     2022-01-17
#>  pandoc   2.14.0.3 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
#>
#> - Packages -------------------------------------------------------------------
#>  package     * version date (UTC) lib source
#>  backports     1.4.1   2021-12-13 [1] CRAN (R 4.1.2)
#>  bibtex      * 0.5.0   2022-01-17 [1] local
#>  cli           3.1.0   2021-10-27 [1] CRAN (R 4.1.1)
#>  crayon        1.4.2   2021-10-29 [1] CRAN (R 4.1.1)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.1)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.1)
#>  fansi         1.0.0   2022-01-10 [1] CRAN (R 4.1.2)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.1)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
#>  glue          1.6.0   2021-12-17 [1] CRAN (R 4.1.2)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.1)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.1)
#>  knitr         1.37    2021-12-16 [1] CRAN (R 4.1.2)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.1)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.1)
#>  pillar        1.6.4   2021-10-18 [1] CRAN (R 4.1.1)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.1)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.1)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.1.1)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.1.1)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.1.1)
#>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.1.1)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.1)
#>  rlang         0.4.12  2021-10-18 [1] CRAN (R 4.1.1)
#>  rmarkdown     2.11    2021-09-14 [1] CRAN (R 4.1.1)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.1.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.1)
#>  styler        1.6.2   2021-09-23 [1] CRAN (R 4.1.1)
#>  tibble        3.1.6   2021-11-07 [1] CRAN (R 4.1.2)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.1)
#>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.1)
#>  withr         2.4.3   2021-11-30 [1] CRAN (R 4.1.2)
#>  xfun          0.29    2021-12-14 [1] CRAN (R 4.1.2)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.1)
#>
#>  [1] C:/Users/diego/Documents/R/win-library/4.1
#>  [2] C:/Program Files/R/R-4.1.2/library
#>
#> ------------------------------------------------------------------------------
```
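The "ignoring entry" messages in the reprex above read like the required-field check that base R's `utils::bibentry()` performs. Assuming the parser builds each entry through that function, the check can be reproduced on its own. The field values below are made up for illustration, and the exact wording of the message may depend on the R version and locale.

```r
# An InCollection entry without an author: bibentry() refuses to build it.
res <- tryCatch(
  utils::bibentry(
    bibtype   = "InCollection",
    title     = "A toy chapter",
    booktitle = "A toy book",
    publisher = "Toy Press",
    year      = 2009
  ),
  error = function(e) conditionMessage(e)  # keep the message instead of stopping
)

print(res)
```

If such entries matter, adding a placeholder author to the .bib file (or changing the entry type) before import is one way to keep them from being dropped.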