ropensci / handlr

convert among citation formats
https://docs.ropensci.org/handlr

large buffer overflows in bib #9

Closed yonicd closed 4 years ago

yonicd commented 5 years ago
x <- handlr::HandlrClient$new(x = '~/Desktop/mrg.bib')
x$read("bibtex")
#> Error in do_read_bib(file, encoding = .Encoding, srcfile): lex fatal error:
#> input buffer overflow, can't enlarge buffer because scanner uses REJECT
Session Info

```r
Session info --------------------------------------------------------------
 setting  value
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, darwin15.6.0
 ui       RStudio (1.2.1162)
 language (EN)
 collate  en_US.UTF-8
 tz       America/New_York
 date     2018-12-18

Packages ------------------------------------------------------------------
 package    * version    date       source
 base       * 3.5.1      2018-07-05 local
 bibtex       0.4.2      2017-06-30 CRAN (R 3.5.0)
 compiler     3.5.1      2018-07-05 local
 crul         0.6.0      2018-07-10 CRAN (R 3.5.0)
 curl         3.2        2018-03-28 CRAN (R 3.5.0)
 datasets   * 3.5.1      2018-07-05 local
 devtools     1.13.6     2018-06-27 CRAN (R 3.5.0)
 digest       0.6.18     2018-10-10 CRAN (R 3.5.0)
 graphics   * 3.5.1      2018-07-05 local
 grDevices  * 3.5.1      2018-07-05 local
 handlr     * 0.0.4.9210 2018-12-19 Github (ropensci/handlr@0252efc)
 httpcode     0.2.0      2016-11-14 CRAN (R 3.5.0)
 httr         1.4.0      2018-12-11 CRAN (R 3.5.0)
 jsonlite     1.6        2018-12-07 CRAN (R 3.5.0)
 lubridate    1.7.4      2018-04-11 CRAN (R 3.5.0)
 magrittr     1.5        2014-11-22 CRAN (R 3.5.0)
 memoise      1.1.0      2017-04-21 CRAN (R 3.5.0)
 methods    * 3.5.1      2018-07-05 local
 packrat      0.4.9-3    2018-06-01 CRAN (R 3.5.0)
 plyr         1.8.4      2016-06-08 CRAN (R 3.5.0)
 R6           2.3.0      2018-10-04 CRAN (R 3.5.0)
 Rcpp         1.0.0      2018-11-07 CRAN (R 3.5.0)
 RefManageR   1.2.0      2018-04-25 CRAN (R 3.5.0)
 stats      * 3.5.1      2018-07-05 local
 stringi      1.2.4      2018-07-20 CRAN (R 3.5.0)
 stringr      1.3.1      2018-05-10 CRAN (R 3.5.0)
 tools        3.5.1      2018-07-05 local
 triebeard    0.3.0      2016-08-04 CRAN (R 3.5.0)
 urltools     1.7.1      2018-08-03 CRAN (R 3.5.0)
 utils      * 3.5.1      2018-07-05 local
 withr        2.1.2      2018-03-15 CRAN (R 3.5.0)
 xml2         1.2.0      2018-01-24 CRAN (R 3.5.0)
```
sckott commented 5 years ago

can you give the file mrg.bib or at least some subset that makes it reproducible?

sckott commented 5 years ago

same thing happened for me with this example:

z <- system.file('extdata/crossref.bib', package = "handlr")
bibtex_reader(x = z)

but i updated to latest RefManageR on github (remotes::install_github("ropensci/RefManageR")) and now it works. let me know if that works.

it looks like ultimate cause may be in https://github.com/romainfrancois/bibtex/issues/16

yonicd commented 5 years ago

new error after installing, which is what i got with subsets of the original bib file.

Error in do_read_bib(file, encoding = .Encoding, srcfile) : 
  lex fatal error:
fatal flex scanner internal error--end of buffer missed

> packageVersion('RefManageR')
[1] ‘1.2.8’
sckott commented 5 years ago

i'm pretty sure that's a bibtex pkg problem

if you can share a reproducible example that will help narrow this down
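
One way to narrow a failure like this down is to cut the large file into smaller pieces and read each piece separately. The following is only a sketch in base R: `split_bib`, its arguments, and the `chunk_NNN.bib` file names are hypothetical and not part of handlr or bibtex, and it assumes every entry starts on its own line with `@`.

```r
# Hypothetical helper (not part of handlr or bibtex): write a large .bib
# file out as several smaller files so each one can be read on its own.
# Assumption: every entry starts on its own line with "@".
split_bib <- function(path, entries_per_file = 500, out_dir = tempdir()) {
  lines <- readLines(path, warn = FALSE)
  starts <- grep("^\\s*@", lines, perl = TRUE)
  if (length(starts) == 0) stop("no entries found in ", path)

  # Index of the entry each line belongs to (0 = text before the first entry)
  entry_id <- cumsum(seq_along(lines) %in% starts)
  keep <- entry_id > 0

  # Group consecutive entries into chunks of `entries_per_file`
  chunk_id <- (entry_id[keep] - 1) %/% entries_per_file + 1
  chunks <- split(lines[keep], chunk_id)

  out <- file.path(out_dir, sprintf("chunk_%03d.bib", seq_along(chunks)))
  for (i in seq_along(chunks)) writeLines(chunks[[i]], out[i])
  out
}
```

Each returned path can then be passed to `handlr::HandlrClient$new(x = ...)` followed by `x$read("bibtex")`. Since a failed read is reported later in this thread to affect subsequent reads until R is restarted, it is probably safest to try each chunk in a fresh session.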

yonicd commented 5 years ago

as it is written in the RefManageR in reprex that problem disappears :)

yonicd commented 5 years ago

here is a snippet of the bib file that is causing problems

@ARTICLE{Jia1996-yu,
    title   = "Errors in time in pharmacokinetic studies",
    author  = "Jia, X and Nedelman, J R",
    journal = "J. Biopharm. Stat.",
    volume  =  6,
    number  =  3,
    pages   = "303--318",
    year    =  1996
}

@MISC{noauthor_2018-bk,
    title        = "{APO-HYDROmorphone} {CR}",
    number       = "Control No: 210830",
    institution  = "Apotex Inc",
    month        =  may,
    year         =  2018,
    howpublished = "Product Monograph"
}

@ARTICLE{Langley2014-ra,
    title    = "Secukinumab in plaque psoriasis--results of two phase 3 trials",
    author   = "Langley, Richard G and Elewski, Boni E and Lebwohl, Mark and
    Reich, Kristian and Griffiths, Christopher E M and Papp, Kim and
    Puig, Llu{\'\i}s and Nakagawa, Hidemi and Spelman, Lynda and
    Sigurgeirsson, B{\'a}r{\dh}ur and Rivas, Enrique and Tsai,
    Tsen-Fang and Wasel, Norman and Tyring, Stephen and Salko, Thomas
    and Hampele, Isabelle and Notter, Marianne and Karpov, Alexander
    and Helou, Silvia and Papavassilis, Charis and {ERASURE Study
        Group} and {FIXTURE Study Group}",
    abstract = "BACKGROUND: Interleukin-17A is considered to be central to the
    pathogenesis of psoriasis. We evaluated secukinumab, a fully
    human anti-interleukin-17A monoclonal antibody, in patients with
    moderate-to-severe plaque psoriasis. METHODS: In two phase 3,
    double-blind, 52-week trials, ERASURE (Efficacy of Response and
    Safety of Two Fixed Secukinumab Regimens in Psoriasis) and
    FIXTURE (Full Year Investigative Examination of Secukinumab vs.
    Etanercept Using Two Dosing Regimens to Determine Efficacy in
    Psoriasis), we randomly assigned 738 patients (in the ERASURE
    study) and 1306 patients (in the FIXTURE study) to subcutaneous
    secukinumab at a dose of 300 mg or 150 mg (administered once
    weekly for 5 weeks, then every 4 weeks), placebo, or (in the
    FIXTURE study only) etanercept at a dose of 50 mg (administered
    twice weekly for 12 weeks, then once weekly). The objective of
    each study was to show the superiority of secukinumab over
    placebo at week 12 with respect to the proportion of patients who
    had a reduction of 75\% or more from baseline in the psoriasis
    area-and-severity index score (PASI 75) and a score of 0 (clear)
    or 1 (almost clear) on a 5-point modified investigator's global
    assessment (coprimary end points). RESULTS: The proportion of
    patients who met the criterion for PASI 75 at week 12 was higher
    with each secukinumab dose than with placebo or etanercept: in
    the ERASURE study, the rates were 81.6\% with 300 mg of
    secukinumab, 71.6\% with 150 mg of secukinumab, and 4.5\% with
    placebo; in the FIXTURE study, the rates were 77.1\% with 300 mg
    of secukinumab, 67.0\% with 150 mg of secukinumab, 44.0\% with
    etanercept, and 4.9\% with placebo (P<0.001 for each secukinumab
    dose vs. comparators). The proportion of patients with a response
    of 0 or 1 on the modified investigator's global assessment at
    week 12 was higher with each secukinumab dose than with placebo
    or etanercept: in the ERASURE study, the rates were 65.3\% with
    300 mg of secukinumab, 51.2\% with 150 mg of secukinumab, and
    2.4\% with placebo; in the FIXTURE study, the rates were 62.5\%
    with 300 mg of secukinumab, 51.1\% with 150 mg of secukinumab,
    27.2\% with etanercept, and 2.8\% with placebo (P<0.001 for each
    secukinumab dose vs. comparators). The rates of infection were
    higher with secukinumab than with placebo in both studies and
    were similar to those with etanercept. CONCLUSIONS: Secukinumab
    was effective for psoriasis in two randomized trials, validating
    interleukin-17A as a therapeutic target. (Funded by Novartis
    Pharmaceuticals; ERASURE and FIXTURE ClinicalTrials.gov numbers,
    NCT01365455 and NCT01358578, respectively.).",
    journal  = "N. Engl. J. Med.",
    volume   =  371,
    number   =  4,
    pages    = "326--338",
    month    =  jul,
    year     =  2014,
    language = "en"
}
sckott commented 5 years ago

thanks.

that example works for me as well

yonicd commented 5 years ago

now it works for me... when i do x$write('citeproc') i get only the first element back. how do i get the nth one?

x$write('citeproc')
{
  "type": "article-journal",
  "id": {},
  "categories": [],
  "language": {},
  "author": [
    {
      "type": "Person",
      "family": "Jia",
      "given": "X",
      "literal": "Jia"
    },
    {
      "type": "Person",
      "family": "Nedelman",
      "given": "J R",
      "literal": "Nedelman"
    }
  ],
  "editor": [],
  "issued": {
    "date-parts": {}
  },
  "submitted": {
    "date-parts": {}
  },
  "abstract": {},
  "container-title": {},
  "DOI": {},
  "issue": {},
  "page": "303318",
  "publisher": {},
  "title": "Errors in time in pharmacokinetic studies",
  "URL": {},
  "version": {},
  "volume": "6"
} 
yonicd commented 5 years ago

btw the page doesn't look like it was parsed right
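
For reference, the BibTeX source above has `pages = "303--318"`, while the citeproc output shows `"page": "303318"`, so the range separator appears to have been dropped. A minimal base-R sketch of how a `pages` value could be split into first and last page follows; `split_pages` is a hypothetical helper, not part of handlr, and it assumes ranges are written with one or more ASCII hyphens.

```r
# Hypothetical helper (not part of handlr): split a BibTeX `pages` value
# into first and last page. Assumes the range uses "-" or "--".
split_pages <- function(pages) {
  parts <- trimws(strsplit(trimws(pages), "-+")[[1]])
  list(
    first = parts[1],
    last  = if (length(parts) >= 2) parts[2] else NA_character_
  )
}

p <- split_pages("303--318")
p$first  # "303"
p$last   # "318"
```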

sckott commented 5 years ago

just back from vacation, will look at this tomorrow

sckott commented 5 years ago

@yonicd install from the pluralize branch with remotes::install_github("ropensci/handlr@pluralize"), restart R, load the pkg again, then try that example again. That branch should make all formats handle 1 or many citations. some formats don't really have a plural form that I know of (e.g., RIS), so those are written out to separate files if you write to file

yonicd commented 5 years ago

that works. thanks. i'm seeing another problem. may be worth another issue. if i load a bigger bib file (initial comment in this issue) then something is cached and clogs the reader until i refresh the session and then all seems to work again.

> x <- handlr::HandlrClient$new(x = '~/Desktop/mrg.bib') #big bib
> x$read("bibtex")
Error in do_read_bib(file, encoding = .Encoding, srcfile) : 
  lex fatal error:
input buffer overflow, can't enlarge buffer because scanner uses REJECT
> x <- handlr::HandlrClient$new(x = '~/Desktop/test1.bib') #snippet of bib
> x$read("bibtex")
Error in do_read_bib(file, encoding = .Encoding, srcfile) : 
  lex fatal error:
fatal flex scanner internal error--end of buffer missed

Restarting R session...

> x <- handlr::HandlrClient$new(x = '~/Desktop/test1.bib') #snippet of bib
> x$read("bibtex")
> x$write('citeproc')
[
  {
    "type": "article-journal",
    "id": {},
    "categories": [],
    "language": {},
    "author": [
      {
        "type": "Person",
        "family": "Jia",
        "given": "X",
        "literal": "Jia"
      },
      {
        "type": "Person",
        "family": "Nedelman",
        "given": "J R",
        "literal": "Nedelman"
      }
    ],
    "editor": [],
    "issued": {
      "date-parts": {}
    },
    "submitted": {
      "date-parts": {}
    },
    "abstract": {},
    "container-title": {},
    "DOI": {},
    "issue": {},
    "page": "303318",
    "publisher": {},
    "title": "Errors in time in pharmacokinetic studies",
    "URL": {},
    "version": {},
    "volume": "6"
  },
  {
    "type": "misc",
    "id": {},
    "categories": [],
    "language": {},
    "author": [],
    "editor": [],
    "issued": {
      "date-parts": {}
    },
    "submitted": {
      "date-parts": {}
    },
    "abstract": {},
    "container-title": {},
    "DOI": {},
    "issue": {},
    "page": "",
    "publisher": {},
    "title": "{APO-HYDROmorphone} {CR}",
    "URL": {},
    "version": {},
    "volume": {}
  },
  {
    "type": "article-journal",
    "id": {},
    "categories": [],
    "language": {},
    "author": [
      {
        "type": "Person",
        "family": "Langley",
        "given": "Richard G",
        "literal": "Langley"
      },
      {
        "type": "Person",
        "family": "Elewski",
        "given": "Boni E",
        "literal": "Elewski"
      },
      {
        "type": "Person",
        "family": "Lebwohl",
        "given": "Mark",
        "literal": "Lebwohl"
      },
      {
        "type": "Person",
        "family": "Reich",
        "given": "Kristian",
        "literal": "Reich"
      },
      {
        "type": "Person",
        "family": "Griffiths",
        "given": "Christopher E M",
        "literal": "Griffiths"
      },
      {
        "type": "Person",
        "family": "Papp",
        "given": "Kim",
        "literal": "Papp"
      },
      {
        "type": "Person",
        "family": "Puig",
        "given": "Lluís",
        "literal": "Puig"
      },
      {
        "type": "Person",
        "family": "Nakagawa",
        "given": "Hidemi",
        "literal": "Nakagawa"
      },
      {
        "type": "Person",
        "family": "Spelman",
        "given": "Lynda",
        "literal": "Spelman"
      },
      {
        "type": "Person",
        "family": "Sigurgeirsson",
        "given": "Bárður",
        "literal": "Sigurgeirsson"
      },
      {
        "type": "Person",
        "family": "Rivas",
        "given": "Enrique",
        "literal": "Rivas"
      },
      {
        "type": "Person",
        "family": "Tsai",
        "given": "Tsen-Fang",
        "literal": "Tsai"
      },
      {
        "type": "Person",
        "family": "Wasel",
        "given": "Norman",
        "literal": "Wasel"
      },
      {
        "type": "Person",
        "family": "Tyring",
        "given": "Stephen",
        "literal": "Tyring"
      },
      {
        "type": "Person",
        "family": "Salko",
        "given": "Thomas",
        "literal": "Salko"
      },
      {
        "type": "Person",
        "family": "Hampele",
        "given": "Isabelle",
        "literal": "Hampele"
      },
      {
        "type": "Person",
        "family": "Notter",
        "given": "Marianne",
        "literal": "Notter"
      },
      {
        "type": "Person",
        "family": "Karpov",
        "given": "Alexander",
        "literal": "Karpov"
      },
      {
        "type": "Person",
        "family": "Helou",
        "given": "Silvia",
        "literal": "Helou"
      },
      {
        "type": "Person",
        "family": "Papavassilis",
        "given": "Charis",
        "literal": "Papavassilis"
      },
      {
        "type": "Person",
        "family": "ERASURE Study Group",
        "given": "",
        "literal": "ERASURE Study Group"
      },
      {
        "type": "Person",
        "family": "FIXTURE Study Group",
        "given": "",
        "literal": "FIXTURE Study Group"
      }
    ],
    "editor": [],
    "issued": {
      "date-parts": {}
    },
    "submitted": {
      "date-parts": {}
    },
    "abstract": "BACKGROUND: Interleukin-17A is considered to be central to the\n\tpathogenesis of psoriasis. We evaluated secukinumab, a fully\n\thuman anti-interleukin-17A monoclonal antibody, in patients with\n\tmoderate-to-severe plaque psoriasis. METHODS: In two phase 3,\n\tdouble-blind, 52-week trials, ERASURE (Efficacy of Response and\n\tSafety of Two Fixed Secukinumab Regimens in Psoriasis) and\n\tFIXTURE (Full Year Investigative Examination of Secukinumab vs.\n\tEtanercept Using Two Dosing Regimens to Determine Efficacy in\n\tPsoriasis), we randomly assigned 738 patients (in the ERASURE\n\tstudy) and 1306 patients (in the FIXTURE study) to subcutaneous\n\tsecukinumab at a dose of 300 mg or 150 mg (administered once\n\tweekly for 5 weeks, then every 4 weeks), placebo, or (in the\n\tFIXTURE study only) etanercept at a dose of 50 mg (administered\n\ttwice weekly for 12 weeks, then once weekly). The objective of\n\teach study was to show the superiority of secukinumab over\n\tplacebo at week 12 with respect to the proportion of patients who\n\thad a reduction of 75\\% or more from baseline in the psoriasis\n\tarea-and-severity index score (PASI 75) and a score of 0 (clear)\n\tor 1 (almost clear) on a 5-point modified investigator's global\n\tassessment (coprimary end points). RESULTS: The proportion of\n\tpatients who met the criterion for PASI 75 at week 12 was higher\n\twith each secukinumab dose than with placebo or etanercept: in\n\tthe ERASURE study, the rates were 81.6\\% with 300 mg of\n\tsecukinumab, 71.6\\% with 150 mg of secukinumab, and 4.5\\% with\n\tplacebo; in the FIXTURE study, the rates were 77.1\\% with 300 mg\n\tof secukinumab, 67.0\\% with 150 mg of secukinumab, 44.0\\% with\n\tetanercept, and 4.9\\% with placebo (P<0.001 for each secukinumab\n\tdose vs. comparators). The proportion of patients with a response\n\tof 0 or 1 on the modified investigator's global assessment at\n\tweek 12 was higher with each secukinumab dose than with placebo\n\tor etanercept: in the ERASURE study, the rates were 65.3\\% with\n\t300 mg of secukinumab, 51.2\\% with 150 mg of secukinumab, and\n\t2.4\\% with placebo; in the FIXTURE study, the rates were 62.5\\%\n\twith 300 mg of secukinumab, 51.1\\% with 150 mg of secukinumab,\n\t27.2\\% with etanercept, and 2.8\\% with placebo (P<0.001 for each\n\tsecukinumab dose vs. comparators). The rates of infection were\n\thigher with secukinumab than with placebo in both studies and\n\twere similar to those with etanercept. CONCLUSIONS: Secukinumab\n\twas effective for psoriasis in two randomized trials, validating\n\tinterleukin-17A as a therapeutic target. (Funded by Novartis\n\tPharmaceuticals; ERASURE and FIXTURE ClinicalTrials.gov numbers,\n\tNCT01365455 and NCT01358578, respectively.).",
    "container-title": {},
    "DOI": {},
    "issue": {},
    "page": "326338",
    "publisher": {},
    "title": "Secukinumab in plaque psoriasis--results of two phase 3 trials",
    "URL": {},
    "version": {},
    "volume": "371"
  }
]
sckott commented 5 years ago

thanks - i'll see if i can replicate the problem. if you can't share the bib file, can you at least say how many lines or how many citations are in the file?

yonicd commented 5 years ago

6430 citations 142751 lines
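
Counts like these can be obtained without parsing the file at all. The sketch below uses base R only; `count_bib` is a hypothetical helper, and it assumes each entry starts on its own line in the form `@type{`, with `@comment`, `@string` and `@preamble` blocks not counted as citations.

```r
# Hypothetical helper: count lines and entries in a BibTeX file.
# Assumption: every entry starts on its own line with "@type{" or "@type(".
count_bib <- function(path) {
  lines <- readLines(path, warn = FALSE, encoding = "UTF-8")
  starts <- grepl("^\\s*@[A-Za-z]+\\s*[{(]", lines, perl = TRUE)
  # @comment, @string and @preamble blocks are not citations
  non_entries <- grepl("^\\s*@(comment|string|preamble)\\b", lines,
                       ignore.case = TRUE, perl = TRUE)
  c(lines = length(lines), entries = sum(starts & !non_entries))
}
```

Calling `count_bib("~/Desktop/mrg.bib")` would return a named vector with the number of lines and the number of entries, which is enough to describe the size of a file that cannot be shared.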

sckott commented 5 years ago

thanks

sckott commented 5 years ago

contacted bibtex author, hopefully he'll get back soon if there's a fix for bibtex error

sckott commented 5 years ago

reproducible example with a large file:

~11K citations, ~149K rows

x <- handlr::HandlrClient$new(x = '/Users/sckott/github/rosadmin/citations/citations.txt')
x$read("bibtex")
length(x$parsed)
#> [1] 11960
z <- x$write('citeproc')
class(z)
#> [1] "json"

This works fine for me.

citations.txt

yonicd commented 5 years ago

Cool. I’ll try to see where my bib file fails.

It was created by paperpile, so it could be on their end too...

Thanks for the follow up!

sckott commented 5 years ago

pulling dependency on dev RefManageR for now - falling back to CRAN version for the push of the first version of this pkg to CRAN - will bring back dev version of RefManageR after the push to cran

GeraldCNelson commented 5 years ago

I'm getting a similar error message with a large BibTeX file. The file is 16.9 MB. The code I used is

x <- handlr::HandlrClient$new(x = 'data-raw/consumption_14_19.bib')
x$read("bibtex")

The error message is

Error in do_read_bib(file, encoding = .Encoding, srcfile) : 
  lex fatal error:
fatal flex scanner internal error--end of buffer missed

consumption_14_19.bib was originally a .ris file that I converted using BibDesk. When I tried to read the .ris file in handlr with the following code, I got the following error message

> x <- handlr::HandlrClient$new(x = 'data-raw/consumption_14_19.ris')
Error in private$guess_format(x) : 
  could not guess format for string; specify format

I'm using the GitHub version of handlr.

sckott commented 5 years ago

thx for the report @GeraldCNelson , i was going to ask if you have dev version of RefManageR, from the other issue it appears you do have it. that fatal flex scanner error comes from bibtex, and it seems there's no fix in sight unfortunately. I'm still looking into ways to fix the issues with large bib files.

for the could not guess format error, can you share at least a subset of that file so I can see what the issue is

sckott commented 4 years ago

both RefManageR and bibtex pkgs have been updated. hopefully that sorts out the issue here.