racket / scribble


bibtex parse-pages may be a little too restrictive #302

Open wusunlab opened 3 years ago

wusunlab commented 3 years ago

The regex matching in parse-pages restricts the BibTeX pages field to two numbers separated by one or more hyphens. However, I occasionally encounter journals that use a single number or identifier in the pages field, or that use letters in the page numbers, or that use an en-dash instead of two hyphens as the separator. Each of these edge cases causes parse-pages to raise an error, so Scribble fails.

https://github.com/racket/scribble/blob/beb2d9834169665121d34b5f3195ddf252c3c998/scribble-lib/scriblib/bibtex.rkt#L439

(define (parse-pages ps)
  (match ps
    [(regexp #rx"^([0-9]+)\\-+([0-9]+)$" (list _ f l))
     (list f l)]
    [#f
     #f]
    [_
     (error 'parse-pages "Invalid page format ~e" ps)]))
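
To make the edge cases concrete, here is a minimal sketch that runs the definition quoted above against a few pages values. The helper `accepted?` and the sample strings are illustrative assumptions, not part of scriblib.

```racket
#lang racket/base
(require racket/match)

;; Copy of the parse-pages definition quoted above.
(define (parse-pages ps)
  (match ps
    [(regexp #rx"^([0-9]+)\\-+([0-9]+)$" (list _ f l))
     (list f l)]
    [#f #f]
    [_ (error 'parse-pages "Invalid page format ~e" ps)]))

;; Hypothetical helper: #t if parse-pages returns normally, #f if it raises.
(define (accepted? ps)
  (with-handlers ([exn:fail? (lambda (e) #f)])
    (parse-pages ps)
    #t))

;; Sample values: hyphenated ranges, an article identifier,
;; roman numerals, an en-dash range, and a single page number.
(for ([ps (in-list '("12--34" "12-34" "e1009070" "xi--xiv" "12–34" "57"))])
  (printf "~s => ~a\n" ps (if (accepted? ps) "accepted" "rejected")))
```

Only the two all-digit, hyphen-separated ranges are accepted; the other four values reach the final clause and raise the error.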

I'm not sure if it would be in the best interest of the developers to spend time fixing these edge cases. Would it be possible to have an option to override parse-pages (or make it a warning instead of an error)?
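
One way such an override could look is sketched below, assuming a parameter that holds the failure handler. The names `current-invalid-pages-handler` and `warn-and-skip` are hypothetical and do not exist in scriblib; the default handler keeps the current error behavior.

```racket
#lang racket/base
(require racket/match)

;; Hypothetical parameter (not part of scriblib): the procedure called
;; when the pages field does not match the expected format.
;; The default raises the same error as the current implementation.
(define current-invalid-pages-handler
  (make-parameter
   (lambda (ps) (error 'parse-pages "Invalid page format ~e" ps))))

(define (parse-pages ps)
  (match ps
    [(regexp #rx"^([0-9]+)\\-+([0-9]+)$" (list _ f l))
     (list f l)]
    [#f #f]
    ;; Delegate the failure case to the installed handler.
    [_ ((current-invalid-pages-handler) ps)]))

;; Hypothetical lenient handler: warn on stderr and treat the
;; pages field as absent by returning #f.
(define (warn-and-skip ps)
  (eprintf "parse-pages: ignoring unrecognized pages field ~e\n" ps)
  #f)

;; Usage: install the lenient handler for one call.
(parameterize ([current-invalid-pages-handler warn-and-skip])
  (parse-pages "e1009070"))
```

With this shape a document author could opt into the warning without changing the default for everyone else; whether a parameter is the right mechanism for scriblib is a design question for the maintainers.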

mfelleisen commented 3 years ago

I think your comments confuse the rendered output information (e.g., en-dash or rendered page ranges) with an easy-to-type (on my keyboard) data representation. BUT, I do agree that you have a point about using letters in page ranges. If I wanted to point to a text fragment in the preface of a book, it would be, say, '[xi x]. Why don't you tease those aspects apart in a bulleted list and we'll see what we can do.

brittAnderson commented 3 years ago

I just ran into this issue today. There are some pretty prestigious journals that break the pattern of beginning and ending pages. For example, here is a reference from PLOS Comp Biol.

@article{xu21_novel_is_not_surpr,
  author =   {He A. Xu and Alireza Modirshanechi and Marco P.
                  Lehmann and Wulfram Gerstner and Michael H. Herzog},
  title =    {Novelty Is Not Surprise: Human Exploratory and
                  Adaptive Behavior in Sequential Decision-Making},
  journal =  {PLOS Computational Biology},
  volume =   17,
  number =   6,
  pages =    {e1009070},
  year =     2021,
  doi =      {10.1371/journal.pcbi.1009070},
  url =      {https://doi.org/10.1371/journal.pcbi.1009070},
  DATE_ADDED =   {Sat Jun 12 08:00:50 2021},
}

As you can see, this journal reports the page as a single number with a leading letter. As it stands, this stops compilation of the whole Scribble document. My preferred behavior when the parse fails would be to insert some placeholder text into the page field, perhaps with a warning to stdout.

Here is a quick hack that shows something closer to what I would like. Here we get the correct data into the final reference, though there is still work to be done on the output formatting side of things to get rid of the redundant entry after the hyphen.

(define (parse-pages ps)
  (match ps
    [(regexp #rx"^([0-9]+)\\-+([0-9]+)$" (list _ f l))
     (list f l)]
    [(regexp #rx"^([a-z0-9]+)$" (list _ _))
     ;; single page or identifier: reuse it as both first and last page
     (list ps ps)]
    [#f
     (list "no page data " " no page data")]
    [_
     (error 'parse-pages "Invalid page format ~e" ps)]))

It gives output like:

Arthur Prat-Carrabin, Florent Meyniel, Misha Tsodyks, and Rava Azeredo da Silveira. Biases and Variability From Costly Bayesian Inference. CoRR, pp. no page data – no page data, 2021. http://arxiv.org/abs/2107.03231v1

He A. Xu, Alireza Modirshanechi, Marco P. Lehmann, Wulfram Gerstner, and Michael H. Herzog. Novelty Is Not Surprise: Human Exploratory and Adaptive Behavior in Sequential Decision-Making. PLOS Computational Biology 17(6), pp. e1009070–e1009070, 2021. https://doi.org/10.1371/journal.pcbi.1009070
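
The redundant entry after the dash could be avoided if the parser distinguished a single page from a range and the formatting step rendered each differently. The following is only a sketch under that assumption: `parse-pages*` and `format-pages` are hypothetical names, the "p." / "pp." labels are an illustrative choice, and none of this reflects how scriblib actually renders page fields.

```racket
#lang racket/base
(require racket/match)

;; Hypothetical relaxed parser: a two-element list for a range,
;; a one-element list for a single page or identifier, #f if absent.
(define (parse-pages* ps)
  (match ps
    [(regexp #rx"^([A-Za-z0-9]+)-+([A-Za-z0-9]+)$" (list _ f l))
     (list f l)]
    [(regexp #rx"^[A-Za-z0-9]+$")
     (list ps)]
    [#f #f]
    [_ (error 'parse-pages "Invalid page format ~e" ps)]))

;; Hypothetical renderer: ranges get "pp." and an en-dash,
;; single entries get "p.", a missing field renders as nothing.
(define (format-pages parsed)
  (match parsed
    [(list f l) (format "pp. ~a–~a" f l)]
    [(list p) (format "p. ~a" p)]
    [#f ""]))

(for ([ps (in-list '("e1009070" "xi--xiv" "12-34" #f))])
  (printf "~s => ~s\n" ps (format-pages (parse-pages* ps))))
```

Keeping the hyphen as the only accepted separator leaves the typed representation separate from the rendered en-dash, in line with the distinction raised earlier in the thread.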