ropensci / bib2df

Parse a BibTeX file to a tibble
https://docs.ropensci.org/bib2df

Parsing .bib fails when field separator is on the next line #56

Open cwverhey opened 2 years ago

cwverhey commented 2 years ago

bib2df::bib2df() fails to load fields when the field separator (",") is preceded by a newline, as in the following example:

@article{SHBP
,title = "Efficient DC Analysis of RVJ Circuits for Moment and Derivative Commutations of Interconnect Networks"
,author = " S. H. Batterywala and H. Narayanan "
,journal = "12th International Conference on VLSI Design"
,pages = "169-174"
,year = 1999
}

reprex:

f <- tempfile()
download.file('https://www.ee.iitb.ac.in/~trivedi/LatexHelp/Docs/ref.bib', f)
bib2df::bib2df(f)

With version 1.1.1, the fields are loaded into extra columns named "X.≪fieldname≫":

# A tibble: 9 × 41
  CATEGORY    BIBTE…¹ ADDRESS ANNOTE AUTHOR BOOKT…² CHAPTER CROSS…³ EDITION EDITOR HOWPU…⁴ INSTI…⁵ JOURNAL KEY   MONTH NOTE  NUMBER ORGAN…⁶
  <chr>       <chr>   <chr>   <chr>  <list> <chr>   <chr>   <chr>   <chr>   <list> <chr>   <chr>   <chr>   <chr> <chr> <chr> <chr>  <chr>  
1 ARTICLE     SHBP    NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
2 ARTICLE     SIE     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
3 BOOK        HN      NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
4 BOOK        DON     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
5 MASTERSTHE… GAK     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
6 MASTERSTHE… GT      NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
7 MASTERSTHE… NJB     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
8 MANUAL      PVM     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
9 MISC        PVMS    NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
# … with 23 more variables: PAGES <chr>, PUBLISHER <chr>, SCHOOL <chr>, SERIES <chr>, TITLE <chr>, TYPE <chr>, VOLUME <chr>, YEAR <chr>,
#   X.TITLE <chr>, X.AUTHOR <chr>, X.JOURNAL <chr>, X.PAGES <chr>, X.YEAR <chr>, X.VOLUME <chr>, X.NUMBER <chr>, X.PUBLISHER <chr>,
#   X.MONTH <chr>, X.SCHOOL <chr>, X.ORGANIZATION <chr>, X.ADDRESS <chr>, X.NOTE <chr>, X.KEY <chr>, X.HOWPUBLISHED <chr>, and abbreviated
#   variable names ¹​BIBTEXKEY, ²​BOOKTITLE, ³​CROSSREF, ⁴​HOWPUBLISHED, ⁵​INSTITUTION, ⁶​ORGANIZATION

With version 1.1.2 it doesn't load at all (all values are either NA, character(0) or an empty string):

# A tibble: 9 × 26
  CATEGORY    BIBTE…¹ ADDRESS ANNOTE AUTHOR BOOKT…² CHAPTER CROSS…³ EDITION EDITOR HOWPU…⁴ INSTI…⁵ JOURNAL KEY   MONTH NOTE  NUMBER ORGAN…⁶
  <chr>       <chr>   <chr>   <chr>  <list> <chr>   <chr>   <chr>   <chr>   <list> <chr>   <chr>   <chr>   <chr> <chr> <chr> <chr>  <chr>  
1 ARTICLE     SHBP    NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      ""      NA    NA    NA    NA     NA     
2 ARTICLE     SIE     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      ""      NA    NA    NA    ""     NA     
3 BOOK        HN      NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
4 BOOK        DON     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
5 MASTERSTHE… GAK     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    ""    NA    NA     NA     
6 MASTERSTHE… GT      NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    ""    NA    NA     NA     
7 MASTERSTHE… NJB     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    ""    NA    NA     NA     
8 MANUAL      PVM     ""      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    ""    ""    NA     ""     
9 MISC        PVMS    NA      NA     <chr>  NA      NA      NA      NA      <chr>  ""      NA      NA      ""    NA    NA    NA     NA     
# … with 8 more variables: PAGES <chr>, PUBLISHER <chr>, SCHOOL <chr>, SERIES <chr>, TITLE <chr>, TYPE <chr>, VOLUME <chr>, YEAR <chr>,
#   and abbreviated variable names ¹​BIBTEXKEY, ²​BOOKTITLE, ³​CROSSREF, ⁴​HOWPUBLISHED, ⁵​INSTITUTION, ⁶​ORGANIZATION

I am not sure how common this is (probably not at all), but this did happen on the first example .bib I found online and it seems like a basic parsing error.
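Until the parser handles this layout, one possible workaround is to normalize the lines before they reach bib2df::bib2df(). The following is a minimal sketch in base R. It assumes that every line starting with a comma continues the entry on the line directly above it; the helper name move_leading_commas is hypothetical and not part of bib2df.

```r
# Hypothetical helper, not part of bib2df: moves a comma that starts a line
# to the end of the previous line.
move_leading_commas <- function(lines) {
  leading <- grepl("^\\s*,", lines)
  # A leading comma on the very first line has no previous line to move to.
  leading[1] <- FALSE
  # Append a comma to each line that is followed by a leading-comma line.
  prev <- which(leading) - 1
  lines[prev] <- paste0(lines[prev], ",")
  # Drop the leading comma (and surrounding whitespace) from the affected lines.
  lines[leading] <- sub("^\\s*,\\s*", "", lines[leading])
  lines
}

# Lines taken from the example entry above.
entry <- c(
  "@article{SHBP",
  ",pages = \"169-174\"",
  ",year = 1999",
  "}"
)
fixed <- move_leading_commas(entry)
print(fixed)
```

The result could then be written to a temporary file with writeLines() and passed to bib2df::bib2df(). Whether that fully restores parsing in 1.1.2 has not been verified here.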

jeanetteclark commented 3 months ago

I've found a related issue where the parsing fails if there are no newlines between fields, i.e. the whole entry is on a single line.

bib <- "@Article{Binmore2008, Title = {Do Conventions Need to Be Common Knowledge?}, Author = {Binmore, Ken}, Journal = {Topoi}, Year = {2008}, Number = {1}, Pages = {17--27}, Volume = {27}}"

t <- tempfile()
writeLines(bib, t)

df <- bib2df::bib2df(t)
# A tibble: 1 × 27
  CATEGORY BIBTEXKEY   ADDRESS ANNOTE AUTHOR BOOKTITLE CHAPTER CROSSREF EDITION EDITOR HOWPUBLISHED INSTITUTION JOURNAL
  <chr>    <chr>       <chr>   <chr>  <list> <chr>     <chr>   <chr>    <chr>   <list> <chr>        <chr>       <chr>  
1 ARTICLE  Rangel_2023 NA      NA     <chr>  NA        NA      NA       NA      <chr>  NA           NA          NA     
# ℹ 14 more variables: KEY <chr>, MONTH <chr>, NOTE <chr>, NUMBER <chr>, ORGANIZATION <chr>, PAGES <chr>,
#   PUBLISHER <chr>, SCHOOL <chr>, SERIES <chr>, TITLE <chr>, TYPE <chr>, VOLUME <chr>, YEAR <chr>, ARTICLE <chr>

It definitely did not do this previously. I'm returning to an old project and none of the code is working correctly :(

giabaio commented 3 months ago

I think this can be fixed with

bib <- "@Article{Binmore2008, Title = {Do Conventions Need to Be Common Knowledge?}, Author = {Binmore, Ken}, Journal = {Topoi}, Year = {2008}, Number = {1}, Pages = {17--27}, Volume = {27}}"

t <- tempfile()
# Insert a newline after every comma before writing the file
writeLines(gsub(",", ",\n", bib), t)

df <- bib2df::bib2df(t)

which gives

df
# A tibble: 1 × 26
  CATEGORY BIBTEXKEY  ADDRESS ANNOTE AUTHOR BOOKTITLE CHAPTER CROSSREF EDITION EDITOR HOWPUBLISHED INSTITUTION JOURNAL KEY   MONTH NOTE  NUMBER ORGANIZATION PAGES PUBLISHER
  <chr>    <chr>      <chr>   <chr>  <list> <chr>     <chr>   <chr>    <chr>   <list> <chr>        <chr>       <chr>   <chr> <chr> <chr> <chr>  <chr>        <chr> <chr>    
1 ARTICLE  Binmore20… NA      NA     <chr>  NA        NA      NA       NA      <chr>  NA           NA          Topoi   NA    NA    NA    1      NA           17--… NA       
# ℹ 6 more variables: SCHOOL <chr>, SERIES <chr>, TITLE <chr>, TYPE <chr>, VOLUME <chr>, YEAR <dbl>

Does this help?

jeanetteclark commented 3 months ago

No, that does not work, because commas inside field values (such as author names) get a newline too, so the entry can end up looking like this:

 @article{Broadman_2020,
 title={Coupled impacts of sea ice variability and North Pacific atmospheric circulation on Holocene hydroclimate in Arctic Alaska},
 volume={117},
 ISSN={1091-6490},
 url={http://dx.doi.org/10.1073/PNAS.2016544117},
 DOI={10.1073/pnas.2016544117},
 number={52},
 journal={Proceedings of the National Academy of Sciences},
 publisher={Proceedings of the National Academy of Sciences},
 author={Broadman,
 Ellie and Kaufman,
 Darrell S. and Henderson,
 Andrew C. G. and Malmierca-Vallet,
 Irene and Leng,
 Melanie J. and Lacey,
 Jack H.},

I've put in a lot of time basically refactoring the entire parsing process, and I do think it is more robust. Unfortunately, I am stuck without a good workaround on one of the tests: the one allowing = in field values. I could potentially move forward if we were to decide on a list of allowed field names.
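For illustration, both problems in this thread (commas inside field values, and = inside field values) can in principle be avoided by splitting only on commas that sit outside braces and quotes, and then splitting each field at its first = only. Below is a minimal sketch in base R under those assumptions. The function names are hypothetical, the input is the body of one entry (the part after the citation key), escaped quotes are not handled, and this is not the refactored bib2df code.

```r
# Hypothetical helpers, not part of bib2df.

# Split the body of one entry on commas that are outside {...} and "...".
split_top_level <- function(body) {
  chars <- strsplit(body, "")[[1]]
  depth <- 0L
  in_quote <- FALSE
  cuts <- integer(0)
  for (i in seq_along(chars)) {
    ch <- chars[i]
    if (ch == "{") {
      depth <- depth + 1L
    } else if (ch == "}") {
      depth <- depth - 1L
    } else if (ch == "\"" && depth == 0L) {
      in_quote <- !in_quote
    } else if (ch == "," && depth == 0L && !in_quote) {
      cuts <- c(cuts, i)  # remember the position of each separator
    }
  }
  starts <- c(1L, cuts + 1L)
  ends <- c(cuts - 1L, length(chars))
  trimws(substring(body, starts, ends))
}

# Split "name = value" at the first "=" only, so "=" may appear in the value.
split_field <- function(field) {
  pos <- as.integer(regexpr("=", field, fixed = TRUE))
  c(name = toupper(trimws(substr(field, 1, pos - 1))),
    value = trimws(substr(field, pos + 1, nchar(field))))
}

# Author and Year come from the example above; the url value is made up
# purely to show an "=" inside a field value.
body <- "Author = {Binmore, Ken}, Year = {2008}, url = {https://example.org/?id=1}"
fields <- split_top_level(body)
parsed <- do.call(rbind, lapply(fields, split_field))
print(parsed)
```

Because the split happens at the first = of each top-level field, this approach would not need a fixed list of allowed field names, although it does rely on the braces and quotes in the file being balanced.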