retorquere / zotero-better-bibtex

Make Zotero effective for us LaTeX holdouts
https://retorque.re/zotero-better-bibtex/
MIT License

Some bibtex entries quietly discarded on import from bib file #873

Closed rpgoldman closed 6 years ago

rpgoldman commented 6 years ago

Bug classification

Problem with import. Tried to import large bib file. At least one entry, here:

@Book{books/mk/Ginsberg93,
  author =       GINSBERG,
  title =        "Essentials of Artificial Intelligence",
  publisher =    KAUFMANN,
  year =         "1993",
  address =      KAUFMANN-ADDRESS,
  topic =        "AI-intro;AI-text;",
  ISBN =         "1-55860-221-6"
}

failed to appear in Zotero after import (searching for "Essentials" finds nothing), although the import seems to complete. I suppose that the use of @string defined abbreviations (GINSBERG, KAUFMANN, KAUFMANN-ADDRESS) might be at fault, but if that's the problem, shouldn't the import raise an error?

Non-export problems with BBT

If your issue is a bug report, but not for exports, restart Zotero with debugging enabled (Help -> Debug Output Logging -> Restart with logging enabled), reproduce your problem, and select "Report Better BibTeX error" from the help menu, and post the resulting report ID (shown in red after you submit) here.

Report ID: H68P37J8

retorquere commented 6 years ago

In the log I see some mentions of dropped fields (which I'll look at later), but otherwise, this imports without issue. Did you not get a note among the imported references? BBT documents errors in a freestanding (newly imported) note rather than throwing an error.

The @string references should import without issue, but not from this freestanding sample of course, as it doesn't include the @string definitions; during import, the references are assumed to be strings if they can't be resolved, so I get a publisher by the name of KAUFMANN after import.
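The fallback described above can be sketched roughly as follows. This is an illustrative Python sketch, not BBT's actual code: `collect_macros` and `resolve` are hypothetical helper names, the regex only handles simple quoted @string values, and the `"Morgan Kaufmann"` definition is made up for the example because the real @string definitions are not part of the sample.

```python
import re

# Hypothetical helpers for illustration; not BBT's implementation.
# Matches simple definitions such as: @string{NAME = "value"}
STRING_DEF = re.compile(
    r'@string\s*[{(]\s*([^\s=]+)\s*=\s*"([^"]*)"\s*[})]',
    re.IGNORECASE,
)

def collect_macros(bib_text):
    """Map each @string name (lower-cased) to its quoted value."""
    return {name.lower(): value for name, value in STRING_DEF.findall(bib_text)}

def resolve(token, macros):
    """Return the macro's value, or the bare token itself if it is undefined."""
    return macros.get(token.lower(), token)

# Made-up definition, only to show the two cases.
macros = collect_macros('@string{KAUFMANN = "Morgan Kaufmann"}')
print(resolve("KAUFMANN", macros))          # defined macro: its value
print(resolve("KAUFMANN-ADDRESS", macros))  # undefined: falls back to the name
```

With the definitions missing, every lookup takes the second path, which is why the publisher ends up as the literal text KAUFMANN.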

rpgoldman commented 6 years ago

Ah! I didn't know to look for this note. Indeed there is one:

Import errors found:

    line 496: found "(", expected ")"
    line 1954: found "L", expected "{"
    line 10365: found "", expected "}"

The material around 496 is as follows:

@book(Warren:78,
    author =    {Beatrice Warren},
    year =      {1978},
    publisher = {Acta Universitatis Gothoburgen},
    title =     "Semantic Patterns of Noun-Noun Compound",
    series =    {Gothenburg Studies in English},
    volume =    41  
)

@techreport(Woods:78,  <--- 496
    author =    {William A. Woods},
    address =   {Cambridge, Mass.},
    year =      {January 1978},
    institution =   {Bolt Beranek and Newman},
    title =     {Research in Natural Language Understanding: 
             Quarterly Technical Progress Report No. 1,1}
)

Any chance BBT doesn't like the "naked" 41 in the preceding entry?

At line 1954, BBT seems to not like a comment:

@Comment Len Schubert

That @Comment prefix is what emacs puts in when you do comment-region, and it seems to be both correct bibtex and is accepted by Emacs's parsebib.

The final error is at the end of the file, right after:

@Book{books/mk/Ginsberg93,
  author =       GINSBERG,
  title =        "Essentials of Artificial Intelligence",
  publisher =    KAUFMANN,
  year =         "1993",
  address =      KAUFMANN-ADDRESS,
  topic =        "AI-intro;AI-text;",
  ISBN =         "1-55860-221-6"
}

Hope that helps. If you would like, I could probably simply upload the whole file.

retorquere commented 6 years ago

Yep, that's helpful. How many entries are we talking about in total?

retorquere commented 6 years ago

@njbart, do you know what the expected behavior of braceless @comments is? The biblatex manual makes no mention of @comment, and Tame the BeaST says

The main use of such an entry type is to comment a large part of the bibliography easily, since anything outside an entry is already a comment, and commenting out one entry may be achieved by just removing its initial @.

In what sense could it be used to comment out "a large part"?

retorquere commented 6 years ago

Or do you know, @rpgoldman ?

rpgoldman commented 6 years ago

ai.bib.txt

Comments

I'm not sure about @comment, because bibtex is so vague about its syntax. My understanding is that you are allowed to put any old garbage into a bib file, and bibtex is supposed to just skip anything that isn't a recognized entry of the form

@keyword brace citekey field* _matchingbrace

I put in the @comment prefixes because Emacs's parsebib ( https://github.com/joostkremers/parsebib ) choked when I expected it to treat something like

Papers by Judea Pearl

as a comment.

I have heard conflicting accounts about whether the hashmark is a comment character. TBH, I don't know how to comment out a big block of bibtex.

My ai.bib file

grep and wc indicate I have approximately 1000 entries in my bib file. I will attach it to this issue (renamed to ai.bib.txt because github doesn't know about .bib files).
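The exact grep and wc invocation isn't given here; a rough Python equivalent, assuming an entry is anything that opens with `@type{` or `@type(` at the start of a line and is not @string, @comment or @preamble, might look like this. The `count_entries` name and the tiny sample are illustrative only.

```python
import re

# Entry-like openers: "@type{" or "@type(" at the start of a line.
ENTRY_START = re.compile(r'^\s*@(\w+)\s*[{(]', re.MULTILINE)
NON_ENTRIES = {"string", "comment", "preamble"}

def count_entries(bib_text):
    """Count @type{ / @type( openers, skipping @string, @comment and @preamble."""
    return sum(
        1
        for m in ENTRY_START.finditer(bib_text)
        if m.group(1).lower() not in NON_ENTRIES
    )

# Cut-down sample; the @string value is a placeholder.
sample = '''@string{KAUFMANN = "placeholder"}
@Book{books/mk/Ginsberg93,
  year = "1993"
}
@techreport(Woods:78,
  year = {January 1978}
)
'''
print(count_entries(sample))
```

Like grep, this counts openers rather than successfully parsed entries, so it is only an approximation on a file with malformed references.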

retorquere commented 6 years ago

If parsebib chokes on text outside references unless it's prefixed by @comment, it's clearly not parsing bibtex properly. Tame the BeaST clearly states:

anything outside an entry is already a comment, and commenting out one entry may be achieved by just removing its initial @.

On reading that, I think the "big block of text" probably refers to the possibility of doing

@comment{

@misc{...}

@letter{...}

}

but that still doesn't tell me what the expected behavior of braceless @comments is supposed to be.

Hash marks (#) are just more random text outside the reference so they are ipso facto part of the comment, not a start of a comment.

Anyhow, BBT will merrily parse anything that's outside a reference. The only thing I don't know right now is what to make of the braceless @comment. Is it meant to be an until-end-of-line comment? Something else?
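For illustration only, here is what the until-end-of-line reading would amount to as a preprocessing step. This is a Python sketch under that assumption, not a statement of what bibtex, biber or BBT actually do, and `blank_braceless_comments` is a hypothetical name.

```python
import re

# Assumption: "@comment" NOT followed by "{" or "(" comments out the rest of its line.
BRACELESS_COMMENT = re.compile(
    r'^[ \t]*@comment\b(?![ \t]*[{(]).*$',
    re.IGNORECASE | re.MULTILINE,
)

def blank_braceless_comments(bib_text):
    """Empty out braceless @comment lines, keeping line numbers stable."""
    return BRACELESS_COMMENT.sub("", bib_text)

sample = (
    "@Comment Len Schubert\n"
    "@comment{left for the real parser}\n"
    "@misc{key, year = 1978}\n"
)
print(blank_braceless_comments(sample))
```

Blanking the line instead of deleting it keeps reported error line numbers pointing at the same places in the original file.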

I have the 41 issue solved, it wasn't the 41 but the tab character behind it that I wasn't handling properly. The other problems are somewhat likely to be fallout from not handling the bare @comment, so as soon as I know what to do with that, I can move on to test those.
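If the tab was indeed the culprit, the fix boils down to accepting any whitespace, tabs included, between a bare number and whatever ends the field. A minimal Python sketch of that idea, with a hypothetical `read_bare_number` helper rather than BBT's real tokenizer:

```python
import re

# A bare (unquoted, unbraced) number, then any whitespace including tabs,
# with a "," or closing delimiter required right after it.
BARE_NUMBER = re.compile(r'(\d+)\s*(?=[,)}])')

def read_bare_number(text, pos):
    """Return (value, next_pos) for a bare number at pos, or None if there is none."""
    m = BARE_NUMBER.match(text, pos)
    return (m.group(1), m.end()) if m else None

# The tail of "volume = 41" followed by a tab, a newline, then the closing ")".
field_tail = "41\t\n)"
print(read_bare_number(field_tail, 0))
```

The lookahead leaves the closing delimiter unconsumed, so the caller can still see which character ended the entry.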

rpgoldman commented 6 years ago

The braceless comment I assumed was "comment till end of line", because that's what emacs's bibtex mode put in for me when I selected a region and did comment-region. I have replaced my braceless comments with comments that do have braces.

I also removed the stray tab character. After that I still see two import failures:

Import errors found:

    line 496: found "(", expected ")"
    line 10365: found "", expected "}"

I believe that this means that the tab character was not to blame. Putting quotes around the 41 gets us past that error, leaving only the one on line 10365. I'm uploading a new copy of the fixed bibliography.

Interestingly, while I get only one error, it turns out that a ton of the bib entries are lost. I note the following:

I am not certain, but it seems like maybe BBT doesn't like % characters. Unfortunately, these should be acceptable in comments and are definitely acceptable in URLs.

I removed them all... and now BBT just complains that my file is ill-formed!

ai.bib.txt

retorquere commented 6 years ago

I am sure the tab character was what tripped up the parser. I wouldn't worry about that right now, I'm just feeding the original bib file through the parser to eliminate errors one by one. You can hold off until I have those handled.

retorquere commented 6 years ago

There's one entry in that bib file that's going to be very hard to deal with:

  @TECHREPORT(Thiebaux93,
  AUTHOR = {Sylvie Thi\{'}ebaux and Joachim Hertzberg
      and William Shoaff and Moti Schneider},
  TITLE = {A Stochastic Model of Actions and Plans
     for Anytime Planning Under Uncertainty},
  INSTITUTION = {ICSI},
  YEAR = {1993},
  NUMBER = {TR--93--027},
  MONTH = {May},
)

The author field has an error; the } after \{' closes the author field and it all goes south from there.

I am honestly a little surprised (and not in a good way) that JabRef would export this without warning. Doesn't JabRef do basic checking on field contents?

retorquere commented 6 years ago

Same goes for

@Book{BCMNS2003,
  title =   "The Description Logic Handbook --- Theory,
         Implementation and Applications",
  URL =     "http://titles.cambridge.org/catalogue.asp?ISBN=0521781760",
  added-by =    "msteiner",
  added-at =    "Thu Feb 5 17:23:38 2004",
  editor =  "Franz Baader and Diego Calvanese and Deborah
         McGuinness and Daniele Nardi and Peter
         Patel-Schneider",
  offline = "ISBN: 0521781760",
  abstract =    "Description Logics are a family of knowledge
         representation languages that have been studied
         extensively in Artificial Intelligence over the last
         two decades. They are embodied in several
         knowledge-based systems and are used to develop various
         real-life applications. The Description Logic Handbook
         provides a thorough account of the subject, covering
         all aspects of research in this field, namely: theory,
         implementation, and applications. Its appeal will be
         broad, ranging from more theoretically-oriented
         readers, to those with more practically-oriented
         interests who need a sound and modern understanding of
         knowledge representation systems based on Description
         Logics. The chapters are written by some of the most
         prominent researchers in the field, introducing the
         basic technical material before taking the reader to
         the current state of the subject, and including
         comprehensive guides to the literature. In sum, the
         book will serve as a unique reference for the subject,
         and can also be used for self-study or in conjunction
         with Knowledge Representation and Artificial
         Intelligence courses.",
  publisher =   "Cambridge University Press",
  year =    "2003",
  annote =  "Contents: 1. An introduction to description logics D.
         Nardi and R. J. Brachman; Part I. Theory: 2. Basic
         description logics F. Baader and W. Nutt; 3. Complexity
         of reasoning F. M. Donini; 4. Relationships with other
         formalisms U. Sattler, D. Calvanese and R. Molitor; 5.
         Expressive description logics D. Calvanese and G. De
         Giacomo; 6. Extensions to description logics F. Baader,
         R. K{\"u}sters and F. Wolter; Part II. Implementation:
         7. From description logic provers to knowledge
         representation systems D. L. McGuinness and P. F.
         Patel-Schneider; 8. Description logics systems R.
         M{\"o}ller and V. Haarslev; 9. Implementation and
         optimisation techniques I. Horrocks; Part III.
         Applications: 10. Conceptual modeling with description
         logics A. Borgida and R. J. Brachman; 11. Software
         engineering C. Welty; 12. Configuration D. L.
         McGuinness; 13. Medical informatics A. Rector; 14.
         Digital libraries and web-based information systems I.
         Horrocks, D. L. McGuinness and C. Welty; 15. Natural
         language processing E. Franconi; 16. Description logics
         for data bases A. Borgida, M. Lenzerini and R. Rosati;
         Appendix. Description logic terminology F. Baader;
         Bibliography. See also
         \cite{ href="http://www.inf.unibz.it/%7efranconi/dl/course/">http://www.inf.unibz.it/~franconi/dl/course/}",
}

where the quote character in href="http://www closes the field and the parser gets confused from there. But these two really are just malformed bibtex, and that makes the file as such malformed.

All the rest I can parse now.

rpgoldman commented 6 years ago

Thanks! I'll check all those accents to make sure they are correct. I commented out that \cite{} oddity (came from CSBibs entry) earlier. I'll confirm when it's all working.

retorquere commented 6 years ago

Not yet -- I have fixed the parser, which is going through its tests now. When that passes, I'll build a new BBT that has the parser and it will be posted here.

rpgoldman commented 6 years ago

This almost works for me. Now I'm finding repeated crashes where Zotero (?) or BBT (?) doesn't like ill-formed URLs in the url field.

retorquere commented 6 years ago

Again -- hold off, a new version will be out tomorrow which deals with these issues. The first order of business was to get the input to parse.

retorquere commented 6 years ago

Crashes, though? As in Zotero goes down in flames?

rpgoldman commented 6 years ago

sorry -- not "goes down in flames," but "fails to import anything at all" instead of just throwing away or annotating the bad URLs.

blip-bloop commented 6 years ago

:robot: this is your friendly neighborhood build bot announcing test build 5325 ("test cases for #873").

retorquere commented 6 years ago

Right, give 5325 a go.

njbart commented 6 years ago

Patashnik’s “BibTeXing” (http://mirrors.ctan.org/biblio/bibtex/base/btxdoc.pdf) says:

For Scribe compatibility, the database files allow an @COMMENT command; it’s not really needed because BibTEX allows in the database files any comment that’s not within an entry. If you want to comment out an entry, simply remove the ‘@’ character preceding the entry type.

(No idea what Scribe is/was …)

In http://artis.imag.fr/~Xavier.Decoret/resources/xdkbibtex/bibtex_summary.html, there’s an interesting section on “Comments” which claims that

@comment{
@misc{...}
@letter{...}
}

does not work (haven’t tested this myself). As to a braceless @comment, my feeling is that without any accompanying begin/end tags it can hardly apply to more than the line it appears in – but, honestly, I don’t really know.

Ultimately, the only valid test is studying the source code and/or the behaviour of bibtex and biber (the programs).

As https://github.com/aclements/biblib puts it:

There are a lot of BibTeX parsers out there. Most of them are complete nonsense based on some imaginary grammar made up by the module's author that is almost, but not quite, entirely unlike BibTeX's actual grammar. BibTeX has a grammar. It's even pretty simple, though it's probably not what you think it is. The hardest part of BibTeX's grammar is that it's only written down in one place: the BibTeX source code.

So I guess it would be best if someone ran a few tests through bibtex and biber.

blip-bloop commented 6 years ago

:robot: this is your friendly neighborhood build bot announcing test build 5326 ("large tests on nightly").

retorquere commented 6 years ago

Scribe is most likely https://en.wikipedia.org/wiki/Scribe_(markup_language)

If

@comment{
@misc{...}
@letter{...}
}

doesn't work then I have no idea what TTB means with "commenting out large blocks of text".

In any case, BBT's job is to be lenient, so currently it parses most of the original ai.bib, save for the two really broken references (which I'm pretty miffed JabRef didn't flag).

rpgoldman commented 6 years ago

I didn't want to get into it, but I have definitely had problems trying to use @comment{ ... } to comment out blocks. Sufficient problems that I gave up using it. Unfortunately, I didn't write down what failed me...

rpgoldman commented 6 years ago

This still fails for me. I get an error on the attached version of ai.bib, and nothing is imported. Looking in the debug log, it looks like something might be trying to parse URLs and errors out on a bad one:

(1)(+0001298): { "type": "unknown_uri" "entry": "bai-fri-mci-icaps07" "field_name": "url" "value": "bai-fri-mci-icaps07.pdf" "line": 9736 }

(2)(+0000000): Translate: Translation using Better BibTeX failed: type => unknown_uri entry => bai-fri-mci-icaps07 field_name => url value => bai-fri-mci-icaps07.pdf line => 9736 string => [object Object] url => /Users/rpg/refs/ai.bib downloadAssociatedFiles => true automaticSnapshots => true

(5)(+0000000): Translate: Running handler 0 for error

(1)(+0000002): { "type": "unknown_uri" "entry": "bai-fri-mci-icaps07" "field_name": "url" "value": "bai-fri-mci-icaps07.pdf" "line": 9736 }

(3)(+0000000): Alert: An error occurred while trying to import the selected file. Please ensure that the file is valid and try again.

ai.bib.txt

When I remove that ill-formed URL, everything seems to be well. AFAICT all the entries seem to be successfully parsed.

It seems actually reasonable that Zotero might be more stringent about URLs than Bibtex, which really just has to move them from input to output. I suspect this one crept in because I was (mis)using the URL field as if it was a "file" field. Thanks for all of your help!
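As a sketch of the more lenient behaviour asked for earlier (annotate or drop the bad URL instead of failing the whole import), something along these lines would do. It is written in Python with a hypothetical `check_url` helper, and it assumes "well-formed" simply means having both a scheme and a host; it is not how BBT handles the url field.

```python
from urllib.parse import urlparse

def check_url(entry_key, value, warnings):
    """Return value if it looks like an absolute URL; otherwise record a warning and return None."""
    parsed = urlparse(value)
    if parsed.scheme and parsed.netloc:
        return value
    # Assumption: a URL without scheme or host is dropped and reported, not fatal.
    warnings.append(f"{entry_key}: dropped url {value!r} (no scheme or host)")
    return None

warnings = []
print(check_url("bai-fri-mci-icaps07", "bai-fri-mci-icaps07.pdf", warnings))
print(check_url(
    "BCMNS2003",
    "http://titles.cambridge.org/catalogue.asp?ISBN=0521781760",
    warnings,
))
print(warnings)
```

The collected warnings could then go into the same kind of import note that already lists the parse errors, so the rest of the file still imports.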

rpgoldman commented 6 years ago

I'm not sure why this issue was auto-reopened.

retorquere commented 6 years ago

blip-bloop reopens any issue that wasn't closed by me; I like to keep issues open so I have a reminder for wrap-up work.

It won't be zotero that's complaining about the url, that would be bbt, which it shouldn't, and even then it should import all other references. The weird thing is that I've added the original ai.bib with the two fixes in the test set and that imports them all (985 I think). So there must be some difference between your situation and mine that I'm not handling properly. I'll take a look at the later ai.bib you posted.

retorquere commented 6 years ago

You should not get these uri errors with the new parser, I've disabled them. What version did you import this with?

rpgoldman commented 6 years ago

I got those results with 5.0.73

retorquere commented 6 years ago

You have to try with 5325 or 5326 posted in this thread. 5.0.73 doesn't have these latest fixes yet.

rpgoldman commented 6 years ago

Tested this with the 5326 build (I didn't fully understand how the build bot worked), and it seems fine, thanks.

blip-bloop commented 6 years ago

:robot: this is your friendly neighborhood build bot announcing test build 5351 ("cleanup").

retorquere commented 6 years ago

5.0.74 has the changes.
