inveniosoftware / invenio

Invenio digital library framework
https://invenio.readthedocs.io
MIT License

RFC Generic BibTeX export #662

Closed kaplun closed 8 years ago

kaplun commented 10 years ago

Originally on 2011-06-06

Currently BibTeX export is generated out of the box using a single output format that calls a single format element, namely bfe_bibtex.py. This implies that the business logic of the BibTeX export is hard-coded in one place and cannot easily be adapted to different document types. This has an impact, e.g. in CDS, when exporting Thesis documents.

One possible solution would be to have a generic bfe_bibtex_field.py, based on bfe_bibtex.format_bibtex_field, able to format certain subfields, and to provide demo BibTeX exports that are adapted to each specific document type.

kaplun commented 10 years ago

Originally on 2011-06-06

Some more on the subject (from Savannah) [...] Request came from a user (Yngve Inntjore Levinsen@cern.ch):

I have a couple of proposals to the bibtex formatting you provide. First off, I would like to propose a new entry "url", that equals the CDS url (or if you think it is more correct, directly to the ps/pdf document). Most bibtex readers support this tag, so that you get a link to the web page and/or document in this url.

The second I would like to propose is that instead of putting the oai: entry in the title, you create a new tag called oai. This would then eventually work in similar manner as the doi tag already does, and not clutter up the title.

You could also consider to use the tag "abstract" that many bibtex readers already know of. This is not necessary when you want to use the bibtex to reference in a paper, but it is nice when you have your own bibtex library of the documents that are relevant to you. Myself I have hundreds of papers in my bibtex file already, and use it for storing all articles that might be of interest to me now or in the future.

In order to explain what I mean, I attach a bibtex proposal to this article: http://cdsweb.cern.ch/record/1299163

@article{Gschwendtner:1299163, 
   author = "Gschwendtner, E and Apyan, A and Elsener, K and Sailer, A and Uythoven, J and  Appleby, R B and Salt, M and Ferrari, A and Ziemann, V", 
   title = "The CLIC Post-Collision Line.", 
   number = "EuCARD-CON-2010-030", 
   year = "2010", 
   url = "http://cdsweb.cern.ch/record/1299163", 
   oai = "cds.cern.ch:1299163", 
   abstract = "The 1.5 TeV CLIC beams, with a total power of 14 MW per beam, are disrupted at the interaction point due to the very strong beam-beam effect. As a result, some 3.5 MW reach the main dump in form of beamstrahlung photons. About 0.5 MW of e+e- pairs with a very broad energy spectrum need to be disposed of along the post-collision line. The conceptual design of this beam line will be presented. Emphasis will be on the optimization studies of the CLIC post-collision line design with respect to the energy deposition in windows, dumps and absorbers, on the design of the luminosity monitoring for a fast feedback to the beam steering and on the background conditions for the luminosity monitoring equipment." 
}

[...]

kaplun commented 10 years ago

Originally on 2011-06-06

And more (from Savannah) [...] It seems that in these records the institution, address and number are missing:

http://cdsweb.cern.ch/record/1168490/export/hx
http://cdsweb.cern.ch/record/1168013/export/hx
http://cdsweb.cern.ch/record/1168012/export/hx
http://cdsweb.cern.ch/record/1168011/export/hx

[...] In particular it nicely exports data for: http://cdsweb.cern.ch/record/1113527 but not for: http://cdsweb.cern.ch/record/1168490 [...]

invenio-developers commented 10 years ago

Originally by arwagner on 2011-06-06

Other fields come to mind, especially when leaving "only journal articles". Years ago I did a quite extensive mapping from PICA format (a library catalogue) to BibTeX which I could contribute upon request. Some samples:

@BOOK{613652282,
  title = {{R}adiowave propagation: physics and applications},
  publisher = {Wiley},
  year = {2010},
  author = {Levis, Curt A. AND Johnson, Joel T. AND Teixeira, Fernando L.},
  pages = {XII, 301 S.},
  address = {Hoboken, NJ},
  isbn = {9780470542958},
  rvk = {ZN 3240 L666},
  comment = {Ill., graph. Darst.},
  ddc = {621.384/11},
  keywords = {ELT, P4, Radio / Radio wave propagation / },
  language = {eng},
  loc = {TK6565.A6},
  timestamp = {2010.11.24},
  url = {http://www.gbv.de/dms/ilmenau/toc/613652282.PDF}
}

Notice the fields sisbn (10-digit ISBN, comes in handy if one wants to fetch covers ;), ddc, loc and keywords, which can be populated from the 082 and 6xx categories. For library materials one might probably also export the shelfmark (above in the rvk field, a common German system). The URL might also point to a scanned TOC.

Also note that it might be very sensible to escape capitals in {}.

Finally, a record generated from arXiv:

@ARTICLE{Aubert-2005a,
  author = {Aubert, B. and others},
  title = {{S}earch for lepton flavor violation in the decay $\tau \to \mu \gamma$},
  journal = {Physical Review Letters},
  year = {2005},
  volume = {95},
  pages = {041802},
  abstract = {A search for the nonconservation of lepton flavor number in the decay
    tau±-->µ±gamma has been performed using 2.07×108 e+e--->tau+tau-
    events produced at a center-of-mass energy near 10.58 GeV with the
    BABAR detector at the PEP-II storage ring. We find no evidence for
    a signal and set an upper limit on the branching ratio of [script
    B](tau±-->µ±gamma)<6.8×10-8 at 90% confidence level.},
  collaboration = {BABAR},
  doi = {10.1103/PhysRevLett.95.041802},
  eprint = {hep-ex/0502032},
  file = {Aubert_2005ye-eprint.pdf:Aubert_2005ye-eprint.pdf:PDF;hep-ex0502032.ps.gz:/scratch/arwagner/papers/Flavourviolation/hep-ex0502032.ps.gz:PDF},
  slaccitation = {%%CITATIONHEP-EX 0502032;%%}
}

Note also the fields for eprint, doi and slaccitation. file contains linkages to locally stored full text files (JabRef syntax).

In case, I could give additional input on the issue :)

kaplun commented 10 years ago

Originally on 2012-02-26

Moreover, this is critical in real CDS use cases. If a researcher compiles a document bibliography by taking BibTeX from CDS, they currently obtain very poor export data. If they then don't enrich the bibliography by hand, this poor metadata will be included in their document, making it more difficult for automatic mining tools, cataloguers and other researchers to retrieve the intended document.

Take e.g.: http://cdsweb.cern.ch/record/1390408

Its BibTeX export currently doesn't mention any report number (such as LHCb-PROC-2011-060), apart from what happens to be in the note field, which will probably be ignored by most styles, nor does it mention that the document has been presented at a conference.

@article{Callot:1390408,
      author       = "Callot, O",
      title        = "LHCb : From the detector to the first physics results",
      month        = "Oct",
      year         = "2011",
      note         = "Linked to talk LHCb-TALK-2011-176",
}

jeromecaffaro commented 10 years ago

Originally on 2012-03-29

Replying to [comment:4 skaplun]:

@article{Callot:1390408,
      author       = "Callot, O",
      title        = "LHCb : From the detector to the first physics results",
      month        = "Oct",
      year         = "2011",
      note         = "Linked to talk LHCb-TALK-2011-176",
}

A similar request came today. What could be the preferred output?

A.

@inproceedings{Callot:1390408,
  author       = "Callot, O",
  title        = "LHCb : From the detector to the first physics results",
  month        = "Oct",
  year         = "2011",
  note         = "Linked to talk LHCb-TALK-2011-176",
  crossref     = {1378086},
}
@proceedings{1378086,
  title        = "oai:cds.cern.ch:1378086. HEP-MAD 11, 5th High-Energy
                  Physics Conference in Madagascar",
  booktitle    = "oai:cds.cern.ch:1378086. HEP-MAD 11, 5th High-Energy
                  Physics Conference in Madagascar",
  year         = "2011",
  month        = "Aug"
}

B.

@inproceedings{Callot:1390408,
  author       = "Callot, O",
  title        = "LHCb : From the detector to the first physics results",
  booktitle    = "oai:cds.cern.ch:1378086. HEP-MAD 11, 5th High-Energy
                  Physics Conference in Madagascar",
  month        = "Oct",
  year         = "2011",
  note         = "Linked to talk LHCb-TALK-2011-176",
}

C.

@inproceedings{Callot:1390408,
  author       = "Callot, O",
  title        = "LHCb : From the detector to the first physics results",
  month        = "Oct",
  year         = "2011",
  note         = "Linked to talk LHCb-TALK-2011-176",
  howpublished = "oai:cds.cern.ch:1378086. HEP-MAD 11, 5th High-Energy
                  Physics Conference in Madagascar, Aug 2011",
}

Some quick comments (leaving out HOW to implement the above):

A. What if a bibliography is generated out of a search query or a basket? The @proceedings entry might be repeated unnecessarily several times in the output, which might be an issue.

B. What if the month and year of the conference differ from those of the contribution? Should these also be included in booktitle?

C. Same as for B. Is it semantically more or less correct?

kaplun commented 10 years ago

Originally on 2012-03-29

Well, according to http://en.wikipedia.org/wiki/BibTeX#Cross-referencing this should be the way (i.e. the first proposal). We might simply solve it with some pythonic hack :-) where we introduce a closure or something similar that would not display the same @proceedings entry twice per request :-) (I can imagine a hackish format element that is aware of the request object and stores there the list of already outputted proceedings.)

Otherwise this hack can be generalized in an extension of bibformat where we would allow for a sort of singleton format (one that can't be outputted more than once in a request).
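As a rough illustration of the "output each @proceedings only once" idea, here is a minimal Python sketch. It is not bibformat code: the function name `export_with_crossrefs` and the record layout (dicts with an `entry` string and an optional `proceedings` dict) are assumptions made up for this example. It also places the parent entries after all children, since classic BibTeX expects a cross-referenced entry to come after the entries that refer to it.

```python
def export_with_crossrefs(records):
    """Join child entries, then each distinct @proceedings parent exactly once.

    `records` is a hypothetical list of dicts:
      {"entry": "<bibtex text>", "proceedings": {"key": ..., "entry": ...}}
    where "proceedings" is optional.
    """
    children = []
    parents = {}  # key -> entry text; dicts keep insertion order
    for record in records:
        children.append(record["entry"])
        parent = record.get("proceedings")
        if parent is not None:
            # setdefault keeps the first occurrence and ignores repeats
            parents.setdefault(parent["key"], parent["entry"])
    return "\n".join(children + list(parents.values()))


conference = {"key": "1378086", "entry": "@proceedings{1378086,\n}"}
records = [
    {"entry": "@inproceedings{Callot:1390408,\n  crossref = {1378086},\n}",
     "proceedings": conference},
    {"entry": "@inproceedings{Other:1,\n  crossref = {1378086},\n}",
     "proceedings": conference},
]
output = export_with_crossrefs(records)
```

Collecting the parents per call, rather than storing them on a request object, keeps the sketch free of global state; a request-aware format element would need some equivalent per-request storage.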

jirikuncar commented 9 years ago

@bouzlibop is it related to the work you have done for Zenodo BibTeX formats?

bouzlibop commented 9 years ago

I've done some work in regards to BibTeX, which is now working in Zenodo. See this.

Nonetheless, I'm not sure it is sufficient for the things discussed here. Some limitations of the above-mentioned BibTeX formatter are, e.g., that it cannot distinguish between phdthesis and mastersthesis, and that its philosophy is to try to match against a few defined entry types and, when that fails, to choose misc by default.

Our goal there was simply to replace the old bfe_bibtex, as it wasn't working for us: each record had the same entry type.

kaplun commented 9 years ago

Seems to me that Zenodo, CDS and INSPIRE all went in different directions in their support for BibTeX. Of course this depends very much on each having a different data model and focusing on slightly different document types.

At this stage, what shall we do? Is there some help coming from Jinja2 so that we can nicely create a mapping from JSONAlchemy to BibTeX (with Jinja2 helping in all the escaping business)?

Other proposals?

aw-bib commented 9 years ago

We could look into the join2 implementation of this (long on my list); probably some generic ideas can be used.

lnielsen commented 9 years ago

Just my few cents: as mentioned by others, the BibTeX export is highly dependent on the data model and on each particular record (e.g. if a journal article is missing some information it might not classify as @article but as @misc instead).

What is common, I think, is that all need serialisation of an abstract BibTeX data model. So it would be helpful if Invenio provided tools to easily build an abstract BibTeX tree, and then serialise that tree into BibTeX.
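To make the "abstract tree, then serialise" idea concrete, here is a minimal Python sketch. Nothing in it is existing Invenio API: `BibEntry`, `escape_value` and `serialize` are hypothetical names, and the escaping covers only three characters and assumes plain, not yet escaped, input.

```python
from dataclasses import dataclass, field

# Deliberately incomplete escape table; a real serialiser would also need
# to handle accents, capital protection, math mode and so on.
_SPECIALS = {"&": r"\&", "%": r"\%", "#": r"\#"}


@dataclass
class BibEntry:
    """Instance-independent representation of one BibTeX entry."""
    entry_type: str
    key: str
    fields: dict = field(default_factory=dict)


def escape_value(value):
    """Escape a few LaTeX specials in a plain (unescaped) value."""
    return "".join(_SPECIALS.get(char, char) for char in str(value))


def serialize(entry):
    """Render a BibEntry as BibTeX text, skipping empty fields."""
    lines = ["@%s{%s," % (entry.entry_type, entry.key)]
    for name, value in entry.fields.items():
        if value in (None, ""):
            continue
        lines.append("  %s = {%s}," % (name, escape_value(value)))
    lines.append("}")
    return "\n".join(lines)


entry = BibEntry("article", "Callot:1390408",
                 {"author": "Callot, O", "title": "R&D", "year": "2011", "note": ""})
text = serialize(entry)
```

With this split, each instance would only have to build `BibEntry` objects from its own data model, and the brace and escaping details would live in one place.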

aw-bib commented 9 years ago

@lnielsen: I'm not sure that you should classify a journal article due to missing fields as @misc. I'd definitely treat it as @article with missing fields.

kaplun commented 9 years ago

@aw-bib what Lars means is that how we map to BibTeX is highly dependent on each Invenio instance. So the best we can do, indeed, is to have each Invenio instance map its metadata to an abstract pythonic BibTeX representation, and then delegate the actual serialization into BibTeX output to some specialized BibTeX library (taking care of all brackets, accents, etc.).

lnielsen commented 9 years ago

@kaplun Yep, that's correct.

@aw-bib: I just meant to point out a generic non-trivial issue with BibTeX generation, using @article as an example. E.g. the BibTeX entry type @article has a list of required fields (author, title, journal, year, volume) and optional fields (number, pages, month, note, key). See http://en.wikipedia.org/wiki/BibTeX under "Entry types". If you create an @article entry without, say, volume (a required field) and you run it through BibTeX, it will spit out an error/warning, i.e. the generated BibTeX is not fully valid. I had users politely "complaining" about this ;-). @article might be a bad example, but the same goes for all the other entry types: they have required and optional fields, and if you are missing a required field in the BibTeX output, then it is not valid output and BibTeX will complain.
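A check for missing required fields could be sketched as below. The table is an assumption for illustration: the @article row simply copies the list quoted in this comment (other references may classify volume differently), and only two entry types are filled in. `REQUIRED_FIELDS` and `missing_required` are hypothetical names.

```python
# Required fields per entry type. The "article" row follows the list
# quoted in the comment above; extend or adjust per your reference.
REQUIRED_FIELDS = {
    "article": ("author", "title", "journal", "year", "volume"),
    "misc": (),
}


def missing_required(entry_type, fields):
    """Return the required field names that are absent or empty."""
    required = REQUIRED_FIELDS.get(entry_type.lower(), ())
    return [name for name in required if not fields.get(name)]


callot = {"author": "Callot, O",
          "title": "LHCb : From the detector to the first physics results",
          "year": "2011"}
gaps = missing_required("article", callot)
```

Whether a non-empty result should trigger a fallback to another entry type or only a warning to the submitter is a per-instance policy decision; the check itself is the same either way.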

aw-bib commented 9 years ago

@lnielsen: Understood. And I agree: bibtex spits out a warning. But formatting a journal article without volume as @misc would rid you of the warning and give you a broken bibliography. So, if the record in question is an @article it should be @article regardless of whether the record contains all mandatory fields. Cataloguing should suggest/encourage/ensure/enforce getting all fields during ingestion, but even if they're not there you should get a warning that they are not there, and not just work through without notification. (Probably, this is a philosophical question, but I don't think so.)

BTW: we handle it in join2 by modelling our document types by means of 3367_ where we store our document type (we have more than bibtex or EndNote, Citavi, Zotero, [what have you] can handle) and also add the proper mapping there (type is one of our authority thingy). This looks like:

3367_ $0PUB:(DE-HGF)16
      $2PUB:(DE-HGF)
      $aJournal Article
3367_ $00
      $2EndNote
      $aJournal Article
3367_ $2DRIVER
      $aarticle
3367_ $2BibTeX
      $aARTICLE

And then you're at a quite generic level of mapping if your Marc is right.
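Reading the export type from such a record could look roughly like this in Python. The representation of the repeated 3367_ field as a list of subfield dicts and the name `bibtex_type_from_3367` are assumptions for this sketch, not join2 or Invenio code.

```python
def bibtex_type_from_3367(field_instances, default="misc"):
    """Pick the BibTeX entry type stored in the 3367_ instance with $2 == 'BibTeX'.

    `field_instances` is a hypothetical list of dicts mapping subfield
    codes to values, one dict per 3367_ occurrence.
    """
    for subfields in field_instances:
        if subfields.get("2") == "BibTeX":
            return subfields.get("a", default).lower()
    return default


doc_types = [
    {"0": "PUB:(DE-HGF)16", "2": "PUB:(DE-HGF)", "a": "Journal Article"},
    {"0": "0", "2": "EndNote", "a": "Journal Article"},
    {"2": "DRIVER", "a": "article"},
    {"2": "BibTeX", "a": "ARTICLE"},
]
entry_type = bibtex_type_from_3367(doc_types)
```

The exporter then never guesses: the type chosen at submission is looked up, and the fallback is only used when no BibTeX mapping was stored at all.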

lnielsen commented 9 years ago

@aw-bib: Again, @article might be a bad example in this context. Take some of the other entry types and you have the same issue. Nonetheless, I've still had several end-users (researchers) who use the generated BibTeX to format their bibliography complain about exactly this issue.

Cataloguing should suggest/encourage/ensure/enforce to get all fields during ingestion, but even if they're not there you should get a warning that they are not there and not just work through without notification. (Probably, this is a philosophical question, but I don't think so.)

You should have as much metadata as possible and check and fix problems in it, agreed in principle. However, it can be very expensive in terms of cataloguing/developer manpower and depends on use cases. E.g. Zenodo primarily relies on end-users providing the metadata.

I think your point underlines the fact that BibTeX formatting is highly dependent on the service it is running on, which is why I suggested providing the toolbox plus very simple defaults, to 1) make it as easy as possible to have BibTeX export in Invenio and 2) not try to solve every possible problem that can happen when you map an instance's data model to BibTeX.

aw-bib commented 9 years ago

@lnielsen I see your point, but I think you fix it at the wrong place. IMHO it's just not working to derive document types from the existing data correctly.

I look at it the other way around. You get a user submission. (This is the same here. No cataloguers no developers. Joe Doe is submitting.) Now, if I as a user select "journal article" as document type, I, as a user, would expect that the record in question is formatted as @article in bibtex and not as @misc simply cause I did not add the volume yet. (Probably, it's "ahead of print" stuff.) Sure, bibtex will issue a warning (no error, btw.) but well, this is to be expected and it is even ok. I get a journal reference with a missing field. But better than getting no journal reference at all as in @misc...

IMHO this holds for all other document types as well.

You should have as much metadata as possible and check and fix problems in it, agreed in principle. However, it can be a very expensive in terms of cataloging/developer man-power and depends on use cases.

I do not think about manual curation after submission.

Usually, you have to rely on what the user submits. Perfectly agree. Also agreed that this might be crap. Still, I would expect the submission interface to tell me as the user which fields are necessary for the chosen document type to be a complete reference, i.e. that an article submission signifies that you have to add the journal. So make it bold and red or whatever.

It's another question whether it just marks the field or whether you're enforcing its input, and that is completely unrelated to whether you have manpower for curation down the road. Depending on your installation you might choose either one. Again perfectly agreed. At join2 actually we have both: for internal records we signify that journal is a "must have" for articles but we don't enforce it. For records that get some upstream formal curation and end up in the bean counting of our institutions you have to add the journal, and we don't accept the record for submission unless you do so. Still, both records would get formatted as @article, as the user chose them to be journal articles. I.e. the bioprocessor on layer 8 told us to do so.

In your sense bibtex export in the first case would be broken because it issues a warning. I take this as a feature: it tells me, probably much later, that I forgot to add the journal. Formatting as @misc will not notify me, and I'll have to scan and compare my bibliography manually and check in each case whether it shouldn't have a journal entry.

I think this very same logic holds for all document types, as all that the document type signifies is the list of mandatory fields. Mandatory in the sense of the ability to produce a correct citation while correct itself is defined as "this item is retrievable and the display follows conventions for this". Ok, one could drop all the text clutter and just cite the DOI right away, would save a lot of space, perfectly agreed. Just doesn't yet follow the conventions.

Thus I think the bibtex export like all other bibliographic exports is not so dependent on your instance or use case.

lnielsen commented 9 years ago

Now, if I as a user select "journal article" as document type, I, as a user, would expect that the record in question is formatted as @article in bibtex and not as @misc simply cause I did not add the volume yet.

I can only say that my users disagree with you. We continuously had support requests on this issue until we fixed it and, conversely, we haven't had any support requests complaining about using @misc instead of @article after we fixed it. I naturally can't say if we'll get any requests in the future.

Thus I think the bibtex export like all other bibliographic exports is not so dependent on your instance or use case.

I think we can agree to disagree on the issue then :-)

aw-bib commented 9 years ago

yepp, I fear we agree that we disagree.

Concerning your requests, I think I wouldn't file one if you format a journal article as misc, simply because "this is so far off, this export just doesn't work." :(

tiborsimko commented 9 years ago

Sure, bibtex will issue a warning (no error, btw.)

Perhaps those users who use a permissive tool chain see a warning and don't complain, while those users who use a restrictive tool chain see an error and complain :smile:

I'm mentioning this because you can find several sources on the web saying that volume is optional, not mandatory, for @article. This contradicts the wikipedia entry cited by @lnielsen. And, as far as I'm concerned, it makes more sense to me for volume to be optional, as this nicely covers the "submitted-to" paper scenario, where the volume may not be known until the paper is published several months later.

For additional insight, let me quote from the "BibTeXing" 1988 guide by Oren Patashnik himself:

(You can find this guide somewhere under /usr/share/doc/texlive-doc/bibtex/base/btxdoc.pdf in your TeX installation.)

I therefore agree with @aw-bib that volume is really an optional field that does not justify switching @article type to @misc in our BibTeX output. (While missing journal would be another matter.)

Anyway, this discussion just shows that each site may want to configure their BibTeX output to follow the habits of their user community and perhaps their prevalent tool chain. Invenio-wise, we can come up with good defaults in our demo site, following standards as closely as possible, all the while offering easily configurable/extendible hooks that let admins tweak the output further if they wish. (E.g. consider where to store URLs and abstracts and stuff that the standard does not mention. We may follow majority usage in our subject domain, but this won't necessarily suit everybody.)
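One way to picture "good defaults plus configurable hooks" for the non-standard fields is the Python sketch below. The names `DEFAULT_HOOKS` and `extra_fields`, and the flat record dict, are assumptions for illustration and not an existing Invenio mechanism.

```python
# Default hooks for fields the BibTeX standard does not mention.
# Each hook takes a record (here a plain dict) and returns a value or None.
DEFAULT_HOOKS = {
    "url": lambda record: record.get("url"),
    "abstract": lambda record: record.get("abstract"),
}


def extra_fields(record, site_hooks=None):
    """Compute extra BibTeX fields, letting a site override the defaults.

    A site can add a hook, replace one, or disable one by mapping
    its field name to None.
    """
    hooks = dict(DEFAULT_HOOKS)
    hooks.update(site_hooks or {})
    fields = {}
    for name, hook in hooks.items():
        if hook is None:
            continue  # field disabled by the site configuration
        value = hook(record)
        if value:
            fields[name] = value
    return fields


record = {"url": "http://cdsweb.cern.ch/record/1299163",
          "abstract": "The 1.5 TeV CLIC beams ...",
          "oai": "cds.cern.ch:1299163"}
defaults = extra_fields(record)
customised = extra_fields(record, {"abstract": None,
                                   "oai": lambda r: r.get("oai")})
```

Here a site that dislikes abstracts in its export and wants a separate oai field changes a small dictionary instead of patching the formatter.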

aw-bib commented 9 years ago

@tiborsimko you may be right about permissiveness of the toolchain. But I think this is sort of second order, as it is only a technical issue.

For the "submitted-to" scenario people actually enter e.g. "-" in case we enforce the field (e.g. http://juser.fz-juelich.de/record/188781). I admit that I don't think this is perfect and I'd rather have a better solution, but it's simple and works surprisingly well.

Anyway, I think the main difference on handling exports here is that in join2 repos we prefer to trust the user and use the type supplied upon submission all the time in favour of trying to be clever.

Many of our records actually have more than one type. Thesis + Report is quite obvious, Thesis + Book as well, but also e.g. proceedings published in a journal. This is intended, as otherwise we'd not have some 35 document types but way more to cover all those natural double types. Guessing gets a bit wild here. From this comes the procedure that if a Thesis + Book is submitted as a thesis we export it as @phdthesis, while if it is submitted as a book we export it as @book. It's in the eye of the beholder what her preference is. E.g. a philosopher would more likely talk about "her first book" than a physicist, who would more likely call it his PhD (never heard one talking about "her book" there). We believe that by trusting the submitter's choice we are closest to the eventual user community.

lnielsen commented 9 years ago

I therefore agree with @aw-bib that volume is really an optional field that does not justify switching @article type to @misc in our BibTeX output. (While missing journal would be another matter.)

I've mentioned above that @article is a bad example because I think it sidetracks the conversation. What we are talking about is the existence of required vs optional fields in BibTeX, and that missing required fields in the output produce errors.

Concerning your requests, I think I wouldn't file one if you format a journal article as misc, simply because "this is so far off, this export just doesn't work." :(

And why wouldn't my other users say the same about the reverse? "The service can't generate proper BibTeX output which doesn't generate errors. This is so far off that I just don't want to bother." (I.e. I don't agree with your argument.)

aw-bib commented 9 years ago

I see your point about the errors, @lnielsen, if a missing volume were to throw an error and break compilation. Usually it's just a warning, so it's not that bad. Why wouldn't I file issues? I'd have to check all doctypes in the export, add missing fields and cross check; most likely in this case I'd be faster doing it myself than exporting first and fixing it up. (IOW: you'd lose me very fast on this. Too fast to file issues. But that's just a guess.)

But as I say: the main issue is not a document type or whatever but who defines the document type for export.

We use the user's input decision; your procedure prefers to use existing metadata and guess what it could be. We prefer to warn the user if input is missing; you prefer not to, and get clean compilation on output.

Both are valid procedures in a sense.

If I look at what you do for submission to Zenodo I see why you do the guessing. Your type Publication is almost everything (interesting that you split off poster and presentation, BTW), so your submission just doesn't ask the user what it is, and thus you don't even stand a chance to tell which fields should be filled in. (Actually, I still believe that this is the only reason to have document types at all.) So, at Zenodo AFAIS I can fill in all journal fields, all book fields, all thesis fields and all conference fields for one single publication, or I can also leave everything blank even for the mentioned proceedings paper published in a journal. So indeed you'll end up with a lot of @misc, I guess. E.g. 16697 says "book section" and instead of @inbook it gets @misc. 16696 says journal article and again gets @misc, while 16694 says journal article and gets @article and so on. This would indeed be (unusably) inconsistent in our world, which has to produce the scientific report at the end of the day with a proper bibliography. Zenodo doesn't want to do that, however. (BTW: I think if it displays journal article as a badge it should export @article, or display misc on the badge as well, to be consistent. At the moment two different guesses get different results. Probably a ticket for Zenodo?)

So, I admit, given these extremes you're right: you can't do it via a general export, and every instance will forever end up needing programmers to adapt source code.

Eothred commented 9 years ago

Cool to see that my reported issue is still being worked on several years later, I had no idea :)

Thought I'd point out that perhaps you may find some inspiration/ideas/solutions also by looking at how desktop applications such as BibDesk and KBibtex deal with similar issues.

jirikuncar commented 9 years ago

Proposal

Step 1

Step 2

Step 3

http://nwalsh.com/tex/texhelp/bibtx-7.html

kaplun commented 8 years ago

BibTeX export is data-model specific, hence solved in each service overlay.

See e.g. an implementation from INSPIRE: