mohnkhan / google-refine

Support BibTeX import into records #195

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Ed Laurent (Freebase expert) would like to use Google Refine to import BibTeX 
files and then load citation and scholarly-work data into Freebase. He 
currently uses EndNote, which can export to XML, BibTeX, and EndNote formats, 
among others. Unfortunately, his other tools support only BibTeX, hence this 
use case. (Note: Some academics also use Mendeley.com and the Zotero Firefox 
plugin, both of which support BibTeX and EndNote formats.) (Sub-note: 
Mendeley.com has an API, but it is no longer truly public; it requires free 
registration.)

Original issue reported on code.google.com by thadguidry on 11 Nov 2010 at 7:38

GoogleCodeExporter commented 8 years ago
I've checked a few possible parser candidates:

JabRef - GPL

javabib - GPL

j4bib - BSD license, no recent activity, 
http://sourceforge.net/projects/j4bib/files/

bibparse - no stated license (even in the source zip); the author's home page 
hasn't been updated since 2005, after fairly regular updates before that, so he 
may be retired or deceased, 
http://ftp.math.utah.edu/pub//bibparse/

I'll take a look at j4bib unless someone comes up with a better alternative.  
It's not a very complex format, so writing from scratch is an option as well.

Original comment by tfmorris on 11 Nov 2010 at 8:08

GoogleCodeExporter commented 8 years ago
Writing from scratch wouldn't be too hard at all; it's just a format of 
key = value pairs, especially as an importer probably wouldn't need to do 
validation.

Original comment by mcnamara.tim@gmail.com on 15 Nov 2010 at 2:10
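
To make the "key = value pairs" idea concrete, here is a minimal sketch in Python of a from-scratch reader for a single entry. It is an illustration only, not code from Google Refine: the function name `parse_bibtex_entry`, the output keys `entry_type` and `cite_key`, and the sample record are all invented for this example. It does no validation, ignores `@string` macros, `#` concatenation, and comments, and leaves any LaTeX markup in the values untouched.

```python
def parse_bibtex_entry(text):
    """Parse one BibTeX entry into a flat dict of strings (no validation)."""
    text = text.strip()
    if not text.startswith("@"):
        raise ValueError("not a BibTeX entry")
    open_brace = text.index("{")
    record = {"entry_type": text[1:open_brace].strip().lower()}
    # Everything between the first "{" and the last "}" is the entry body.
    body = text[open_brace + 1:text.rindex("}")]
    cite_key, _, rest = body.partition(",")
    record["cite_key"] = cite_key.strip()

    pos = 0
    while pos < len(rest):
        eq = rest.find("=", pos)
        if eq == -1:
            break
        name = rest[pos:eq].strip(" ,\n\t\r").lower()
        pos = eq + 1
        while pos < len(rest) and rest[pos].isspace():
            pos += 1
        if pos >= len(rest):
            break
        if rest[pos] == "{":
            # Brace-delimited value: track nesting depth to find the match.
            depth, start = 0, pos + 1
            while pos < len(rest):
                if rest[pos] == "{":
                    depth += 1
                elif rest[pos] == "}":
                    depth -= 1
                    if depth == 0:
                        break
                pos += 1
            value = rest[start:pos]
            pos += 1
        elif rest[pos] == '"':
            # Quote-delimited value (quotes nested in braces are not handled).
            end = rest.index('"', pos + 1)
            value = rest[pos + 1:end]
            pos = end + 1
        else:
            # Bare value such as a number: runs up to the next comma.
            end = rest.find(",", pos)
            if end == -1:
                end = len(rest)
            value = rest[pos:end].strip()
            pos = end
        record[name] = " ".join(value.split())  # collapse line wraps
    return record


# Invented sample record, for illustration only.
sample = """@book{doe2010,
  author = {Doe, Jane},
  title = {An {E}xample, with a comma},
  year = 2010,
  publisher = "Example Press"
}"""

for field, value in parse_bibtex_entry(sample).items():
    print(f"{field}: {value}")
```

Tracking brace depth is what lets a value contain commas, equals signs, or nested `{...}` groups without being cut short, which is the main thing a naive split on commas would get wrong.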

GoogleCodeExporter commented 8 years ago
I frequently use BibTeX, so I give this +1!

Original comment by wfz%nimb...@gtempaccount.com on 15 Nov 2010 at 9:17

GoogleCodeExporter commented 8 years ago
Attached a single BibTeX record from a Google Books export 
(http://books.google.com/books?id=d1tIAAAAYAAJ&pg=PR3#v=onepage&q&f=false) 
for checking the handling of diacritic characters once this feature is 
implemented.

Original comment by thadguidry on 19 Nov 2010 at 11:08

GoogleCodeExporter commented 8 years ago
I attached a more complicated record from Web of Science (first article for the 
query "google"). Note especially the multiple values in some fields. 

Google Refine would be great for address cleaning and similar tasks. Does it 
have an "address guesser"?

Original comment by jan.schu...@gmail.com on 28 Sep 2011 at 11:26

GoogleCodeExporter commented 8 years ago
Some additional possibilities for starting points:

bibtex2rdf - Apache 2.0 license, JavaCC grammar 
  http://sourceforge.net/projects/bibtex2rdf/

ANTLR grammar for BibTeX - no stated license 
  http://stackoverflow.com/questions/7583982/bibtex-grammar-for-antlr

MIT SIMILE bibtex-converter - MIT License, JavaCC grammar - doesn't attempt to 
interpret LaTeX
  http://code.google.com/p/simile-widgets/source/browse/babel/trunk/converters/bibtex-converter
  https://simile.mit.edu/repository/babel/trunk/converters/bibtex-converter/

j4bib (mentioned above) - BSD license, uses JLex and CUP
  https://downloads.sourceforge.net/project/j4bib/j4bib/j4bib-0.2/j4bib-src-0.2.tar.gz

I take back what I said last year about the format being simple.  On the 
surface it is, but because one can embed arbitrary LaTeX code, you'd need a 
full parser/renderer to faithfully parse everything.  Even for a basic level of 
support, you'd need to handle things like LaTeX character composition, e.g. 
{\'E}mile 

Original comment by tfmorris on 15 Oct 2011 at 5:31
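
As a rough illustration of the character composition case mentioned above, the following Python sketch turns accent commands such as `{\'E}` into precomposed Unicode characters by appending a combining mark and normalizing to NFC. It is an assumption-laden sketch rather than a proposed implementation: the names `COMBINING`, `ACCENT`, and `latex_to_unicode` are invented here, only five symbol accents on plain ASCII letters are covered, and letter-named commands (`\c`, `\v`), dotless `\i`, and any other LaTeX are left as they are.

```python
import re
import unicodedata

# LaTeX accent symbol -> Unicode combining mark (small, non-exhaustive subset).
COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # diaeresis
    "~": "\u0303",  # tilde
}

ACC = r"""(['`^"~])"""
LET = r"([A-Za-z])"

# Four spellings, tried in order: {\'{E}}, {\'E}, \'{E}, \'E
ACCENT = re.compile("|".join([
    r"\{\\" + ACC + r"\{" + LET + r"\}\}",
    r"\{\\" + ACC + LET + r"\}",
    r"\\" + ACC + r"\{" + LET + r"\}",
    r"\\" + ACC + LET,
]))


def _compose(match):
    # Exactly one alternative matched, so exactly two groups are non-empty.
    accent, letter = [g for g in match.groups() if g is not None]
    return unicodedata.normalize("NFC", letter + COMBINING[accent])


def latex_to_unicode(text):
    """Replace simple LaTeX accent commands with precomposed characters."""
    return ACCENT.sub(_compose, text)


print(latex_to_unicode(r"{\'E}mile"))  # Émile
```

Listing the brace variants explicitly, instead of making every brace optional, keeps the substitution from swallowing a closing brace that belongs to an enclosing group.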

GoogleCodeExporter commented 8 years ago
If the LaTeX handling is a problem, maybe a RIS importer could be used instead, 
since RIS does not allow LaTeX commands.

Almost all bibliographic databases can export RIS or BibTeX, and there are some 
BibTeX-to-RIS converters, which should help if you are stuck with BibTeX 
exports.

Original comment by jan.schu...@gmail.com on 13 Dec 2011 at 6:54

GoogleCodeExporter commented 8 years ago
Thanks for the suggestion.  The entity substitution issue that I mentioned as 
an example of LaTeX processing is actually pretty simple, so we'd probably do 
that first and see whether it covers the bulk of what people need.

RIS or EndNote XML would be other bibliographic data formats to consider 
supporting for import, but I'm not sure they'd replace BibTeX, since many 
BibTeX files are old hand-maintained bibliographies, not necessarily exports 
from a bibliographic web site or program.

Original comment by tfmorris on 13 Dec 2011 at 8:29

GoogleCodeExporter commented 8 years ago
The interesting part for bibliometricians is probably the name and address 
cleaning. Maybe even name disambiguation: is "Chen, C" of the first work in the 
list the same "Chen, C" as in the 1245th work? Or is "Meyer-Lüdenscheid, CW" 
the same as "Meyer Luedenscheid, C"? Unfortunately, in the end, this is manual 
work, so I'm not sure how Refine can help here. A string comparer that clusters 
names based on a string-distance function would be nice, as would a clustering 
algorithm based on the keywords, the words in titles, and the words in 
addresses (there are quite a few papers on author name disambiguation that use 
such methods), or on the results of a Google query (if there are similar 
authors and a Google query based on both titles returns some results, it is 
probably because of the author's webpage, which lists both works).

The name disambiguation part is probably interesting for others as well: 
merging two address databases, ...

Original comment by jan.schu...@gmail.com on 13 Dec 2011 at 9:51
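
The string-distance clustering idea above could be sketched roughly as follows in Python. This is not how Refine's clustering is implemented; the helper names `name_key` and `cluster_names`, the German-specific transliteration table, and the 0.9 threshold are all assumptions made for illustration. Names are reduced to a normalized "surname plus first initial" key and then grouped greedily by `difflib.SequenceMatcher` similarity.

```python
import unicodedata
from difflib import SequenceMatcher

# Assumed transliteration so that "ü" and "ue" compare equal (German only).
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def name_key(name):
    """Reduce 'Surname, Initials' to a lowercase 'surname initial' key."""
    surname, _, given = name.partition(",")
    surname = surname.lower().translate(UMLAUTS)
    # Decompose remaining accented letters and drop the combining marks.
    surname = unicodedata.normalize("NFKD", surname)
    surname = "".join(
        ch if ch.isalnum() else " "
        for ch in surname
        if not unicodedata.combining(ch)
    )
    initial = given.strip()[:1].lower()
    return (" ".join(surname.split()) + " " + initial).strip()


def cluster_names(names, threshold=0.9):
    """Greedily group names whose keys are at least `threshold` similar."""
    clusters = []  # list of (representative key, member names)
    for name in names:
        key = name_key(name)
        for representative, members in clusters:
            if SequenceMatcher(None, representative, key).ratio() >= threshold:
                members.append(name)
                break
        else:
            clusters.append((key, [name]))
    return [members for _, members in clusters]


names = ["Chen, C", "Meyer-Lüdenscheid, CW", "Meyer Luedenscheid, C", "Chen, C"]
for group in cluster_names(names):
    print(group)
```

A cluster here is only a list of candidates: two "Chen, C" records land together because the strings match, not because they are known to be the same person, so the final decision still needs a human or additional evidence such as titles or addresses.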

GoogleCodeExporter commented 8 years ago
We're getting off-topic (at least for this issue), so we should probably move 
the discussion to the mailing list/Google Groups, but Refine excels (so to 
speak) at precisely the kind of thing you're talking about -- allowing for and 
amplifying human judgments.

Facets based on author name clusters, edit distances, keywords, and a number of 
other things are possible.  Name cleanup of various kinds is one of the current 
major uses of Refine.

As I said, if you want to discuss bibliographic data use cases more, let's move 
it to the list/group.

Original comment by tfmorris on 13 Dec 2011 at 10:36

GoogleCodeExporter commented 8 years ago

Original comment by tfmorris on 14 Dec 2011 at 4:07