Closed dsanson closed 13 years ago
For bibtex, it looks like I might be able to implement this using the bibtex-ruby gem. I can also make it search not just for citekeys but for any keywords, so that @rabbit will suggest any articles containing the word 'rabbit' in any fields. The parsing is easy. I still struggle getting results in ruby back to vimscript.
Check the changes I've made on the multibibs branch: I'm supporting the dictionary format for completion.
I've rewritten Pandoc_bibkey in python, but that was just because I'm more used to it, and can be rewritten in ruby again if needed. (Actually, the key scanning was much cleaner in ruby. Damn python's lack of non-fixed regex look behinds!)
EDIT: I found a way to simplify the regexes. The python code should be much clearer now.
Grrr! You are pythonifying everything. I'm much more comfortable in ruby. Of course, you're also making everything better, so I can't really complain ;-)
Sorry for that! ;) But you were rubyfying everything first!
BTW: what do you think is best for sorting the completion items: sorting by key or by title?
I think I've tested the new features enough. If there are no comments, I'll merge the multibibs branch into master.
Not working for me. When I try to complete a key, vim hangs for several seconds, then dumps a load of error messages.
As for sorting: I would support sorting by citekey, not title. Typically when I want to enter a citation, I know the author for certain, know the title more or less, and may or may not remember the year (and my citekeys are authoryearxx). And if we sort by title, then we have to worry about whether or not to ignore leading articles like 'the' and 'an' and the like.
OK about the sorting part.
That error seems to be a problem with pybtex. What version do you have? I have 0.15 here.
I pushed some changes, so the plugin will resort to the fallback procedure if pybtex fails. Can you check if it solves the problem?
pybtex 0.15. I just installed it via pip, and haven't tried to use it aside from this. So I'll test that and get back to you.
As for the changes: yes, it no longer throws a bunch of errors. But it still hangs (for about 7 seconds) before it gives up and provides the fallback completion. And it does this every time. (By contrast, when I didn't have pybtex installed at all, on the previous version, it immediately completed via the regex.)
Turns out I had a problem with my bibtex file: an unquoted journal title. So now it works, but it is slow. 7 seconds before the matches pop up. To be fair, I have a little over 1000 items in my database, but this is too slow to be useable.
I noticed that you are putting every matched bibfile into b:pandoc_bibfiles. And you are putting every bibfile in the working directory in there too. I don't think this is the right way to go. For one thing, it will only slow things down more (though my 7 seconds is after manually setting b:pandoc_bibfiles to only include the one file). But also, it gives me too little control. If I have a bib file with the same file name in my directory, that's probably the file I want. If I don't have a bibfile with the same file name in the document directory, but I do have some other bibfile in that directory, that's probably the file I want. And if I've put a different bibfile at .pandoc/default.bib, that probably means I want to use it instead of the one in my texmf folder.
I tried completions just against the sample bibtex file from the link above, and they didn't all work. I've gisted a modified version of that file here. It looks to me that the problem is that you aren't matching citekeys that start with an uppercase letter, like
Zurek:1993
Primes
The others seem to work.
Also, can we get 'menu': '
But these details may not matter if there isn't a way to speed things up.
I removed the pybtex dependent code and modified the procedure. Are you testing over those changes? There was a bug where the procedure only matched entries where the Title tag was uppercase. So
@Article{Bonjour,
Title = {In defense of pure reason},
...
}
would be matched, but not
@Article{Bonjour,
title = {...},
...
}
That is probably the reason 'menu' didn't show up in the completion. Currently, it is the title.
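A minimal sketch of the case-insensitive tag match that avoids the bug described above. The names `TITLE_TAG` and `has_title` are hypothetical and are not taken from the plugin's code.

```python
import re

# Match a "title = ..." tag at the start of a line, whatever its capitalisation
# ("Title", "title", "TITLE"). "booktitle" is not matched, because the tag name
# must come right after the leading whitespace.
TITLE_TAG = re.compile(r"^\s*title\s*=", re.IGNORECASE | re.MULTILINE)


def has_title(entry_text):
    """Return True if the entry text contains a title tag in any case."""
    return TITLE_TAG.search(entry_text) is not None
```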
I didn't modify much the code that detects the bibliographies, so it doesn't stop once it has found a suitable bibliography in a certain path. I'll check it out so it behaves better. I think you're right, except that if the user has several bibfiles in the working directory, we can assume that he wants to use them all for the current document.
except that if the user has several bibfiles in the working directory, we can assume that he wants to use them all for the current document.
I don't think so. Consider something like
papers
on_what_there_is.markdown
on_what_there_is.bib
two_dogmas.markdown
two_dogmas.bib
It seems pretty clear which bibs go with which files. But if none of the bibfiles match the filename, e.g.,
something.markdown
anotherthing.markdown
epistemology.bib
metaphysics.bib
misc.bib
then you are right: they should all be used. I'm not sure what to think, though, about someone who happens to have a json file in the same directory, but it isn't a bibfile....
Hm... I think the procedure should match:
1) any *.bib, *.mods, *.json, *.ris files named after the current file. If successful, stop.
2) any *.bib, *.mods, *.json, *.ris in the current folder. If successful, stop.
3) any file named default.{bib,ris,mods,json} in the local pandoc data folder. If successful, stop.
4) any bibliography file in texmf.
We should give the option to exclude some files if wanted. For example, I would forbid vim-pandoc to search for bibliographies in texmf.
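The lookup order proposed here could be sketched roughly as follows. This is only an illustration: `find_bibfiles`, its parameters, and excluding files by name are assumptions, and the texmf step simply walks whatever directories the caller passes in rather than asking TeX where they are.

```python
from pathlib import Path

BIB_EXTENSIONS = (".bib", ".mods", ".json", ".ris")


def find_bibfiles(document, data_dir=None, texmf_dirs=(), exclude=()):
    """Return bibliography files for `document`, stopping at the first step that finds any."""
    document = Path(document)
    excluded = set(exclude)  # hypothetical: file names the user never wants picked up

    def keep(paths):
        return sorted(p for p in paths if p.is_file() and p.name not in excluded)

    # 1) files named after the current document (foo.markdown -> foo.bib, foo.ris, ...)
    found = keep(document.with_suffix(ext) for ext in BIB_EXTENSIONS)
    if found:
        return found

    # 2) any bibliography file in the document's folder
    found = keep(p for p in document.parent.iterdir() if p.suffix in BIB_EXTENSIONS)
    if found:
        return found

    # 3) default.{bib,mods,json,ris} in the local pandoc data folder
    if data_dir is not None:
        found = keep(Path(data_dir) / ("default" + ext) for ext in BIB_EXTENSIONS)
        if found:
            return found

    # 4) any .bib file under the given texmf directories (pass none to skip this step)
    return keep(p for d in texmf_dirs for p in Path(d).rglob("*.bib"))
```

Leaving `texmf_dirs` empty gives the "never search texmf" behaviour, and a stray .json file in the folder is still picked up in step 2, which is the limitation discussed in this thread.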
About the JSON issue: there is no quick way to determine whether a .json file is a bibliography, really. We could create some parser to determine if it is structured as a bibliography, but that seems to be overkill. Besides, what use case do you imagine where someone has .json files in the current folder for something else than this?
Okay. I guess I was a commit behind. I checked out the latest code in the multibibs branch, and pybtex is now gone.
Still no titles on my 1000+ entry bibtex file. But it works using a file that just contains two entries copied from that bibtex file. I'll see if I can isolate the problem.
I didn't modify much the code that detects the bibliographies, so it doesn't stop once it has found a suitable bibliography in a certain path.
Right. The old code was written so the last detected bibliography would override all the others.
Okay. This was obvious enough. My bibtex file didn't meet the test:
if len(scanned_titles) == len(scanned_labels):
When I commented that bit out, everything worked great.
Why are you testing for that? Not every bibtex entry needs to have a title. Some have a booktitle instead. But some might have no title at all. Note that a similar issue arises for authors: not every entry will have an author. Some will have an editor, but some will have neither. This is why it would be so much easier if we could use pybtex or citeproc-hs to do the heavy lifting for us....
So in my perfect world, the 'menu' portion of the completion would return
Name, Title
where Name is the last name of the first author, or, if that doesn't exist, last name of the first editor, or, if that doesn't exist, no name is returned; and Title is the title (perhaps the first n words of the title for some n), or, if that doesn't exist, is Booktitle, or, if that doesn't exist, is the year; or, if that doesn't exist, is empty.
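As a rough sketch, that fallback chain might look like this in Python, assuming each entry has already been parsed into a dict of lowercase field names. `menu_label` and `max_words` are hypothetical names, and the name handling is deliberately naive: it only copes with "Last, First" and "First Last" forms joined by " and ".

```python
def menu_label(entry, max_words=8):
    """Build a 'Name, Title' string with the fallbacks described above."""

    def first_last_name(field):
        names = entry.get(field, "")
        if not names:
            return ""
        first = names.split(" and ")[0].strip()
        if "," in first:
            # "Last, First" form
            return first.split(",")[0].strip()
        parts = first.split()
        # "First Last" form: take the final word (wrong for e.g. "van Fraassen")
        return parts[-1] if parts else ""

    # author, else editor, else nothing
    name = first_last_name("author") or first_last_name("editor")
    # title, else booktitle, else year, else empty
    title = entry.get("title") or entry.get("booktitle") or entry.get("year") or ""
    title = " ".join(title.split()[:max_words])
    return ", ".join(part for part in (name, title) if part)
```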
I'm testing for that because otherwise the titles will be misaligned. In the ideal world we could depend on pybtex, but it's choking on your system.
Hm... I think the procedure should match:
1) any *.bib, *.mods, *.json, *.ris files named after the current file. If successful, stop.
2) any *.bib, *.mods, *.json, *.ris in the current folder. If successful, stop.
3) any file named default.{bib,ris,mods,json} in the local pandoc data folder. If successful, stop.
4) any bibliography file in texmf.
We should give the option to exclude some files if wanted. For example, I would forbid vim-pandoc to search for bibliographies in texmf.
Sounds fine to me.
About the JSON issue: there is no quick way to determine whether a .json file is a bibliography, really.
Agreed. I don't think we should try to do this.
Besides, what use case do you imagine where someone has .json files in the current folder for something else than this?
Well, JSON can be used for lots of things. I use Jekyll, and that means I have markdown files sharing folders with YAML files that are used to configure how jekyll behaves. It doesn't seem a stretch that someone might have a JSON file in the same folder as a markdown file. (Perhaps it is a bit more of a stretch to imagine this in the case of a working draft of an academic paper, but I do use citation completion sometimes for webpages too.)
Also, pandoc can output JSON. So someone might be working on foo.markdown and have a pandoc-generated json copy at foo.json, if they had some target that took advantage of pandoc's json output.
Not that I can see anything we can do about this.
In the ideal world we could depend on pybtex, but it's choking on your system.
It's not choking anymore, just slow. Is it fast on your system? Have you tested it against a large bibtex file?
I'm testing for that because otherwise the titles will be misaligned.
I see. I hadn't looked closely at how you were doing it. I don't think it can be done this way, because we can't expect every entry to have a title.
I don't have large bibtex files around, sadly (part of the problem?)
I don't think it can be done this way, because we can't expect every entry to have a title.
In that case, we can't expect any of the more powerful completions to be reliable as they are now, and we should drop them.
My bibtex file is available here if you want to play around with it. I just tested it with the minimal version (sanson-min.bib) after reverting to 9a3a999. It takes maybe 5 seconds to complete a citation here on a two year old MacBook Pro running Lion.
If you want a real monster, you could try this.
Here is a rudimentary ruby script that returns vim dictionary style results using bibtex-ruby. It is not blazing fast either--maybe 2 or 3 seconds.
I suppose we could offer powerful completions based upon pybtex along with an option for turning them off.
But what about MODS, RIS and JSON files? The implementations we have are naive.
While we research a way to handle this, I have dropped the complex completions, so we can merge the multibibs branch without bringing those issues into master.
That seems right: multibibs support is clearly distinct from making the completion function smarter.
I created a "smartbibs" branch that is on commit 9a3a999 --- the last commit before you removed pybtex. This is the one that works for me, but is slow. We'll have to merge in upstream changes eventually, if we decide to use it.
I think there's no need of keeping that branch separate, since we have the history of changes. It's likely that there will be changes in the completion code anyway, so merging that old code with whatever we have when we go back to this probably won't be smooth.
More thoughts about this.
bibtool is very fast, and can extract a set of bibtex entries based upon a regex, e.g.,
bibtool -X "geach" big.bib -o small.bib
or, if you just want to search the citekeys,
bibtool -- 'select{$key "geach"}' big.bib -o small.bib
or to search selected fields,
bibtool -- 'select{title booktitle author editor $key "geach"}' big.bib -o small.bib
It has a bunch of other options that allow several input files, control sorting, detect duplicates, etc.
So we could use bibtool to get a small.bib file that contains exactly the entries we want to offer for completion, and then use pybtex or bibtex-ruby to parse that file for key, author, title.
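A hedged sketch of driving bibtool from Python, reusing the select expression shown above. It assumes bibtool is on the PATH and prints the selected entries to standard output when -o is omitted. `bibtool_command` and `bibtool_select` are hypothetical helper names, and the query is inserted as-is, so quotes or braces in it would need escaping.

```python
import subprocess


def bibtool_command(query, bibfile, fields=("title", "booktitle", "author", "editor", "$key")):
    """Build the argument list for: bibtool -- 'select{<fields> "<query>"}' <bibfile>"""
    selector = 'select{%s "%s"}' % (" ".join(fields), query)
    return ["bibtool", "--", selector, bibfile]


def bibtool_select(query, bibfile):
    """Run bibtool and return the matching entries as one bibtex string."""
    result = subprocess.run(
        bibtool_command(query, bibfile),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if bibtool exits with an error
    )
    return result.stdout
```

The returned text is then small enough to hand to whichever parser is used for key, author and title.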
Parsing bibtex properly with regexes requires recursive matching of paired brackets. But a much cruder strategy is to just look for lines that start with "@", and assume that everything between "@"s is a single entry. This isn't quite right (there can be stuff between bibtex entries that bibtex is supposed to ignore), but it might be close enough. Once we have an array of chunks of text between "@"s that match a given regex, we ought to be able to use regexes to search for citekey, title, author, booktitle, and editor within those chunks.
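The crude strategy can be sketched in a few lines of Python. This is an illustration under the same simplifying assumption (everything between two line-initial "@"s is one entry), not the code from the plugin. `crude_entries` is a hypothetical name, and the title regex only handles titles that fit on a single line.

```python
import re

ENTRY_START = re.compile(r"^@", re.MULTILINE)
# "Article{Bonjour," -> "Bonjour"; @string/@comment blocks have no "key," and are skipped
KEY = re.compile(r"^\w+\s*\{\s*([^,\s]+)\s*,")
TITLE = re.compile(r'^\s*title\s*=\s*[{"](.*)[}"],?\s*$', re.IGNORECASE | re.MULTILINE)


def crude_entries(text, query=None):
    """Split on line-initial '@' and pull citekey and title out of each chunk."""
    entries = []
    for chunk in ENTRY_START.split(text)[1:]:  # [1:] drops anything before the first '@'
        if query and not re.search(query, chunk, re.IGNORECASE):
            continue
        key = KEY.match(chunk)
        if key is None:
            continue
        title = TITLE.search(chunk)
        entries.append({"key": key.group(1), "title": title.group(1) if title else ""})
    return entries
```

Because the title is looked up inside each chunk, an entry without a title simply gets an empty string instead of shifting every later title out of alignment.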
I'm in favor of the sloppy regex solution. I've tested the approach (https://gist.github.com/1203698), and it is much faster than what we had: for sanson.bib, parsing the file takes around 0.08 seconds. For philosophy.bib, it takes around 1.1 seconds.
I think we should only retrieve titles for the value of menu. First, because the fewer regex searches we make, the faster it goes. The procedure I have (see the gist) takes ~0.01 seconds to traverse "sanson.bib" when the query is "lew" (which gives 68 results). Second, because if the ids are formatted in authoryear format, author info is redundant. Third, because I think that editor information is (sadly) never something one needs to know.
Neat!
For comparison purposes, here is a test of the bibtool approach, providing author (or editor), title (or booktitle) (and doing some work to clean up titles):
https://gist.github.com/1203906
On my system, running this inside of time on 'lew' gets me:
real 0m0.423s
user 0m0.338s
sys 0m0.071s
While running yours on 'lew' gets me:
real 0m0.084s
user 0m0.059s
sys 0m0.020s
So yes, the sloppy method is faster. But the bibtool method is pretty fast too.
Oops. Just realized that your version parses the bibfile twice. So those numbers above are about twice what they should be.
bibtool's contribution to the time:
real 0m0.077s
user 0m0.071s
sys 0m0.005s
In fact, if I put an exit command in the ruby script right after the require 'bibtex' line, I get
real 0m0.272s
user 0m0.199s
sys 0m0.057s
So almost all of the time taken by the script is taken up loading the gem.
if the ids are formatted in authoryear format, author info is redundant.
True. But if not?
Third, because I think that editor information is (sadly) never something one needs to know.
Editor matters when you are citing a collection (rather than something in a collection), or a specific edition of a classic. In the first case, presumably if your cite keys are author:year, the citekey will be editor:year. The second case isn't something we'd be supporting anyway, since in that case, editor would be trumped by the author.
These choices make little difference to total time on the bibtool/bibtex-ruby or bibtool/pybtex approach, but I can see that they matter on the regex approach.
One thing I like about the regex approach here: no external dependencies.
More data points. (Note that the name for my test script is bibvim, and I've modified it from the gist so that it just outputs the number of entries, rather than a string representation of them.)
Running against sanson.bib:
$ time bibvim lew
78
real 0m0.410s
user 0m0.337s
sys 0m0.070s
Note that I am getting 78 hits. Presumably that's because I'm searching for citekey, author, title, editor, booktitle. I played around with these options to bibtool, but it made no appreciable difference to the total time taken by the script.
Running against philosophy.bib
$ time bibvim lew
107
real 0m0.716s
user 0m0.634s
sys 0m0.078s
My sense is that this approach will scale well (though really, do we need to worry about bibfiles any bigger than this?) The main performance penalty is loading the gem. After that, things are quite speedy.
Actually, my script runs the procedure thrice. The correct output of time for sanson.bib is:
real 0.06
user 0.05
sys 0.01
and
real 0.10
user 0.08
sys 0.02
for Philosophy.bib
Fast indeed. Am I right that you are only searching for matches in citekeys?
I'm convinced that we should go your way. If shortcomings arise, we can always revisit the bibtool/bibtex-ruby or bibtool/pybtex solution.
Looks like a similar process should work fine for RIS (split by /^ER -/) and MODS (split by /<mods>/). Not sure about JSON.
Yes, I am only searching for matches in citekeys.
I'm working on searching for matches in other tags too. I'm trying to plug bibtool into the regex mini parser now.
I'm sure a similar process can work for RIS (I made a parser for it yesterday night). For MODS, I would prefer to use a proper XML parser. For JSON we should use a parser too; python's is very fast.
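For RIS, the idea of cutting the file at each ER line could be sketched as below. This is not the parser mentioned above. `parse_ris` is a hypothetical name, and it assumes the usual RIS layout of a two-character tag, two spaces, a hyphen and a space. Which tag carries the citekey (ID here) depends on the exporting program.

```python
import re

# e.g. "TI  - Some title" -> ("TI", "Some title"); "ER  - " closes a record
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text):
    """Return one dict per record, mapping each tag to the list of its values."""
    records, current = [], {}
    for line in text.splitlines():
        match = RIS_LINE.match(line)
        if match is None:
            continue  # blank lines and continuation lines are ignored in this sketch
        tag, value = match.group(1), match.group(2).strip()
        if tag == "ER":
            if current:
                records.append(current)
            current = {}
        else:
            # tags such as AU can repeat, so every tag maps to a list
            current.setdefault(tag, []).append(value)
    return records
```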
I plugged the regex procedure with bibtool: https://gist.github.com/1204775
For sanson.bib:
real 0.20
user 0.17
sys 0.02
For Philosophy.bib:
real 0.70
user 0.66
sys 0.04
This is searching for the query on $key, title, booktitle, author and editor.
You might want to check this out: http://www.youtube.com/watch?v=0ux6koT-U_U (preferably in 720 and full screen).
That is pretty sweet, sir. I am impressed.
I pushed the new completions code into new-completions, and deleted the smartbibs branch. The changes include some file reorganization (the methods that handle bibliographic suggestions are now in autoload/pandocbib.vim). It needs some cleaning (there's some code duplication), but it works. I have experimental support for using bibtool too.
Good stuff.
With g:pandoc_use_bibtool set, I get errors. For example, working against sanson.bib, @lew<C-X><C-O> gets me:
Error detected while processing function pandoc#Pandoc_Complete..pandocbib#Pando
cBibSuggestions:
line 30:
Traceback (most recent call last):
Error detected while processing function pandoc#Pandoc_Complete..pandocbib#Pando
cBibSuggestions:
line 30:
File "<string>", line 20, in <module>
Error detected while processing function pandoc#Pandoc_Complete..pandocbib#Pando
cBibSuggestions:
line 30:
File "<string>", line 89, in pandoc_get_bibtool_suggestions
Error detected while processing function pandoc#Pandoc_Complete..pandocbib#Pando
cBibSuggestions:
line 30:
IndexError: no such group
Error detected while processing function pandoc#Pandoc_Complete:
line 22:
E706: Variable type mismatch for: suggestions
The same occurs for @l, @le ... @lewis. But @lewis1 suddenly works. Likewise, @h ... @hinc get the error, but @hinch works...
I confirm. For the bibtool code I mostly copied what I had in the standalone script, so it is essentially a stub.
I just fixed that issue on commit 69c5e77f5d7730f35651c05c9ae6602ef265dfed.
I'm having a problem where it won't retrieve some names and titles. What's weird about it is that the code for that is the same as the one I'm using in the non-bibtool-based method, which works fine.
By the way, I am coming to agree that we shouldn't show the author in the 'menu' part. I like it when it is just the last name of the first author, but when it becomes the full name of multiple authors, it's too much. And I don't think you should be trying to parse bibtex name fields to find the last name of the first author (ugh).
Do you have examples of cases in which names and titles aren't working for you?
In sanson.bib, for example, "cohen2005" retrieves neither the name nor the title. There are many examples, even if you just try to complete after the @.
And yes, names are complicated for those reasons.
The common thread is that these are all cases where the author or the title, once processed by bibtool, has a linebreak within it. For example,
bibtool -- 'select{title booktitle author editor $key "cohen"}' ~/.pandoc/default.bib
Gets you (trimming away all the Bibdesk crap):
@Book{ cohen2005,
author = {Cohen, S. Marc and Curd, Patricia and Reeve, C. D. C. and
Cohen, S Marc},
date-added = {2008-02-12 23:20:43 -0500},
date-modified = {2010-11-03 17:19:28 -0400},
edition = {3rd},
isbn = {0872207692},
pages = {958},
publisher = {Hackett Publishing},
title = {Readings in Ancient Greek Philosophy: From Thales to
Aristotle},
year = {2005},
bdsk-url-1 = {http://books.google.com/books?id=XVHj_gwk39QC}
}
Here both title and author contain a linebreak. copleston1953 has an author but no title. Testing it, we see that
bibtool -- 'select{title booktitle author editor $key "copleston1953"}' ~/.pandoc/default.bib
(again trimming the crap):
@Book{ copleston1953,
address = {New York},
author = {Copleston, Frederick},
date-added = {2007-11-28 20:49:38 -0500},
date-modified = {2010-11-03 17:19:28 -0400},
number = {1},
publisher = {Newman},
series = {A History of Philosophy},
title = {Late Mediaeval and Renaissance Philosophy: Ockham to the
Speculative Mystics},
volume = {3},
year = {1953},
bdsk-file-1 = {YnBsaXN0MDDUAQIDBAUIJidUJHRvcFgkb2JqZWN0c1gkdmVyc2lvblkkYXJjaGl2ZXLRBgdUcm9vdIABqAkKFRYXGyIjVSRudWxs0wsMDQ4RFFpOUy5vYmplY3RzV05TLmtleXNWJGNsYXNzog8QgASABqISE4ACgAOAB1lhbGlhc0RhdGFccmVsYXRpdmVQYXRo0hgNGRpXTlMuZGF0YU8RAaAAAAAAAaAAAgAACU1hY2ludG9zaAAAAAAAAAAAAAAAAAAAAAAAAMeJGgxIKwAAADKayRJjb3BsZXN0b24xOTUzLmh0bWwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB2r/wqRp2gAAAAAAAAAAAAEAAgAACSAAAAAAAAAAAAAAAAAAAAADYmliAAAQAAgAAMeJYFwAAAARAAgAAMKkohoAAAABABQAMprJADKavgAFujEABbokAACRoQACAENNYWNpbnRvc2g6VXNlcnM6AGRhdmlkOgBEb2N1bWVudHM6AERyb3Bib3g6AGJpYjoAY29wbGVzdG9uMTk1My5odG1sAAAOACYAEgBjAG8AcABsAGUAcwB0AG8AbgAxADkANQAzAC4AaAB0AG0AbAAPABQACQBNAGEAYwBpAG4AdABvAHMAaAASADRVc2Vycy9kYXZpZC9Eb2N1bWVudHMvRHJvcGJveC9iaWIvY29wbGVzdG9uMTk1My5odG1sABMAAS8AABUAAgAM//8AAIAF0hwdHh9YJGNsYXNzZXNaJGNsYXNzbmFtZaMfICFdTlNNdXRhYmxlRGF0YVZOU0RhdGFYTlNPYmplY3RfEBJjb3BsZXN0b24xOTUzLmh0bWzSHB0kJaIlIVxOU0RpY3Rpb25hcnkSAAGGoF8QD05TS2V5ZWRBcmNoaXZlcgAIABEAFgAfACgAMgA1ADoAPABFAEsAUgBdAGUAbABvAHEAcwB2AHgAegB8AIYAkwCYAKACRAJGAksCVAJfAmMCcQJ4AoEClgKbAp4CqwKwAAAAAAAAAgEAAAAAAAAAKAAAAAAAAAAAAAAAAAAAAsI=}
,
bdsk-url-1 = {http://books.google.com/books?id=m3ItKgAACAAJ}
}
So I'm pretty sure it's the line breaks that are causing you trouble. You could match the whole title and remove the linebreaks. Or you could just match to the end of the line. This would in effect give us a quick and easy way to truncate titles to a reasonable length, assuming bibtool does this in a consistent way.
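A sketch of the first option: match the whole field across line breaks, then collapse the whitespace. `field_value` is a hypothetical helper. It assumes brace-delimited values whose closing brace ends a line (optionally followed by a comma), which is how bibtool prints the entries above.

```python
import re


def field_value(entry_text, field):
    """Return the value of `field` with wrapped lines joined, or '' if it is absent."""
    pattern = re.compile(
        r"^\s*%s\s*=\s*\{(.*?)\}\s*,?\s*$" % re.escape(field),
        # DOTALL lets (.*?) run across the line break inside the value
        re.IGNORECASE | re.MULTILINE | re.DOTALL,
    )
    match = pattern.search(entry_text)
    if match is None:
        return ""
    # Collapse the newline and indentation that the wrapping introduced
    return " ".join(match.group(1).split())
```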
There are various settings that can be used to fine tune bibtool's output. See p. 22 of the manual.
Hm... I need to fix the regex, then.
I think we should truncate titles in postprocessing, because a naive method won't be ideal. Think of the case where the titles of two entries (let's say, "dude1955a" and "dude1955b") are very long but only differ in the last words ("part 1", "part 2") (I'm sure we've all seen titles like that IRL). I would prefer removing text in the middle, but that might look wrong. Detecting those problematic cases could slow us down (we have plenty of room, I think, though).
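Removing text from the middle could look something like this. It is only a sketch: `truncate_middle`, the 40-character default and the "..." marker are arbitrary choices, and it always cuts in the middle rather than detecting the problematic cases first.

```python
def truncate_middle(title, max_len=40, marker="..."):
    """Shorten `title` to at most `max_len` characters, keeping its start and its end."""
    if len(title) <= max_len:
        return title
    keep = max_len - len(marker)  # characters of the original that survive
    head = (keep + 1) // 2
    tail = keep // 2
    return title[:head] + marker + title[len(title) - tail:]
```

Titles that differ only in a trailing "part 1" / "part 2" stay distinguishable, because the end of the title is preserved.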
We could also try to forbid bibtool to break lines when it reaches print.line.length (actually, by setting it to a very large number). However, that won't help if we are not using bibtool, where we can find this problem too.
Vim allows completion functions to return a list of dictionaries, rather than just a list of words. The simplest format is
So, for our purposes, it might be:
where the value of 'word' is the citekey that will be inserted, and the value of 'menu' is what shows up in the popup menu.
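A minimal Python sketch of building such a list, using only the 'word' and 'menu' keys described above. `completion_items` is a hypothetical name, the input is assumed to be (citekey, title) pairs that have already been extracted, and the sample entries are illustrative. Sorting is by citekey, as agreed in the discussion.

```python
def completion_items(entries, base):
    """Build completion dictionaries for the citekeys that start with `base`."""
    items = [
        {"word": key, "menu": title}  # 'word' is inserted, 'menu' is shown in the popup
        for key, title in entries
        if key.startswith(base)
    ]
    return sorted(items, key=lambda item: item["word"])


entries = [
    ("lewis1986", "On the Plurality of Worlds"),
    ("geach1962", "Reference and Generality"),
    ("lewis1973", "Counterfactuals"),
]
print(completion_items(entries, "lew"))
```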
In order to implement this, I need smarter parsing of the supported bibliography files. This can either be implemented directly via regexes, or we can lean on existing parsers, if they are available. The advantage of using regexes is that it is light-weight---we avoid introducing new dependencies---and they probably work fine.
Another option---inspired by vim.latex-box's use of latex+bibtex to solve a similar problem---would be to collect matching keys and then use pandoc to generate a plaintext bibliography and then parse that. The trouble is that the usual CSL styles don't include the citekey. It might not be too hard to generate a custom CSL file for this purpose. But the process is probably too slow for something like completion.