exporting scraped PDF files

dsanson commented 12 years ago

I'd love to see support for exporting scraped PDF files. I poked around in the code a bit but couldn't make sense of what might need to be done, but I really don't know anything about the inner workings of Zotero.

Is it a matter of passing the exportFileData option when calling the bibtex.js translator? Or is it a matter of setting an increased delay, so that the export to bibtex occurs after zotero has scraped the PDF?

jawj commented 12 years ago

Thanks David. Yes, this has been requested a few times before, and would undoubtedly be handy.

I think you're on the right track -- indeed, the simplest thing might be to wait a while for a PDF to be scraped, and only then export the data, all at once. On the other hand, I think the nicer option would be to export the bibliographic data straight away, and then try to associate a scraped PDF with that data when available.

In either case, I've steered clear of adding this functionality until now just because it adds a fair bit of extra complexity. However, I'll keep this issue open, and perhaps I (or someone else) will find some time to address it in future.

vancleve commented 12 years ago

I've just started using Zot2Bib and really like it and would also like the PDF scraping ability. I also dug into the Zotero and Zot2Bib code and was left a little at a loss since I'm not really familiar with the Zotero code and the documentation isn't super detailed. I would like to help get this feature working since it would simplify my workflow and would be a boon to Bibdesk users.

I'll keep poking around in the Zotero code, but George, if you have a preliminary sense for how this might be accomplished, any general outline would be super helpful. I can get a sense for how Zotero saves PDFs and other data, but how to tie this to a function that is called on modification of an item in the library is still unclear to me. From what I understand so far, there is a single notifier that gets called when an item is modified, and this might be when it's added for the first time or when a PDF is attached, and it's not clear how one could tie those two events together.

Thanks for the very helpful Zot2Bib!

jawj commented 12 years ago

I think you're on the right track. The key thing is that the listener function would have to keep a note of added publications, recognise when a PDF was added to one of them, and then be able to identify that publication to BibDesk too. The listener function currently isn't stateful at all, so this is quite a big step up in complexity. Good luck with it — afraid this is still a long way down my TODO list.
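For illustration, the bookkeeping described above could look something like the sketch below. It is a language-neutral model written in Python, not Zot2Bib or Zotero code: `PdfTracker`, `on_item_added`, and `on_pdf_attached` are hypothetical names, and the cite key is assumed to be the identifier BibDesk would use to find the entry again.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class PdfTracker:
    """Remembers exported items so a later PDF event can be matched to them."""

    # item id -> cite key that was sent to BibDesk (hypothetical mapping)
    exported: Dict[int, str] = field(default_factory=dict)

    def on_item_added(self, item_id: int, cite_key: str) -> None:
        # The bibliographic data is exported straight away elsewhere;
        # here we only record how to find that entry again.
        self.exported[item_id] = cite_key

    def on_pdf_attached(self, parent_id: int, pdf_path: str) -> Optional[Tuple[str, str]]:
        # Returns (cite_key, pdf_path) if the parent was exported earlier,
        # otherwise None. The record is dropped once it has been used.
        cite_key = self.exported.pop(parent_id, None)
        if cite_key is None:
            return None
        return (cite_key, pdf_path)
```

The state here is just one dictionary, but it has to survive between notifier calls, which is the step up from a stateless listener.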

vancleve commented 12 years ago

Ok, so I have a working callback that simply looks for modified items with PDF attachments and copies those attachments to another directory and opens them. Thus, all one has to do is drag the PDF onto the entry in Bibdesk and autofile does its magic. This is much better than going back to the website and finding the PDF download link. Is this of interest and if so, how should I go about contributing it? I haven't modified the preferences so that you can enter the directory into Zot2Bib through Firefox and turn the function on/off at will, but I can add that if the functionality is useful to others.
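As a rough sketch of the copy step only (not the actual callback, which lives inside the Firefox extension), the idea is roughly the following. `copy_pdf_attachments` and its arguments are hypothetical, and opening the copied file is left out because it is platform-specific.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def copy_pdf_attachments(attachment_paths: Iterable, dest_dir) -> List[Path]:
    """Copy every existing PDF among the attachments into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, attachment_paths):
        # Skip non-PDF attachments and files that are not on disk yet.
        if src.suffix.lower() == ".pdf" and src.is_file():
            shutil.copy2(src, dest / src.name)
            copied.append(dest / src.name)
    return copied
```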

Also, I think I see a little better how to get the whole process working where Zot2Bib can add the PDF itself. In the zoteroCallback function, you check for attachments of an item that don't have an existing file (it must be downloading then). Add a field to that item containing the item.id of that attachment and don't add the entry to Bibdesk yet (I know this might be suboptimal, but it's much harder to add the entry now and associate the PDF later). Have a separate callback that runs when items are modified. This callback checks to see if the item has this new special field and if the attachment specified by the field has an existing file. If so, the file is done downloading and the entry can now be added to Bibdesk and the PDF auto filed.
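The two-callback scheme above can be modelled roughly as follows. This is a sketch in Python rather than extension code, and it makes some assumptions: the "special field" is replaced by a `pending` dictionary, attachments are identified by file path instead of item.id, and `DeferredExporter`, `on_add`, `on_modify`, and the `export` callable are all hypothetical names.

```python
import os
from typing import Callable, Dict, List, Optional


class DeferredExporter:
    """Hold an entry back until its downloading attachment exists on disk."""

    def __init__(self,
                 export: Callable[[int, Optional[str]], None],
                 file_exists: Callable[[str], bool] = os.path.exists):
        self.export = export            # stand-in for "add the entry to BibDesk"
        self.file_exists = file_exists  # injectable so the logic can be tested
        self.pending: Dict[int, str] = {}  # item id -> awaited attachment path

    def on_add(self, item_id: int, attachment_paths: List[str]) -> None:
        for path in attachment_paths:
            if not self.file_exists(path):
                # Attachment is still downloading: mark the item, do not export.
                self.pending[item_id] = path
                return
        # Nothing is in flight, so export right away (with a PDF if there is one).
        self.export(item_id, attachment_paths[0] if attachment_paths else None)

    def on_modify(self, item_id: int) -> None:
        path = self.pending.get(item_id)
        if path is not None and self.file_exists(path):
            # Download finished: clear the marker and export entry plus PDF.
            del self.pending[item_id]
            self.export(item_id, path)
```

One thing the sketch makes visible: if the download never completes, the entry is never exported, so some kind of fallback would be needed.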

The problem I see with adding the bibtex entry first and trying to add the PDF later is that the user can intervene too easily with the entry while the PDF is downloading. For example, when the PDF is ready, you need some identifier in the bibtex entry that will allow Bibdesk to locate the entry to attach the PDF to. The user could accidentally modify that field though while the PDF is downloading. Attaching the PDF to the newest entry in the bibtex is also problematic since a user could add another entry before the PDF is done.
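Both failure modes can be shown with a toy model. The sketch below uses plain Python dictionaries as stand-ins for BibTeX entries; `attach_to_newest` and `attach_by_key` are hypothetical helpers, not BibDesk functions.

```python
from typing import Dict, List, Optional

Entry = Dict[str, Optional[str]]


def attach_to_newest(entries: List[Entry], pdf: str) -> Entry:
    """Attach the PDF to whichever entry was added last."""
    entries[-1]["pdf"] = pdf
    return entries[-1]


def attach_by_key(entries: List[Entry], key: str, pdf: str) -> Optional[Entry]:
    """Attach the PDF to the entry whose key still matches, if any."""
    for entry in entries:
        if entry["key"] == key:
            entry["pdf"] = pdf
            return entry
    return None


# Entry A is exported and its PDF starts downloading; the user then adds
# entry B before the download finishes.
library: List[Entry] = [{"key": "A", "pdf": None}, {"key": "B", "pdf": None}]

newest = attach_to_newest(library, "a.pdf")        # B receives A's PDF
by_key = attach_by_key(library, "A", "a.pdf")      # A receives its own PDF
library[0]["key"] = "A-edited"                     # user edits the key mid-download
after_edit = attach_by_key(library, "A", "a.pdf")  # no entry matches any more
```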

Anyway, any feedback is welcome!

jawj commented 11 years ago

Hi again @vancleve. I think the best thing would be for you to fork the repo, and make the changes you describe in your fork so I can have a look.

vancleve commented 11 years ago

Hi @jawj. I've forked your repo (here) and added my PDF scraping code. There is an additional preference for the folder to save the PDF in (default is FF download folder) and a preference for whether to open the pdf too. Right now, the whole thing has to wait while the PDF is downloading, so this is a bit annoying, but possibly unavoidable so as to not confuse BibDesk as to which reference the PDF belongs to.

Anyway, take a look and let me know what you think! I know it's already been very useful to me.

vancleve commented 9 years ago

Just a quick bump on this issue since I've updated my fork again. It now continues to export the bibtex even when the PDF download fails.

foice commented 3 years ago

Well, I am considering using Zotero in place of BibDesk and I have to say that having the export of the bibliographic info, as it currently works, plus the location of the PDF file in the BibDesk info would be awesome.

As far as I can see zot2bib is triggered on newly imported items and cannot be triggered to repeat the export, e.g. after the PDF file has been fetched. Am I missing something?

I think that in general it is useful to be able to repeat (and update or overwrite) the addition to bibdesk, for instance for any zotero item that I have updated.

nathan-artist commented 3 years ago

@foice: If you plan to use Zotero in place of BibDesk (as opposed to with BibDesk), why not use Better BibTeX for Zotero? If you really plan to replace BibDesk with Zotero, you wouldn't need BibDesk at all.

foice commented 3 years ago

I have https://github.com/retorquere/zotero-better-bibtex currently "on trial". The most likely outcome seems to be that I will have to use it with BibDesk, because so far I can handle additions via the command line only with BibDesk, plus there are a number of other "in the field" tests I have not yet run on Zotero. Still, Zotero probably gives better keyword features (for my use).

Anyhow, this is all about "me". On the contrary I think the issue of zot2bib being triggerable at will is an issue with its own standing ... regardless of my usecase. So I restate the case for having zot2bib to be called on already existing items, why should it not be possible?

At any rate, exporting the location of the PDF file to BibDesk's own field also seems a core feature.

nathan-artist commented 3 years ago

@foice: Yes, that is a major limitation of Better BibTeX for Zotero: changes to the BibTeX are one-way from Zotero. One has to close a BibDesk database before editing the BibTeX outside of BibDesk, but at least it is possible to edit it under that condition.

You may want to open a separate issue for "having zot2bib to be called on already existing items", which seems out of scope for this issue.

vancleve commented 3 years ago

Just FYI, I moved fully over to Zotero in 2019 because Bibdesk was just too slow with my big library. Zotero is quite a bit harder to add functionality to, compared to the script hooks in Bibdesk, but it's much more actively developed than Bibdesk.

nathan-artist commented 3 years ago

@vancleve: I don't know what slowed down your BibDesk library, but in my case I figured out quickly when I started using BibDesk in 2008 that the linked file fields were slowing it down, and that using them was not going to scale to the many thousands of references that I foresaw having in the near future. Instead, I made a "Downloaded" checkbox field and an AppleScript that opens the downloaded file without any use of the linked file field. Now I have well over 30k references in one BibDesk database and it is lightning fast, much faster than Zotero.

Having said that, there are some amazing plugins for Zotero that shouldn't be ignored, and by using Zotero in place of BibDesk you have easy access to those plugins.

vancleve commented 3 years ago

@nathan-artist, ah yes, that was probably it. I had over 10k refs with linked PDFs and it was terribly slow.

Still, there is some elegance to cutting out the zot2bib middle man and mostly just using one reference manager instead of two, even if Zotero is a bit slow.

jawj / Zot2Bib

exporting scraped PDF files #1