jkitchin / scimax

An emacs starterkit for scientists and engineers

Feature Request: repeat earlier literature research and update database #196

Closed rbirkelbach closed 6 years ago

rbirkelbach commented 6 years ago

Hello,

I'm not sure whether this repo or org-ref is the appropriate place, but here's my idea: it would be neat to save searches from, for example, crossref-lookup or other literature search tools and occasionally repeat them. If a new search returns results that differ from the initial ones, one would be prompted whether the new results should be added to the literature database and, if a PDF is available, whether it should be downloaded. This would be beneficial for ongoing long-term projects, or in fields with a lot of output on a specific topic.

What do you think about this idea?

Best, Robert

jkitchin commented 6 years ago

I like the idea, and I think it is totally doable. It seems like it would either take the form of a link you click on at your leisure, or an elisp code block you run the same way. It isn't super obvious whether it should go in scimax or org-ref; I could see it in either place.

crossref is the only search engine I have written elisp code to retrieve records from, but I think equivalents exist for arxiv and other backends, e.g. in biblio. It should not be too difficult to put some wrappers around these that keep a persistent cache to compare against. Here is an example that does roughly this via biblio. I don't know how reliable the hash method I use is; it is fragile to any change in the retrieved records, but it is robust to things like missing fields.

;; Needs biblio, plus dash (-remove), f (f-exists?) and cl-lib (cl-loop).
(require 'biblio)
(require 'dash)
(require 'f)
(require 'cl-lib)

(let* ((query "alloy segregation")
       (backend 'biblio-arxiv-backend)
       (cb (url-retrieve-synchronously (funcall backend 'url query)))
       ;; Keep only results whose md5 hash we have not cached yet. Each new
       ;; result is written to a file named by its hash (in `default-directory',
       ;; so in practice you would want a dedicated cache directory).
       (results (-remove 'null
                         (cl-loop for result in (with-current-buffer cb
                                                  (funcall backend 'parse-buffer))
                                  collect
                                  (with-temp-buffer
                                    (prin1 result (current-buffer))
                                    (let ((hash (md5 (buffer-string))))
                                      (if (f-exists? hash)
                                          nil
                                        ;; New result: cache its hash, keep it.
                                        (prog1 result
                                          (write-file hash))))))))
       (results-buffer (biblio--make-results-buffer (current-buffer) query backend)))
  (with-current-buffer results-buffer
    (biblio-insert-results results "")))

What do you think, is this close to what you had in mind?

jkitchin commented 6 years ago

I put a few more notes on how to do this here: http://kitchingroup.cheme.cmu.edu/blog/2018/04/11/Caching-searches-using-biblio-and-only-seeing-new-results/

rbirkelbach commented 6 years ago

Your description here and in your blog post sounds like it nails it, but I'm not proficient in elisp. I think it should be a function one could trigger via M-x some_function_name, because that would be much more end-user friendly; I would like it to be so simple that student research assistants could use it without a long emacs/elisp introduction :). The last time I searched for software with this feature I could not find anything, but that was a couple of years ago. It is very impressive how fast you implemented a prototype!

I think this would be a good feature for org-ref, as it would spread the idea faster and maybe inspire more pull requests.

Do you think it would be difficult to

  1. implement more literature databases/search engines, for example Google Scholar?
  2. update the records if the DOI version number changes, e.g. because the authors fixed some issues after someone found a major flaw?

jkitchin commented 6 years ago

The easiest UI (IMO) is to just click on a link that shows you the new stuff. That could live in a TODO heading with a repeating deadline (e.g. once a week) so it appears in your agenda as a reminder. It can't just be an M-x function, I think, unless you hard-code the query into the function or retype it every time (which would be annoying).
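
If you did hard-code it, the command might look something like this (just a sketch; my-check-alloy-segregation is a made-up name and scimax-arxiv is the hypothetical search function I get to below):

(defun my-check-alloy-segregation ()
  "Show new arxiv results for one hard-coded query."
  (interactive)
  (scimax-arxiv "alloy segregation"))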

That link might look something like: [[scimax-arxiv:alloy segregation]]. All the code in the blog post was really just to show the steps to get there. It could also be: [[elisp:(scimax-arxiv "alloy segregation")]]. Either way, you just click on it to see the new stuff.
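
Wiring up such a link is not much code with org 9's org-link-set-parameters; here is a sketch, assuming a scimax-arxiv function like the one below exists:

(org-link-set-parameters
 "scimax-arxiv"
 :follow (lambda (query)
           ;; Clicking [[scimax-arxiv:alloy segregation]] reruns the cached
           ;; search and shows only results we have not seen before.
           (scimax-arxiv query)))

And a repeating TODO heading to surface it in the agenda could look like:

* TODO check arxiv for new alloy segregation results
  DEADLINE: <2018-04-16 Mon +1w>
  [[scimax-arxiv:alloy segregation]]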

I think what I did would work fine for anything in biblio, which has several backends (https://github.com/cpitclaudel/biblio.el#supported-sources). There is also https://github.com/cute-jumper/gscholar-bibtex, so I guess something similar is possible for Google Scholar.
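
Concretely, the code from the blog post could be wrapped up like this (scimax-cached-search and scimax-arxiv are hypothetical names, nothing in scimax yet):

(require 'biblio)
(require 'dash)
(require 'f)
(require 'cl-lib)

(defun scimax-cached-search (backend query)
  "Show results for QUERY from biblio BACKEND not seen before.
Seen results are tracked as md5-named files in `default-directory'."
  (let* ((cb (url-retrieve-synchronously (funcall backend 'url query)))
         (results (-remove 'null
                           (cl-loop for result in (with-current-buffer cb
                                                    (funcall backend 'parse-buffer))
                                    collect
                                    (with-temp-buffer
                                      (prin1 result (current-buffer))
                                      (let ((hash (md5 (buffer-string))))
                                        (unless (f-exists? hash)
                                          ;; New result: cache it, keep it.
                                          (prog1 result
                                            (write-file hash)))))))))
    (with-current-buffer (biblio--make-results-buffer (current-buffer) query backend)
      (biblio-insert-results results ""))))

(defun scimax-arxiv (query)
  "Show unseen arxiv results for QUERY."
  (scimax-cached-search 'biblio-arxiv-backend query))

;; Any other biblio backend should drop in the same way, e.g.:
;; (scimax-cached-search 'biblio-crossref-backend "alloy segregation")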

It would be a big effort to get this to work with something like Scopus (I did that in Python once). It might even be easier to just write a little Python script for it.

  2. I think this would be tricky. Already, I tend to have arxiv and journal versions of articles. I think a doi change would be sufficiently rare that you could probably live with multiple entries, like when people used to cite an article and an erratum on it.

The biggest issue is still the one at the end of the blog post: why are there only 10 hits, and what would make a query show you newer ones? That requires digging deeper into the query to see what the limits are on how many responses come back, whether they can be sorted, and whether you can request results newer than some date. That doesn't seem possible with the simple query interfaces of most of these databases.
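
For arxiv at least, the raw export API does take paging and sorting parameters, so a deeper integration could request the newest results directly instead of using the backend's default URL. A sketch with the documented arxiv API parameters:

;; Request the 50 most recently submitted matches, newest first.
(url-retrieve-synchronously
 (format (concat "http://export.arxiv.org/api/query?search_query=all:%s"
                 "&start=0&max_results=50"
                 "&sortBy=submittedDate&sortOrder=descending")
         (url-hexify-string "alloy segregation")))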

You might check out the org-ref-x links, where x is wos, scopus, arxiv, or pubmed. These open the web interface for some search terms, and from there I find it easy to add new papers from a doi.

rbirkelbach commented 6 years ago
jkitchin commented 6 years ago

The scopus code is at https://github.com/scopus-api/scopus; it is MIT licensed and should be platform independent. Scopus itself, on the other hand, requires an institutional license. See https://github.com/scopus-api/scopus/blob/master/README.ipynb for some example uses.

rbirkelbach commented 6 years ago

Thank you very much.

jkitchin commented 6 years ago

It occurs to me that you might also want to consider getting new results through RSS feeds. I use elfeed for this. It isn't quite the same as rerunning a search (although you can set up alerts in Google Scholar, Scopus, etc. to send you new search results by email), but it is complementary. I have experimented with scoring and automatic tagging of RSS entries (http://kitchingroup.cheme.cmu.edu/blog/2017/01/05/Scoring-elfeed-articles/), and it works pretty well. Certainly not perfect, but between email alerts and RSS, I feel like new results are pretty well covered for me. Digging through old results in an area that is new to me, on the other hand....
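
If you want to try the RSS route, the elfeed side is only a few lines (the feed URL here is just an example; arxiv publishes one RSS feed per category):

(require 'elfeed)
(setq elfeed-feeds
      '("http://export.arxiv.org/rss/cond-mat.mtrl-sci"))
;; M-x elfeed opens the search buffer; press G to fetch new entries.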

jkitchin commented 6 years ago

For the scopus Python path, a lot of the classes have some kind of org property that generates org-mode compatible strings. If those don't do what you want, they might be reasonable templates for getting what you do want.