Online lookup - GitHub issues

aurelien-naldi commented 2 years ago

Did you consider automatic creation of bib entries based on online lookup or should this be implemented outside of hayagriva?

I would like to integrate lookup based on:

- BibTeX obtained from the DOI system (easy, as you already have a BibTeX parser)
- the PubMed database of life-science publications (needs a dedicated parser)
- HTML metadata in some web pages (needs to extract specific tags from the header)
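For the DOI case, a minimal sketch of how the request could be described: the DOI resolver supports content negotiation, so asking `https://doi.org/<doi>` with an `Accept: application/x-bibtex` header returns a BibTeX record. The `LookupRequest` type and the `doi_bibtex_request` function are hypothetical names used only for illustration, and the sketch only builds the request description without performing any network access.

```rust
/// Description of an HTTP GET request.
/// Hypothetical helper type, not part of hayagriva.
#[derive(Debug, PartialEq)]
pub struct LookupRequest {
    pub url: String,
    pub accept: &'static str,
}

/// Build the request that asks the DOI resolver for a BibTeX record.
/// Assumption: the DOI needs no extra escaping; unusual characters are not handled here.
pub fn doi_bibtex_request(doi: &str) -> LookupRequest {
    let doi = doi.trim();
    // Accept bare DOIs as well as a few common prefixed spellings.
    let bare = ["https://doi.org/", "http://doi.org/", "doi:"]
        .iter()
        .find_map(|prefix| doi.strip_prefix(*prefix))
        .unwrap_or(doi);
    LookupRequest {
        url: format!("https://doi.org/{bare}"),
        // Media type used for BibTeX in DOI content negotiation.
        accept: "application/x-bibtex",
    }
}
```

The returned body could then be handed to the existing BibTeX parser; how that would be exposed in the API is left open here.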

I have some old Python code doing this, and I would be happy to port it to Rust and prepare a PR.

reknih commented 2 years ago

I'd like to see such a PR!

Could you gate the code and the matching dependencies behind an optional feature? For the environments we deploy hayagriva in, it would also be necessary that all dependencies you add can compile and link as WebAssembly. This means that crates depending on C code are a no-go, and special attention needs to be paid to network requests. reqwest would fit the bill.
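A rough sketch of what such gating might look like, assuming a feature called `lookup` (a hypothetical name) that turns on an optional `reqwest` dependency. The module and function names are illustrative only. The sketch uses reqwest's async client, since that is the client reqwest offers when targeting WebAssembly.

```rust
// Assumed Cargo.toml wiring (the feature name `lookup` is hypothetical):
//
//   [features]
//   lookup = ["dep:reqwest"]
//
// with `reqwest` declared as an optional dependency.

/// Compiled only when the optional feature is enabled, so default builds
/// never pull in the network stack.
#[cfg(feature = "lookup")]
pub mod lookup {
    /// Fetch `url` and return the body as text, sending the given `Accept` header.
    pub async fn fetch_text(url: &str, accept: &str) -> Result<String, reqwest::Error> {
        reqwest::Client::new()
            .get(url)
            .header("Accept", accept)
            .send()
            .await?
            .error_for_status()? // turn 4xx/5xx responses into errors
            .text()
            .await
    }
}

/// Lets callers check whether lookup support was compiled in.
pub fn lookup_enabled() -> bool {
    cfg!(feature = "lookup")
}
```

With this layout, a build without the feature contains no reference to `reqwest` at all, which keeps the default dependency tree unchanged.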

What format are the PubMed citations in? Is it RIS(-based) or something else altogether? Especially if it's a general purpose bibliography exchange format like RIS, I'd like to keep parsers separate from this crate.

aurelien-naldi commented 2 years ago

Thanks for the fast feedback!

PubMed uses a dedicated XML format. I expect to add the following dependencies:

- reqwest for network access
- roxmltree or entrez-rs to parse the PubMed format

I will work on an out-of-tree proof of concept and come back here to discuss how it should be exposed in the API.

phiresky commented 1 year ago

Hey! I want to mention my project that implements this (automatic citation extraction based on a URL) for pandoc: https://phiresky.github.io/blog/2019/pandoc-url2cite/

The citation for each URL is fetched once, then indefinitely cached. The URL of the citation becomes the citation key so managing a citation database is not required.
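The fetch-once idea can be sketched as a map keyed by the source URL. This is not how url2cite itself is implemented; `CitationCache` and `get_or_fetch` are hypothetical names, the cache lives only in memory (persisting it to disk is left out), and the actual lookup is passed in as a closure so the sketch stays independent of any network library.

```rust
use std::collections::HashMap;

/// In-memory citation cache keyed by source URL (hypothetical type).
#[derive(Default)]
pub struct CitationCache {
    entries: HashMap<String, String>,
}

impl CitationCache {
    /// Return the cached record for `url`, calling `fetch` only the first time
    /// that URL is requested. Failed fetches are not cached.
    pub fn get_or_fetch<F, E>(&mut self, url: &str, fetch: F) -> Result<&str, E>
    where
        F: FnOnce(&str) -> Result<String, E>,
    {
        if !self.entries.contains_key(url) {
            let record = fetch(url)?;
            // The URL itself serves as the citation key.
            self.entries.insert(url.to_owned(), record);
        }
        Ok(self.entries[url].as_str())
    }
}
```

Because the URL is the key, a second request for the same URL returns the stored record without invoking the closure again.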

It works by leeching off of the Zotero Translators. That's a repository containing a huge number of parsers for all different kinds of sources. I wouldn't try starting from scratch there because the Zotero translators are extensive and constantly updated. The main issue is that they need a JS runtime. There's an official Docker image, but for my tool I simply use the public Wikipedia API that hosts that server: https://en.wikipedia.org/api/rest_v1/data/citation/bibtex/{url} for example https://en.wikipedia.org/api/rest_v1/data/citation/bibtex/https%3A%2F%2Fpubmed.ncbi.nlm.nih.gov%2F10885091%2F
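As a small illustration of the URL scheme quoted above, the source URL is percent-encoded and appended to the endpoint path. The function names below are hypothetical, and the encoder is a deliberately conservative one that escapes every byte outside the RFC 3986 unreserved set; a real implementation might prefer an existing encoding crate.

```rust
/// Percent-encode every byte except RFC 3986 unreserved characters.
pub fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{byte:02X}")),
        }
    }
    out
}

/// Build the Wikipedia citation endpoint URL that returns BibTeX for `source_url`.
pub fn wikipedia_citation_url(source_url: &str) -> String {
    format!(
        "https://en.wikipedia.org/api/rest_v1/data/citation/bibtex/{}",
        percent_encode(source_url)
    )
}
```

Calling `wikipedia_citation_url("https://pubmed.ncbi.nlm.nih.gov/10885091/")` reproduces the example URL given above.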

It's not ideal to require an external server (even if it can easily be self-hosted) but it's a really easy way to get this functionality with pretty high fidelity and would work in any environment including wasm. You'll need to call an external proxy API in any case in order to do this in the browser, otherwise you won't be able to get the needed info in many cases due to CORS.

typst / hayagriva

Online lookup #28