phiresky / pandoc-url2cite

Effortlessly and transparently add correctly styled citations to your markdown paper given only a URL

Use translation-server to export directly to CSL JSON #1

Open dhimmel opened 4 years ago

dhimmel commented 4 years ago

Really awesome package. I was working on a similar pandoc-filter using the manubot python package a while ago in https://github.com/manubot/manubot/pull/99, but we never finished it.

I like your syntax for specifying citekey aliases that point to URLs. Looks like you also support some types of persistent identifiers directly in the citekey. Manubot currently supports several types of IDs for citation by persistent ID. Would be interested in coordinating to keep our syntaxes compatible.

Looks like the core of the functionality of pandoc-url2cite occurs around:

https://github.com/phiresky/pandoc-url2cite/blob/b28374a9a037a5ce1747b8567160d8dffd64177e/index.ts#L62-L76

So first you use wikipedia Citoid to create bibtex and then use pandoc-citeproc to convert to CSL JSON. Note that you can also use the translation-server API to go from Zotero metadata directly to CSL JSON (python code). Theoretically it seems possible that you could get higher quality metadata by avoiding the bibtex passthrough.

Feel free to use our public translation-server instance we host for Manubot at https://translate.manubot.org/ as described at https://github.com/manubot/manubot/issues/82. When we last checked, Citoid lagged behind translation-server... not sure if that is still the case.
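
A minimal sketch of the direct route described above, assuming the `/web` and `/export?format=csljson` endpoints as documented in the translation-server README. The helper names `webRequest` and `exportRequest` and the `RequestSpec` type are hypothetical and not part of pandoc-url2cite or Manubot. The block only builds the two requests, so it makes no network calls.

```typescript
// Sketch only: builds the two requests for a translation-server round trip.
// BASE is the public Manubot instance mentioned above; all helper names are hypothetical.
const BASE = "https://translate.manubot.org";

interface RequestSpec {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Step 1: ask translation-server to scrape a web page into Zotero item JSON.
function webRequest(pageUrl: string, base: string = BASE): RequestSpec {
  return {
    url: base + "/web",
    method: "POST",
    headers: { "Content-Type": "text/plain" },
    body: pageUrl,
  };
}

// Step 2: convert the Zotero items straight to CSL JSON, with no BibTeX in between.
function exportRequest(zoteroItems: unknown[], base: string = BASE): RequestSpec {
  return {
    url: base + "/export?format=csljson",
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(zoteroItems),
  };
}
```

Each spec can be sent with any HTTP client: the JSON array returned by the first request becomes the `zoteroItems` argument of the second. Pages that yield several candidate items may need extra handling that this sketch leaves out.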

phiresky commented 4 years ago

Looks like you also support some types of persistent identifiers directly in the citekey

Yep, but only doi: and isbn:. I think it's usually better to use URLs anyways though.

I like your syntax for specifying citekey aliases that point to URLs.

Yep, it's supposed to be just the markdown link syntax (which is how pandoc parses it anyway if the citations extension is turned off). In fact, I should probably also support [@abc](https://example.com).

Would be interested in coordinating to keep our syntaxes compatible.

The syntax for pandoc-url2cite citekeys is basically defined by this regex:

https://github.com/phiresky/pandoc-url2cite/blob/b28374a9a037a5ce1747b8567160d8dffd64177e/util.ts#L3

Every cite key that matches this regex is passed to Citoid, everything else will be resolved as an alias to a different cite key (or fail if unmatched). Looks like manubot has the same syntax for most ids (prefixed with doi: / isbn:), but it also adds a url: prefix to URLs.
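
The rule above can be sketched roughly as follows. `DIRECT_KEY` is a hypothetical stand-in for the regex in util.ts (the real pattern may differ), and following chains of aliases is an assumption rather than confirmed behavior.

```typescript
// Hypothetical stand-in for the regex in util.ts; the real pattern may differ.
const DIRECT_KEY = /^(doi:|isbn:|https?:\/\/)/;

type Resolved =
  | { kind: "direct"; id: string }       // would be handed to the resolver
  | { kind: "unresolved"; key: string }; // no alias found, or an alias cycle

// Follows aliases until a key matches DIRECT_KEY, or gives up.
function resolveCitekey(key: string, aliases: Record<string, string>): Resolved {
  // A chain longer than the alias table must contain a cycle.
  const maxHops = Object.keys(aliases).length;
  let current = key;
  for (let hops = 0; hops <= maxHops; hops++) {
    if (DIRECT_KEY.test(current)) {
      return { kind: "direct", id: current };
    }
    if (!Object.prototype.hasOwnProperty.call(aliases, current)) {
      break;
    }
    current = aliases[current];
  }
  return { kind: "unresolved", key: key };
}
```

With an alias table such as `{ abc: "https://example.com" }`, the key `abc` resolves to the URL, while a key with no alias and no recognized prefix comes back as unresolved.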

I'd like to keep the "bare" url specifiers because for me that is the preferred syntax and I'll probably be purely using those.

Some thoughts

  1. IDs like DOIs have fewer special characters than URLs, so they don't conflict with other markdown syntax and can be used directly in the [@foo] citation. I do in fact also support [@http://etc], but that breaks if the URL contains a &. Maybe I'll try to ask the pandoc author again to add support for [@{https://}] as described in https://github.com/jgm/pandoc-citeproc/issues/308.
  2. IDs are shorter. Mainly useful for inline use without aliasing.
  3. I still prefer URLs because it's clear how to resolve them as a human. Each of those IDs should already have a "canonical" URL for resolving it, so imo I probably want to just use that URL. The effort of defining a citekey alias seems pretty low to me.
  4. Have you thought about declaring http: / https: the "citation source" for URLs and ://example.com the value? Then it would be compatible without the url: prefix. I mean that's kind of the point of the protocol anyways, right? I mean in theory you could try to do it "correctly" by reading all the standards of URNs / URIs and try to be compatible, but then you probably need to use urn:isbn:123 which is annoying.
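
To make point 4 concrete, here is a small sketch that treats everything before the first colon as the citation source, so http and https need no extra url: prefix. The function names and the doi/pmid URL templates are illustrative assumptions, not the syntax of either tool.

```typescript
interface ParsedKey {
  source: string; // e.g. "https", "doi", "pmid"
  value: string;  // e.g. "//example.com", "10.1000/xyz"
}

// Splits "source:value" at the first colon, so "https://example.com"
// becomes source "https" and value "//example.com".
function parseCitekey(key: string): ParsedKey | null {
  const i = key.indexOf(":");
  if (i <= 0) {
    return null; // no prefix: would have to be an alias
  }
  return { source: key.slice(0, i).toLowerCase(), value: key.slice(i + 1) };
}

// Rebuilds a URL a human can open. The templates below are assumptions
// chosen for illustration.
function canonicalUrl(p: ParsedKey): string | null {
  switch (p.source) {
    case "http":
    case "https":
      return p.source + ":" + p.value;
    case "doi":
      return "https://doi.org/" + p.value;
    case "pmid":
      return "https://www.ncbi.nlm.nih.gov/pubmed/" + p.value;
    default:
      return null;
  }
}
```

Under this reading, a bare URL and a prefixed ID go through the same parser, and the short forms stay available for inline use.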

Citoid to create bibtex

Yeah I was wondering why Citoid doesn't have a CSL export option since it's an obvious choice.

I've already had problems with the conversion:

So using something that can output CSL directly would be a good idea. But I was already pretty anxious about using an external API as the resolver for multiple reasons (changes / downtime in the future, trust, etc). I really would like to use a local resolver, but translation-server is far too much of a behemoth. I was really happy when I found that Wikipedia provides a server, since I can trust that to be able to handle traffic, not go down soon and be somewhat stable. I'll consider using the manubot server (thanks!) if I encounter more problems in the future (like, as you said, outdated converters).

Thank you for the suggestion. I did not know about manubot, looks interesting. I definitely like what the output looks like, though the input might be too complicated / opinionated for me. Also tbh I'm kind of missing an overview of what manubot is, how it compares or relates to latex, pandoc and other markdown processors, what a manuscript is (is a paper a manuscript?), and if I can use it to write for existing journals (that need latex).

phiresky commented 4 years ago

Actually I just found the manubot.org homepage which explains stuff better - my fault haha. I just went through the repos, maybe link from the rootstock repo to the homepage as well?

dhimmel commented 4 years ago

Thanks for the link to https://github.com/jgm/pandoc-citeproc/issues/308. The limited character set supported by pandoc citekeys has been a big barrier for us as well. I'll chime in on the issue.

In fact, I should probably also support [@abc](https://example.com)

And this would define the @abc citekey for use everywhere? That seems like a nice syntax.

I'd like to keep the "bare" url specifiers because for me that is the preferred syntax and I'll probably be purely using those.

Makes sense. I think you make a good point that perhaps Manubot could use the http / https prefix instead of url.

I still prefer URLs because it's clear how to resolve them as a human. Each of those IDs should already have a "canonical" URL for resolving it, so imo I probably want to just use that URL. The effort of defining a citekey alias seems pretty low to me.

Yeah, requiring URLs to give all viewers the ability to immediately resolve identifiers is a nice perk. The main downside is brevity, i.e. @pmid:28936969 versus @https://www.ncbi.nlm.nih.gov/pubmed/28936969.

wikipedia provides a server since I can trust that to be able to handle traffic, not go down soon and be somewhat stable

Yes, the reliability of the Wikipedia infrastructure is a big plus. If you were to add support for translate.manubot.org, it would probably make sense as an option (perhaps off by default) or as something that, if it fails, falls back to the Wikipedia endpoint.
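
The fallback idea could look something like this sketch. The `Resolver` type and `resolveWithFallback` are hypothetical names; each resolver would wrap one endpoint, and the order of the list decides which one is tried first.

```typescript
// A resolver takes a URL and eventually yields citation metadata.
// Both the type and the function below are hypothetical, for illustration only.
type Resolver = (url: string) => Promise<unknown>;

// Tries each resolver in order and returns the first successful result.
async function resolveWithFallback(url: string, resolvers: Resolver[]): Promise<unknown> {
  const errors: string[] = [];
  for (const resolve of resolvers) {
    try {
      return await resolve(url);
    } catch (err) {
      // Remember why this endpoint failed, then move on to the next one.
      errors.push(String(err));
    }
  }
  throw new Error("all resolvers failed: " + errors.join("; "));
}
```

Passing a translation-server resolver first and a Citoid resolver second would give the behavior suggested above, while passing only the Citoid resolver keeps the current default.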

manubot, looks interesting. I definitely like what the output looks like, though the input might be too complicated / opinionated for me.

Yes, I see your filter as more general purpose. Manubot is really a toolset to continuously publish a manuscript whose source is tracked in a git repo, so GitHub can be used for collaborative writing.

phiresky commented 4 years ago

And this would define the @abc citekey for use everywhere? That seems like a nice syntax.

Yeah. Imo it's kind of a mistake that pandoc decided to introduce a different and incompatible syntax for citekeys. But it looks to me like it could be unified again pretty well if pandoc added [@x](url) and [@x]: url, working the way they already do with -f markdown-citations, with slightly different semantics than links.

If you were to add support for translate.manubot.org, it probably would make sense as an option

Yep, good idea. Will probably only happen the next time I write an academic document (and have problems with it), so it might be a bit :D. Might also make sense to push citoid to support csl export. From what I understand it should only be like a one line whitelist change. I'm not even sure what citoid does apart from being a translation-server.

dhimmel commented 4 years ago

Might also make sense to push citoid to support csl export

Definitely! I'm not sure who maintains the citoid infrastructure, but keep me up to date if you have any leads.