Citing academic papers: Crossref integration

wetneb commented 5 years ago

ReSpec has a fantastic system to cite other specs, built on top of the SpecRef database. However, it is geared towards specs only: there does not seem to be a simple way to cite academic papers, since the data model for the references is not adapted to them. Given ReSpec's scope, I understand the restriction to specs, but I think citing academic papers could still be useful in non-normative sections, to motivate choices behind a spec.

Crossref is a database of about 100 million papers, identified by their Digital Object Identifier (DOI). It provides an HTTP API which can be used to search for papers and retrieve their metadata, just like SpecRef does. Therefore, it should be possible to use it just like SpecRef. For instance, I would like to be able to write something like this:

… has been extensively studied [[doi:10.1007/978-3-319-93417-4_5]]. However, …

which would fetch citation metadata from Crossref and insert it in the bibliography. It would generate a citation and link to the corresponding paper at https://doi.org/10.1007/978-3-319-93417-4_5.
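As a rough illustration of the idea above, here is a minimal sketch of how a DOI could be turned into a resolver link and a Crossref lookup. The function names are hypothetical and not part of ReSpec. It assumes the Crossref REST API's `works/{doi}` route and its `message` response envelope, and takes the fetch implementation as a parameter so it can be swapped out.

```javascript
// Hypothetical helpers for illustration; none of these exist in ReSpec.

// Keep the "/" separators of a DOI but percent-encode each segment.
function encodeDoiPath(doi) {
  return doi.split("/").map(segment => encodeURIComponent(segment)).join("/");
}

// Link that the rendered citation would point to.
function doiLink(doi) {
  return `https://doi.org/${encodeDoiPath(doi)}`;
}

// Crossref REST API URL for a single work.
function crossrefWorkUrl(doi) {
  return `https://api.crossref.org/works/${encodeDoiPath(doi)}`;
}

// fetchImpl is injectable so this can run without network access in tests.
async function fetchCrossrefWork(doi, fetchImpl = fetch) {
  const response = await fetchImpl(crossrefWorkUrl(doi));
  if (!response.ok) {
    throw new Error(`Crossref lookup failed for ${doi}: HTTP ${response.status}`);
  }
  const body = await response.json();
  return body.message; // the work record sits inside a "message" envelope
}
```

The record returned by `fetchCrossrefWork` would still need to be mapped to whatever shape the bibliography renderer expects.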

Would you be interested in this feature? If you think this is a good idea I could provide a pull request for that.

marcoscaceres commented 5 years ago

@wetneb, we'd totally take a PR for this! That sounds amazing. We probably want to discuss architecturally how to integrate this, but it should be fairly straightforward.

As the "doi" scheme is somewhat different to the bibref scheme, you probably want to add a matching rule here: https://github.com/w3c/respec/blob/develop/src/core/inlines.js#L30

And then check for `startsWith("[[doi:")`: https://github.com/w3c/respec/blob/develop/src/core/inlines.js#L246

And then add a function that writes out the reference, similar to: https://github.com/w3c/respec/blob/develop/src/core/inlines.js#L96

You can then validate the "doi:" entry there before fetching it.

And maybe also support "expansions", so [[[doi:10.1007/978-3-319-93417-4_5]]] expands out to the title of an article (or whatever is appropriate): https://github.com/w3c/respec/blob/develop/src/core/inlines.js#L31

You can probably also hook into "biblio.js", "biblio-db.js", and "render-biblio.js": https://github.com/w3c/respec/blob/develop/src/core/biblio.js https://github.com/w3c/respec/blob/develop/src/core/biblio-db.js https://github.com/w3c/respec/blob/develop/src/core/render-biblio.js

(or write equivalents to the above - specifically for doi citations)

And then change HTTP end-points if the entry starts with "doi:"... but you may need to transform from DOI format to the SpecRef format to render an entry.
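The matching and validation steps above could be sketched roughly as follows. This is not ReSpec's actual `inlines.js` code: the regular expressions, the `classifyInline` name, and the returned object shape are all assumptions made for illustration, and the DOI check is only a loose shape test.

```javascript
// Hypothetical sketch; names and shapes are not taken from ReSpec.

// Loose DOI shape: "10.", a 4-9 digit registrant code, "/", a non-empty suffix.
const DOI_SHAPE = /^10\.\d{4,9}\/\S+$/;

// Matches [[id]] or [[[id]]], capturing both bracket runs and the id.
const INLINE_CITE = /^(\[{2,3})([^\[\]]+)(\]{2,3})$/;

function classifyInline(token) {
  const match = INLINE_CITE.exec(token);
  if (!match) return null;
  const [, open, rawId, close] = match;
  if (open.length !== close.length) return null; // unbalanced brackets
  const id = rawId.trim();
  const expand = open.length === 3; // [[[...]]] asks for an expanded title
  if (id.startsWith("doi:")) {
    const doi = id.slice("doi:".length);
    // Validate before any network request is attempted.
    return { kind: "doi", expand, id: doi, valid: DOI_SHAPE.test(doi) };
  }
  return { kind: "bibref", expand, id, valid: true };
}
```

For example, `classifyInline("[[doi:10.1007/978-3-319-93417-4_5]]")` yields a `doi` entry with `expand` set to `false`, while the triple-bracket form sets `expand` to `true`.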

marcoscaceres commented 5 years ago

Responded also on spec-prod@. Adding @tobie here as we need to decide if SpecRef is the right integration point, or if ReSpec is.

My gut feeling is that SpecRef is a better integration point. Then both ReSpec and BikeShed don't need to change at all.

wetneb commented 5 years ago

Ok! Thanks for the enthusiastic reply! (Somehow I don't seem to receive replies to my thread on spec-prod@). I am not familiar with SpecRef's architecture, but my impression was that it was quite reliant on having all the records stored in a local JSON file. For Crossref, that is clearly not an option (JSON dumps weigh gigabytes).

If we want to have a seamless integration of Crossref in there, it would mean that for each search query to SpecRef, it would itself query the Crossref API for search results and somehow merge them with local references? It would seem a bit brittle to me.

Even if SpecRef can be used to fetch Crossref metadata, adaptations will still be required in clients (such as ReSpec and BikeShed) for rendering, since Crossref uses a different bibliographical format.

marcoscaceres commented 5 years ago

> but my impression was that it was quite reliant on having all the records stored in a local JSON file.

It's a NodeJS server application + API end-points... that just happens to store some of its data in JSON. But technically, it can serve as a proxy.

> If we want to have a seamless integration of Crossref in there, it would mean that for each search query to SpecRef, it would itself query the Crossref API for search results and somehow merge them with local references? It would seem a bit brittle to me.

I was thinking "if starts with doi: then ping Crossref, else use internal DB".
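A minimal sketch of that dispatch rule, assuming a server-side lookup where both back ends are passed in. `makeResolver`, `localDb`, and `queryCrossref` are hypothetical names, not SpecRef's actual code.

```javascript
// Hypothetical sketch of the "doi: prefix picks the back end" rule.
function makeResolver({ localDb, queryCrossref }) {
  return function resolve(refId) {
    if (refId.startsWith("doi:")) {
      // Strip the scheme and hand the bare DOI to the Crossref back end.
      return queryCrossref(refId.slice("doi:".length));
    }
    // Anything else is looked up in the existing local database.
    return localDb.has(refId) ? localDb.get(refId) : null;
  };
}
```

Because the back ends are injected, `queryCrossref` can return either a value or a promise, and the caller decides how to await it.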

> Even if SpecRef can be used to fetch Crossref metadata, adaptations will still be required in clients (such as ReSpec and BikeShed) for rendering, since Crossref uses a different bibliographical format.

Continuing from the above... When Crossref responds, we convert the data to SpecRef's format (and cache it). So BikeShed and ReSpec don't need to care.
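The conversion step might look roughly like the sketch below. The input follows the field names of a Crossref work record (`title`, `author`, `container-title`, `issued`, `DOI`). The output shape (`title`, `href`, `authors`, `publisher`, `date`) is an assumption about what a SpecRef-style entry needs, and `crossrefToSpecRef` is a hypothetical name.

```javascript
// Hypothetical converter; the target shape is an assumed SpecRef-like entry.
function crossrefToSpecRef(work) {
  const authors = (work.author ?? [])
    // Person authors carry given/family; organisations carry name.
    .map(a => [a.given, a.family].filter(Boolean).join(" ") || a.name)
    .filter(Boolean);
  const year = work.issued?.["date-parts"]?.[0]?.[0];
  const entry = {
    title: work.title?.[0] ?? "",
    href: `https://doi.org/${work.DOI}`,
    authors,
    // Use the journal or book title as the closest match for "publisher".
    publisher: work["container-title"]?.[0] ?? work.publisher ?? "",
  };
  if (year !== undefined) entry.date = String(year);
  return entry;
}
```

Fields with no counterpart in this shape, such as issue or page numbers, are simply dropped here, so a real integration would need to decide whether to extend the format.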

wetneb commented 5 years ago

> I was thinking "if starts with doi: then ping Crossref, else use internal DB".

That can work for retrieving metadata by identifiers, but what about the search endpoint? (When you don't know the identifier and want to find it by keywords)

> When Crossref responds, we convert the data to SpecRef's format (and cache it). So BikeShed and ReSpec don't need to care.

AFAICT SpecRef's format is not really suitable to represent academic papers: it is missing fields (DOI, journal, issue, etc) and has fields that do not make sense for academic papers (status for instance). We could expand SpecRef's format, but that will require adapting the renderers too.

tobie commented 5 years ago

> We could expand SpecRef's format, but that will require adapting the renderers too.

Yeah, that doesn't seem too complicated.

It's more complicated for Bikeshed, however, which pulls everything locally. @tabatkins, thoughts/interest here?

wetneb commented 5 years ago

I'm happy with the integration in SpecRef you suggest - I just thought it might be worth noting that if we go down that route, we could run into rate-limiting issues due to the fact that all requests to Crossref are going to be made via the same IP (SpecRef's server), whereas if the integration is done in ReSpec, calls to the API would be done by each client independently.

In my experience Crossref provides a reasonably reliable service, but I have no idea what sort of traffic SpecRef is exposed to.

tobie commented 5 years ago

> I'm happy with the integration in SpecRef you suggest - I just thought it might be worth noting that if we go down that route, we could run into rate-limiting issues due to the fact that all requests to Crossref are going to be made via the same IP (SpecRef's server), whereas if the integration is done in ReSpec, calls to the API would be done by each client independently.

SpecRef could fairly easily cache those resources in memory for a day.
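For illustration, an in-memory cache with a one-day lifetime could be as small as the sketch below. `TtlCache` is a hypothetical class, not something SpecRef ships, and the clock is injectable only to make expiry testable.

```javascript
// Hypothetical in-memory cache with a time-to-live of one day by default.
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

class TtlCache {
  constructor(ttlMs = ONE_DAY_MS, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }

  // Returns the cached value, or undefined if absent or expired.
  get(key) {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (this.now() - hit.storedAt >= this.ttlMs) {
      this.entries.delete(key); // evict stale entries lazily on read
      return undefined;
    }
    return hit.value;
  }

  set(key, value) {
    this.entries.set(key, { value, storedAt: this.now() });
  }
}
```

With this in place, a Crossref request would only be made when `get` returns `undefined` for a DOI, which keeps repeated citations of the same paper from counting against any rate limit.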

> In my experience Crossref provides a reasonably reliable service, but I have no idea what sort of traffic SpecRef is exposed to.

Light.

marcoscaceres commented 5 years ago

> That can work for retrieving metadata by identifiers, but what about the search endpoint? (When you don't know the identifier and want to find it by keywords)

Wouldn't one just go to search.crossref.org to get that?

> ... but that will require adapting the renderers too.

As @tobie said, that's pretty trivial. The renderer is tiny: https://github.com/w3c/respec/blob/develop/src/core/render-biblio.js#L191-L219

Writing an alternative renderer would also be pretty simple.

> I'm happy with the integration in SpecRef you suggest - I just thought it might be worth noting that if we go down that route, we could run into rate-limiting issues due to the fact that all requests to Crossref are going to be made via the same IP (SpecRef's server), whereas if the integration is done in ReSpec, calls to the API would be done by each client independently.

If it becomes widely used, we can also look at actually paying the fee to get more requests. I can't figure out if it's US$250 or US$2000... but anyway, the fees are not outrageous and we might be able to ask around for funding.

wetneb commented 5 years ago

Ok, I agree caching should solve the issue here, since you can expect that only a few papers are going to be cited in all specs developed with ReSpec / Bikeshed.

One issue with my proposal is that DOIs are generally not very human-readable, so it might not be very convenient for a writer to keep track of what they are citing when editing the specs. But I guess this is also true of some other SpecRef ids (such as IETF ones, I think), and we could always come up with a local alias system later on, client-side.
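One possible shape for such a client-side alias system is sketched below. It is purely hypothetical: neither `expandAliases` nor the alias map exists in ReSpec, and the alias name in the usage note is made up.

```javascript
// Hypothetical alias expansion: rewrite [[alias]] to [[doi:...]] before
// the usual citation processing runs. Unknown ids are left untouched.
function expandAliases(text, aliases) {
  return text.replace(/\[\[([^\[\]]+)\]\]/g, (whole, id) => {
    const doi = aliases.get(id.trim());
    return doi ? `[[doi:${doi}]]` : whole;
  });
}
```

A spec author could then declare `new Map([["my-paper", "10.1007/978-3-319-93417-4_5"]])` once and write `[[my-paper]]` in the text, while regular SpecRef ids such as `[[FileAPI]]` pass through unchanged.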

marcoscaceres commented 5 years ago

Ok, closing this as it seems we have rough consensus that this should be a SpecRef feature. @wetneb, let's spin up a new bug there if we want to discuss an implementation strategy.

speced / respec

Citing academic papers: Crossref integration #2568