tobie / specref

An open-source, community-maintained database of Web standards & related references.

http://www.specref.org/

Apache License 2.0

Crossref integration #568

Closed wetneb closed 5 years ago

wetneb commented 5 years ago

Quoting the original issue at https://github.com/w3c/respec/issues/2568:

Crossref is a database of about 100 million papers, identified by their Digital Object Identifier (DOI). It provides an HTTP API which can be used to search for papers and retrieve their metadata, just like SpecRef does. Therefore, it should be possible to use it just like SpecRef. For instance, I would like to be able to write something like this:
… has been extensively studied [[doi:10.1007/978-3-319-93417-4_5]]. However, …
which would fetch citation metadata from Crossref and insert it in the bibliography. It would generate a citation and link to the corresponding paper at https://doi.org/10.1007/978-3-319-93417-4_5.

There seems to be consensus that SpecRef itself should act as a proxy between clients (such as ReSpec, Bikeshed) and Crossref. In other words, it should be possible to retrieve DOI metadata via the SpecRef API.

Concretely, it should be possible to request something like this:

https://api.specref.org/bibrefs?refs=FileAPI,rfc2119,doi:10.1007/978-3-319-93417-4_5

and get as a response:

{
   "FileAPI": { … },
   "rfc2119": { … },
   "doi:10.1007/978-3-319-93417-4_5": {
        … metadata returned by Crossref …
   }
}
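A minimal sketch of the client side of this request, assuming the `doi:` prefix proposed here (it is not an existing SpecRef feature). The helper names `build_bibrefs_url` and `split_refs` are hypothetical; only the endpoint and the reference keys come from the example above.

```python
from urllib.parse import urlencode

# Endpoint taken from the example request above.
SPECREF_ENDPOINT = "https://api.specref.org/bibrefs"
# The "doi:" prefix is the syntax proposed in this issue (an assumption).
DOI_PREFIX = "doi:"


def build_bibrefs_url(refs):
    """Build a bibrefs request URL for a list of reference keys."""
    # Keep ",", ":" and "/" unescaped so the URL matches the example above.
    query = urlencode({"refs": ",".join(refs)}, safe=",:/")
    return f"{SPECREF_ENDPOINT}?{query}"


def split_refs(refs):
    """Separate ordinary SpecRef keys from DOI keys that would go to Crossref."""
    plain, dois = [], []
    for ref in refs:
        if ref.lower().startswith(DOI_PREFIX):
            dois.append(ref[len(DOI_PREFIX):])  # keep only the bare DOI
        else:
            plain.append(ref)
    return plain, dois


refs = ["FileAPI", "rfc2119", "doi:10.1007/978-3-319-93417-4_5"]
print(build_bibrefs_url(refs))
print(split_refs(refs))
```

One caveat with this syntax: a DOI that itself contains a comma would be split by the comma-separated `refs` parameter, so a real implementation would need an escaping rule.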

Problem A: Crossref's metadata format is different from SpecRef's. I see a few options:

1. Translating it to something similar to SpecRef's current format, adding some missing fields such as the journal or conference.
2. Returning Crossref's metadata as-is. This would probably mean adding a field to both SpecRef and Crossref records to indicate the format. For instance, "$schema":"URI of a JSON schema describing the format of this record". Or any other syntax.

Option 1 means SpecRef and Crossref records will have common fields (such as title, authors) that consumers can rely on without changing their renderer much. But it means we will be discarding information (for instance, ORCID ids for authors cannot be represented easily) and adding potentially complex logic in SpecRef to translate from one format to another (which will need maintaining as the formats evolve).

Option 2 basically forces clients to use different renderers for each format. But that lets clients handle Crossref references using all the information available in Crossref's metadata, which can be useful. And it is probably easier to maintain on SpecRef's side as the code remains independent of any format change on Crossref's side.
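To illustrate what Option 2 asks of clients, here is a rough sketch of a renderer that dispatches on a "$schema" field. The schema URIs are placeholders, and the SpecRef-side field names ("title", "href") are assumed for illustration; the Crossref-side names ("title" as a list, "DOI") follow Crossref's metadata.

```python
# Placeholder schema URIs; neither project publishes these.
SPECREF_SCHEMA = "https://example.org/schemas/specref-entry.json"
CROSSREF_SCHEMA = "https://example.org/schemas/crossref-work.json"


def render_specref(entry):
    """Render a SpecRef-style record (field names assumed for illustration)."""
    return f'{entry["title"]}. URL: {entry["href"]}'


def render_crossref(entry):
    """Render a Crossref-style record; Crossref stores "title" as a list."""
    return f'{entry["title"][0]}. URL: https://doi.org/{entry["DOI"]}'


RENDERERS = {
    SPECREF_SCHEMA: render_specref,
    CROSSREF_SCHEMA: render_crossref,
}


def render(entry):
    """Pick a renderer from the record's "$schema" field."""
    # Records without the field are treated as SpecRef's own format.
    schema = entry.get("$schema", SPECREF_SCHEMA)
    if schema not in RENDERERS:
        raise ValueError(f"no renderer for schema {schema!r}")
    return RENDERERS[schema](entry)
```

Every client would need one such renderer per format, which is the cost Option 2 trades for keeping SpecRef independent of Crossref's format.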

Problem B: Architecturally speaking, how do we integrate this in SpecRef? Do we want to make it possible to integrate other bibliographic databases in the same way (so, coming up with a modular design)? What should it look like?

tobie commented 5 years ago

I’d like to get feedback from @tabatkins on this, as it is essentially going to break Bikeshed’s ability to fully work offline.

How many papers are we expecting people will actually reference?

Have we considered instead creating a really simple solution (UI?) to add those to the specref repo itself?

tobie commented 5 years ago

Additionally, what specific data that’s missing from Specref do you need to reference these studies in specs? Could you give us an example of how you want them referenced?

wetneb commented 5 years ago

Sure! Here is an example of Crossref metadata for a sample DOI. The following fields can be useful to render academic references:

Type of document ("type") to determine how to render it
DOI ("DOI")
ISBN ("ISBN") for articles published as book chapters, see this example
ISSN ("ISSN") for articles published in journals
Journal or book name ("container-title")
Book or journal editors ("editor")
Subtitle ("subtitle") as some publishers sometimes split titles into two and the second part is stored there
Publication date ("issued") which can have various granularity (in the example above it only provides year and month, not day)
Page numbers ("page")
Issue numbers ("issue")
ORCID ids for authors (in the "author" object, can be seen in this example)
Potentially the publisher ("publisher"), although this is often omitted
Potentially the publisher location ("publisher-location"), although this is also often omitted.
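As a rough illustration of reading these fields, the sketch below pulls a few of them out of a Crossref-style record and handles the variable granularity of "issued". The sample record is hand-written for this example: its month and ORCID values are placeholders, not the real metadata, and the helper names are hypothetical.

```python
def first(values):
    """Crossref stores several fields (e.g. "container-title") as lists."""
    return values[0] if values else None


def issued_date(record):
    """Format "issued" as YYYY, YYYY-MM or YYYY-MM-DD, whichever is available."""
    date_parts = record.get("issued", {}).get("date-parts") or [[]]
    parts = [p for p in date_parts[0] if p is not None]
    if not parts:
        return None
    year, *rest = parts
    return "-".join([str(year)] + [f"{p:02d}" for p in rest])


def summarize(record):
    """Collect the fields a renderer is most likely to need."""
    return {
        "type": record.get("type"),
        "doi": record.get("DOI"),
        "container": first(record.get("container-title")),
        "pages": record.get("page"),
        "issued": issued_date(record),
        "orcids": [a["ORCID"] for a in record.get("author", []) if "ORCID" in a],
    }


# Hand-written sample; month and ORCID are placeholder values.
SAMPLE = {
    "type": "book-chapter",
    "DOI": "10.1007/978-3-319-93417-4_5",
    "container-title": ["Lecture Notes in Computer Science"],
    "page": "65-80",
    "issued": {"date-parts": [[2018, 6]]},
    "author": [
        {"given": "Wouter", "family": "Beek",
         "ORCID": "https://orcid.org/0000-0000-0000-0000"},
        {"given": "Joe", "family": "Raad"},
    ],
}

print(summarize(SAMPLE))
```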

Clients might want to render such references using any bibliographical style. Here are a few examples for this article (obviously GitHub's markdown does not let me add all the typographical tweaks so that they look nice, but you would be able to do that in clients):

Vancouver style with links on authors (via ORCID):

Beek W, Raad J, Wielemaker J, van Harmelen F. sameAs.cc: The Closure of 500M owl:sameAs Statements. Lecture Notes in Computer Science [Internet]. Springer International Publishing; 2018;65–80. doi:10.1007/978-3-319-93417-4_5

Harvard style, more compact:

Beek, W. et al., 2018. sameAs.cc: The Closure of 500M owl:sameAs Statements. Lecture Notes in Computer Science, pp.65–80. Available at: http://dx.doi.org/10.1007/978-3-319-93417-4_5.

APA style (note that this requires being able to extract initials for first names, so first and last names should ideally be stored as separate fields):

Beek, W., Raad, J., Wielemaker, J., & van Harmelen, F. (2018). sameAs.cc: The Closure of 500M owl:sameAs Statements. Lecture Notes in Computer Science, 65–80. doi:10.1007/978-3-319-93417-4_5
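A small sketch of why separate name fields matter for this style: with "given" and "family" stored separately, as in Crossref's "author" objects, the initials are easy to derive. The function names are hypothetical and the formatting is a simplified approximation of APA, not a complete implementation of the style.

```python
def initials(given):
    """Turn "Jan" into "J." and "Mary Ann" into "M. A."."""
    return " ".join(part[0] + "." for part in given.split())


def apa_author(author):
    """Format one author as "Family, G."."""
    return f'{author["family"]}, {initials(author["given"])}'


def join_authors(names):
    """Join formatted names, with ", & " before the last one."""
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + ", & " + names[-1]


def render_apa(record):
    """Render a simplified APA-like citation from a Crossref-style record."""
    authors = join_authors([apa_author(a) for a in record["author"]])
    year = record["issued"]["date-parts"][0][0]
    pages = record["page"].replace("-", "\u2013")  # en dash between page numbers
    return (
        f'{authors} ({year}). {record["title"][0]}. '
        f'{record["container-title"][0]}, {pages}. doi:{record["DOI"]}'
    )


# Values taken from the citations shown in this comment.
RECORD = {
    "author": [
        {"given": "Wouter", "family": "Beek"},
        {"given": "Joe", "family": "Raad"},
        {"given": "Jan", "family": "Wielemaker"},
        {"given": "Frank", "family": "van Harmelen"},
    ],
    "issued": {"date-parts": [[2018]]},
    "title": ["sameAs.cc: The Closure of 500M owl:sameAs Statements"],
    "container-title": ["Lecture Notes in Computer Science"],
    "page": "65-80",
    "DOI": "10.1007/978-3-319-93417-4_5",
}

print(render_apa(RECORD))
```

If authors were stored only as full-name strings, the renderer would have to guess where a family name such as "van Harmelen" begins.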

Wikipedia style, more verbose (note the inclusion of the editors, which are not actually available in Crossref's metadata in this particular example):

Beek, Wouter; Raad, Joe; Wielemaker, Jan; van Harmelen, Frank (2018). "sameAs.cc: The Closure of 500M owl:sameAs Statements". In Aldo Gangemi, Roberto Navigli, Maria-Esther Vidal, Pascal Hitzler, Raphaël Troncy, Laura Hollink, Anna Tordai, Mehwish Alam (eds.). The Semantic Web. 10843. Cham: Springer International Publishing. pp. 65–80. ISBN 978-3-319-93416-7. DOI 10.1007/978-3-319-93417-4_5. Retrieved 2019-11-11.

You could also want to render the citation in machine-readable format using the COinS standard, where identifiers (DOI, ISBN, ISSN, ORCID) become super useful too.

I would expect people could reference a few dozen papers per spec, so perhaps a few thousand to tens of thousands would end up in SpecRef's cache in the coming years, depending on usage?

About a simple UI solution to push these in SpecRef's refs/biblio.json, why not… but wouldn't it be a bit dirty to store these there? If you are happy with that I am clearly not going to stand in the way of course.

If this feature is going to break Bikeshed's isolation from the network, then I would be keen to go back to my initial proposal to do this client-side in ReSpec directly. But it's up to you :)

tobie commented 5 years ago

> Clients might want to render such references using any bibliographical style

Could you come up with something that's close enough to what Bikeshed and ReSpec already do and specify what data would be needed to render those?

From the above it feels like there actually isn't a lot missing from SpecRef's data model for this.

> About a simple UI solution to push these in SpecRef's refs/biblio.json, why not… but wouldn't it be a bit dirty to store these there? If you are happy with that I am clearly not going to stand in the way of course.

Well, we'd give it its own biblio file, automate updates to it, etc. I'm more concerned about the licensing terms here, which are unclear. Do you know what the DB is licensed under?

> If this feature is going to break Bikeshed's isolation from the network, then I would be keen to go back to my initial proposal to do this client-side in ReSpec directly. But it's up to you :)

That would hurt spec portability. Adding them in a cache like I'm suggesting solves that problem.

wetneb commented 5 years ago

> Could you come up with something that's close enough to what Bikeshed and ReSpec already do and specify what data would be needed to render those?

My gut feeling is that it would not hurt to include all the fields mentioned above, so that our hands are not tied to a particular rendering style. If you want to store this metadata in a separate file, then I'd advocate for keeping Crossref's original format directly (perhaps only removing the citations, which can take a lot of space and are clearly not going to be useful to render references). I don't see the value in coming up with our own metadata format derived from theirs, where we remove metadata fields just because we don't want to render them right now. But I am perhaps missing some constraints on your side?

> From the above it feels like there actually isn't a lot missing from SpecRef's data model for this.

Aren't most of the fields mentioned above missing?

> Well, we'd give its own biblio file. Automate updates to it, etc. I'm more concerned about the licensing terms, here, which are unclear. Do you know what the DB is licensed under?

As metadata this can be freely copied and redistributed: "Crossref asserts no claims of ownership to individual items of bibliographic metadata and associated Digital Object Identifiers (DOIs) acquired through the use of the Crossref Free Services. Individual items of bibliographic metadata and associated DOIs may be cached and incorporated into the user's content and systems."

tobie commented 5 years ago

> Aren't most of the fields mentioned above missing?

No. See JSON schemas.

> My gut feeling is that it would not hurt to include all the fields mentioned above, so that our hands are not tied to a particular rendering style.

Given limited resources to implement and maintain this, I feel quite the opposite: agree on a particular rendering style that’s as close as possible to what’s existing. Add just the extra fields that are necessary (so not splitting last names and first names, for example), and combine things like title and subtitle in a single field.

If one day we magically have more resources and want a different style, busting that cache will be trivial.
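A minimal sketch of the kind of mapping described here, under stated assumptions: the output keys ("title", "authors", "href", "publisher") are assumed to resemble SpecRef's existing entries, and `crossref_to_specref` is a hypothetical name. Title and subtitle are merged into one field and names are kept as single strings.

```python
def crossref_to_specref(record):
    """Map a Crossref-style record onto a small SpecRef-like entry."""
    # Crossref stores "title" and "subtitle" as lists; merge them into one string.
    title = record["title"][0]
    subtitle = (record.get("subtitle") or [""])[0]
    if subtitle:
        title = f"{title}: {subtitle}"

    # Keep each author as one string instead of separate first/last names.
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in record.get("author", [])
    ]

    entry = {
        "title": title,
        "authors": authors,
        "href": f'https://doi.org/{record["DOI"]}',
    }
    # The publisher is often omitted in Crossref, so only copy it when present.
    if record.get("publisher"):
        entry["publisher"] = record["publisher"]
    return entry


# Hypothetical record in which the publisher split the title in two.
example = {
    "title": ["sameAs.cc"],
    "subtitle": ["The Closure of 500M owl:sameAs Statements"],
    "author": [{"given": "Wouter", "family": "Beek"}],
    "DOI": "10.1007/978-3-319-93417-4_5",
    "publisher": "Springer International Publishing",
}
print(crossref_to_specref(example))
```

Fields with no counterpart in this mapping, such as ORCID ids or page numbers, are dropped, which is the trade-off discussed earlier in the thread.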

wetneb commented 5 years ago

OK! It looks like I don't fully understand what setting up this cache involves for you, since I was expecting it to be less maintenance work to reuse their existing format, rather than having to maintain code that does the translation. I am not going to stand in the way and will let you do it how you think fits the platform best :)

wetneb commented 5 years ago

A proof of concept can be found here: https://github.com/w3c/respec/pull/2569

@tobie, that demonstrates a possible rendering method, which uses many of the fields mentioned above. (But then again, this renderer is probably going to evolve as it gets used on various examples, so having to decide on a rendering style before deciding how the metadata is stored feels a bit like putting the cart before the horse to me). I hope it helps! Anyway, that solution works for me.

tobie commented 5 years ago

Well, the whole point of Specref is to offer as common a data format as possible for all of its references, with the explicit goal to focus on spec edition use cases (and related tools).

So what would be ideal is providing a mapping between the two so that the API users have as little to change as possible in their code.

wetneb commented 5 years ago

Sure! I don't feel like I can make these design decisions concerning your format, so I will leave that to anyone who wants to push this further.

tobie commented 5 years ago

So I assumed from our earlier exchanges that you would be interested in submitting a PR for this. To set expectations, I don’t imagine I’ll find the time to work on this myself in the foreseeable future. So unless someone else is interested in working on this, it probably won’t get done.

marcoscaceres commented 5 years ago

Yes, that’s also my understanding. We can provide guidance, but are unlikely to be able to work on the PR.

wetneb commented 5 years ago

Ok, I should clarify my take on this. When I first opened the issue in ReSpec I had a simple idea in mind (the one I demonstrated in w3c/respec#2569), which I could afford spending time on.

With the architecture you are proposing, it looks like this would take me much more time. As I tried to explain above, I will surely figure out issues with Crossref's metadata as I use the feature. So I will want to tweak the renderer and the code that does the translation from Crossref's format to yours. To add another field for instance, this means getting PRs merged in both repositories, with potential disagreements about the format and translation code. If this metadata is to be stored in a JSON file indexed in the git repository, like the other references in Specref, this probably means maintaining and running a bot to push updates, which is also adding needless maintenance effort as far as I can tell.

Also, I have doubts about the usefulness of coming up with yet another bibliographic metadata format, since there are plenty of well established alternatives. Spending time to create and maintain a translator from Crossref's format to Specref's does not feel like a good use of my time (having done the exact same thing in other projects before and not enjoying it very much!).

So that's my motivation. As a potential contributor, I want to spend my time on something I enjoy, I want to implement an approach I believe in (and ideally I want to learn stuff on the way too). I also want to keep my own time in control: make sure the hours I spend on this will eventually deliver what I need. Having to submit PRs to both repositories and witnessing some differences in the way we see things, by experience, my intuition is that this is likely to take me more time than I can afford and could well result in nothing being released.

That being said, I would be very happy if your architecture got implemented. If not, my own fork works fine for me. I hope you guys take this in a positive way, I totally respect your own choices and I enjoyed the conversation. Thanks for maintaining this!

tobie commented 5 years ago

> With the architecture you are proposing, it looks like this would take me much more time. As I tried to explain above, I will surely figure out issues with Crossref's metadata as I use the feature. So I will want to tweak the renderer and the code that does the translation from Crossref's format to yours. To add another field for instance, this means getting PRs merged in both repositories, with potential disagreements about the format and translation code.

Why can't you do this work upfront? Figuring out how to reference a document in a way that serves the purpose of referencing it (i.e. make it possible for someone to find it and read it if they want to) doesn't seem like a huge endeavor.

> If this metadata is to be stored in a JSON file indexed in the git repository, like the other references in Specref, this probably means maintaining and running a bot to push updates, which is also adding needless maintenance effort as far as I can tell.

There is already a bot automating all of this for Specref. This is pretty much what specref is.

> Also, I have doubts about the usefulness of coming up with yet another bibliographic metadata format, since there are plenty of well established alternatives.

Precisely. Why would we add another data format to Specref when there's a perfectly fitting one that already exists and solves the intended use cases?

> Spending time to create and maintain a translator from Crossref's format to Specref's does not feel like a good use of my time (having done the exact same thing in other projects before and not enjoying it very much!).

We have numerous translators for different data sources. They were written once, some of them years ago. I don't think any of them has been updated since. They're generally 5 to 10 lines of code.

> So that's my motivation. As a potential contributor, I want to spend my time on something I enjoy, I want to implement an approach I believe in (and ideally I want to learn stuff on the way too). I also want to keep my own time in control: make sure the hours I spend on this will eventually deliver what I need. Having to submit PRs to both repositories and witnessing some differences in the way we see things, by experience, my intuition is that this is likely to take me more time than I can afford and could well result in nothing being released.

Yes, I understand you want to respect your time. I'd appreciate it if you considered mine similarly.

> That being said, I would be very happy if your architecture got implemented.

Unfortunately, this tends not to happen by magic and requires actual work.

> If not, my own fork works fine for me. I hope you guys take this in a positive way, I totally respect your own choices and I enjoyed the conversation. Thanks for maintaining this!

Your attitude is the canonical example of why maintainers of open source projects burn out and stop working on open source.

I'm closing this issue as it makes me sad.

wetneb commented 5 years ago

I am really sorry that it makes you sad! I apologize if you feel like I abused your time.

> Why can't you do this work upfront? Figuring out how to reference a document in a way that serves the purpose of referencing it (i.e. make it possible for someone to find it and read it if they want to) doesn't seem like a huge endeavor.

Crossref contains 100 million references, contributed by many different publishers, who all have different curation pipelines, so there is a lot of variability there. They still use Crossref's format, sure, but they have found many ways to shoehorn their data into it. There are a lot of publication types too. It's a rich data model, with a long history.

> There is already a bot automating all of this for Specref. This is pretty much what specref is.

I assume integrating Crossref would mean adding support for Crossref in that bot, and that's typically not the sort of work I had in mind.

> Why would we add another data format to Specref when there's a perfectly fitting one that already exists and solves the intended use cases?

I clearly do not want to add another data format there: as I showed in my proof of concept, I don't think Crossref metadata belongs in Specref at all. According to this repository's description, it is "An open-source, community-maintained database of Web standards & related references". Why should it become a proxy to fetch references for academic papers?

What I mean by reusing bibliographic standards is that Specref itself could use an established format for the references it stores. There is a lot of work on this, there are even standards to represent the rendering process like the Citation Style Language. If it used a standard format, I would be more inclined to work on a translator.

> Unfortunately, this tends not to happen by magic and requires actual work.

I am very well aware of that - I suspect it is more than writing "5 to 10 lines of code" so that's why I am pulling back.

> Your attitude is the canonical example of why maintainers of open source projects burn out and stop working on open source.

I am really sorry. I would be interested to know what I could have done differently in our interactions to make you happier.

tobie commented 5 years ago

> If it used a standard format, I would be more inclined to work on a translator.

The Specref issue tracker is littered with people wanting Specref to move to whatever format they fancy, with no concrete use cases in mind or actual benefits for the users of Specref. Unsurprisingly, none are ever willing to do the actual work it takes to make it happen.

Specref has a ton of tech debt. It was built on a shoestring budget and has no resources. And yet it enables hundreds of specs to have up to date references to thousands of resources. That's its purpose. Not fulfilling anyone's pet dream about data formats.

> I am really sorry. I would be interested to know what I could have done differently in our interactions to make you happier.

Understand the constraints (time, resources, maintenance costs, existing architecture, etc.) the project and its ecosystem of API consumers are working under. Focus on enabling an actual outcome that's good enough most of the time. Be humble. Ask questions. Do the work.

I'd add that this isn't about "making me happier," just about respecting other people's time, not only your own.

I'm going to be locking this issue in the interest of protecting my time, as I don't want to debate this further.

tabatkins commented 4 years ago

Sidestepping the conversation about integrating this into SpecRef itself, I've got no problem integrating an additional data source into Bikeshed's data, if it's high-quality and useful.

As Tobie alludes, I'd prefer getting all the data from the source directly; an HTTP-only API means you can't build a Bikeshedded spec offline. That said, I'm not against it philosophically; I already have some HTTP-based APIs, like the "github issues" API, that you can opt into. And since Crossref claims to have 100 million papers, that's, uh, a lot of data for every Bikeshed user to download. (The current Bikeshed data files directory is about 50MB; this would increase it to several gigs.)