transpose-publishing / policies-database

Database of journal policies: TRANsparency in Scholarly Publishing for Open Scholarship Evolution
Creative Commons Zero v1.0 Universal

Data fields #1

Closed jpolka closed 6 years ago

jpolka commented 6 years ago

2018-05-08 update: Please see this comment for a revised architecture and list of data fields.

For each journal record, we hope to assemble the following fields (draft; please edit):

Journal information

Peer review

Open Peer Review

Co-reviewers

(NEW suggestion) Peer review transfer

(NEW suggestion after conversation with B. Konforti) Peer review credit

Preprints

All data from SHERPA RoMEO is released under a CC BY-NC-SA 2.5 license and we will follow the specified conditions for reuse.

tonyR-H commented 6 years ago

For peer review, add

jpolka commented 6 years ago

Done @tonyR-H - also proposed sections on peer review transfer - one question could come from the policy, but the question about structure could be collected in the same way as the one about whether there is a space in the form for co-reviewers.

dhimmel commented 6 years ago

The outline of the data fields is useful. There are two aspects I think we want to decide on before getting started.

First, what catalog of journals to use? This is important because the resource we choose here will define what journals are in the database. Currently, the plan appears to be to use RoMEO journals. Other options would be Crossref or Scopus. Crossref is probably best IMO, as it seems to be the long-term provider for this type of information that the community is coalescing behind. They also claim no copyright on the metadata. Compared to Scopus, Crossref is more dependent on publishers depositing information, which can lead to issues, but the DOI system has already matured quite a bit and things are improving. Using ISSNs we can map journals between resources as well.

Second, is there an architecture where we can release the data we create as CC0? Under the current plan, will we be required to release the data we create under a CC BY-NC-SA license (because of share alike)? If so, I think that is highly undesirable. If we are going to put in the time, it'd be nice to contribute to an open ecosystem rather than grow a closed one. One option would be to separate the original data we collect from RoMEO data. We could then combine the two into a CC BY-NC-SA resource as needed, but the raw data we create will remain unencumbered. This would even be a natural workflow were we to use Crossref to identify journals.

So both issues really relate to how central RoMEO will be in this effort. I see why it makes sense to simply extend RoMEO from a convenience perspective and due to the overlap in scope. However, the licensing issues are worrisome. In addition, I am not sure how RoMEO manages its journal database and whether it will be as reliable as Crossref's catalog going forward. Apologies for derailing the focus of this issue :smirk_cat:

jpolka commented 6 years ago

@dhimmel, I think there are two really useful features of the RoMEO data. The first is that journals are mapped to not only publishers (I imagine this is true for Crossref? Unfortunately it is not for Wikidata) but also groups of journals (for example, Cell takes its policies from the Cell Press policies, not its parent, Elsevier). I am not sure how these relationships are represented in Crossref. Speaking of this, it might be good to have a way for people to indicate whether the policies they are inputting are publisher (or journal group) defaults or journal-specific...

Second, RoMEO has preprint information. This will make it much easier to at least focus attention relating to preprints on journals at which these issues are not entirely moot.

If publisher relationships are available in Crossref, these data could be used for everything but the preprint fields. If we are already asking people to tell us which preprint version is allowed, we could also just make one of the options "none", which would functionally identify the preprint policy of the journal. In this way the information displayed via CI could be entirely CC0. But presumably the YAML files could contain the SHERPA preprint information to help contributors focus their attention on relevant journals. What do you think?

Regardless of licensing, I think it would be great to be able to make sure that SHERPA RoMEO or other parties could display some of the information we collect along with RoMEO data, if they wanted to. Therefore interoperability with their identified publishers would be appealing.

dhimmel commented 6 years ago

journals are mapped to not only publishers but also groups of journals (for example, Cell takes its policies from the Cell Press policies, not its parent, Elsevier).

Cool! So RoMEO includes an ontology of publishing entities. And the curators' job here will be to annotate entities with policies. Each entity would have a YAML file, and we would apply policies to the highest-level entity to which a policy applies (i.e. Cell Press not Cell). When displaying the information, we can make subentities inherit the policies of superentities (by propagating policies down the directed acyclic graph). Assuming that the ontology data is readily available and that RoMEO maintains this resource, I agree it will be preferable to Crossref.
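The inheritance described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each entity has at most one parent (as a single parent ID per policy would imply), the entity IDs and field names are hypothetical placeholders, and a value set on a more specific entity overrides the value inherited from its ancestors.

```python
from typing import Dict, Optional


def resolve_policies(
    entity_id: str,
    parents: Dict[str, Optional[str]],
    annotations: Dict[str, Dict[str, str]],
) -> Dict[str, str]:
    """Merge policy annotations from the top-level ancestor down to entity_id.

    Assumption: each entity has at most one parent, so the ontology is a tree.
    """
    chain = []
    seen = set()
    current: Optional[str] = entity_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current!r}")
        seen.add(current)
        chain.append(current)
        current = parents.get(current)

    resolved: Dict[str, str] = {}
    # Apply the most general entity first so that more specific ones override it.
    for ancestor in reversed(chain):
        resolved.update(annotations.get(ancestor, {}))
    return resolved


# Hypothetical IDs and field names, for illustration only.
parents = {"publisher": None, "journal-group": "publisher", "journal": "journal-group"}
annotations = {
    "publisher": {"field-a": "publisher default", "field-b": "publisher default"},
    "journal-group": {"field-b": "group override"},
}
resolved = resolve_policies("journal", parents, annotations)
```

With this layout, curators only annotate the highest-level entity a policy applies to, and the journal-level view is computed at display time.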

it would be great to be able to make sure that SHERPA RoMEO or other parties could display some of the information we collect along with RoMEO data

Certainly! I'd like our work to be CC0 so anyone can reuse this information without having to worry about legal impediments. Hopefully, we can make it quite easy for other projects to integrate our data, so RoMEO, Wikidata & others can help broadcast this information. The main question is what architectures will permit us to openly release our work, given that RoMEO data is released under CC BY-NC-SA 2.5? One important aspect of the license is the definition of "derivative work":

"Derivative Work" means a work based upon the Work or upon the Work and other pre-existing works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which the Work may be recast, transformed, or adapted, except that a work that constitutes a Collective Work will not be considered a Derivative Work for the purpose of this License.

In addition, here is the share alike stipulation:

You may distribute, publicly display, publicly perform, or publicly digitally perform a Derivative Work only under the terms of this License, a later version of this License with the same License Elements as this License, or a Creative Commons iCommons license that contains the same License Elements as this License (e.g. Attribution-NonCommercial-ShareAlike 2.5 Japan).

My understanding of these clauses and of copyright law is that any files that include RoMEO data may be derivatives and trigger Share Alike. However, files that don't contain RoMEO data but that were computed using RoMEO data wouldn't trigger Share Alike. Hence, we can use the RoMEO ontology to propagate policy annotations without having to share alike the resulting outputs. However, we will need each source YAML file to contain at least the RoMEO entity ID. Perhaps SHERPA would be willing to allow us to use the journal IDs and names in a CC0 resource. This would be similar to the licensing approach DrugBank adopted where basic drug information and identifiers are CC0 with the detailed data under a non-commercial license. This allows using the resource's identifiers without viral licensing issues.

So I think we should ask SHERPA if we can release files with journal/publisher IDs as CC0. Under this proposal, files that contained preprint links or other relationships from RoMEO would still be CC BY-NC-SA. However, our source files would be CC0 opening up the possibility for unrestricted reuse to users who do not need the RoMEO aspects of the data.

If we cannot use RoMEO IDs under CC0, then our choices would be to consider fair use or to consider replicating SHERPA's journal ontology efforts to create an openly-licensed resource. I think duplicating these efforts would be considerably worse for both us and SHERPA, so hopefully we avoid that route.

jpolka commented 6 years ago

Daniel, do you know how similar the ISSNs in Crossref and SHERPA are? (a list is available w/our API key at http://www.sherpa.ac.uk/downloads/) If they are similar, I assume we could match journals to entities via Crossref ISSNs.

This isn't a given, though: in Wikidata, some journals have many ISSNs (print and electronic versions have separate ISSNs - in SHERPA the latter are called ESSNs, I believe) and there can be multiples of each (presumably if a journal changes publishers?)

jpolka commented 6 years ago

Seeing your Crossref issue, could we also do this matching with journal names?

dhimmel commented 6 years ago

could we also do this matching with journal names?

We should avoid automated mapping by names at all costs. The different databases will likely use different names for the same journals. If we do need to map journals by names, we will need to manually review each mapping (and consider whether we can contribute information so the mapping can be done automatically in the future).

BTW in the past, I've mapped Crossref DOIs (articles) to journals using ISSNs, so I know they have this information (at least for specific articles).
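A rough sketch of ISSN-based matching, which avoids names entirely, could look like the following. The input shapes (dictionaries from a record ID to a list of ISSNs), the record IDs, and the ISSN values are all made-up placeholders, not the real Crossref or SHERPA formats. Because a journal can carry several ISSNs, each record is compared on its full set.

```python
import re
from typing import Dict, List, Optional


def normalize_issn(raw: Optional[str]) -> Optional[str]:
    """Return an ISSN as 'NNNN-NNNC', or None if it is malformed."""
    compact = re.sub(r"[^0-9Xx]", "", raw or "").upper()
    if not re.fullmatch(r"\d{7}[\dX]", compact):
        return None
    return f"{compact[:4]}-{compact[4:]}"


def match_by_issn(
    crossref: Dict[str, List[str]], sherpa: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Map each Crossref record ID to the SHERPA IDs sharing at least one ISSN."""
    index: Dict[str, set] = {}
    for sherpa_id, issns in sherpa.items():
        for issn in issns:
            normalized = normalize_issn(issn)
            if normalized:
                index.setdefault(normalized, set()).add(sherpa_id)

    matches: Dict[str, List[str]] = {}
    for crossref_id, issns in crossref.items():
        hits: set = set()
        for issn in issns:
            normalized = normalize_issn(issn)
            if normalized:
                hits |= index.get(normalized, set())
        matches[crossref_id] = sorted(hits)
    return matches


# Placeholder IDs and ISSNs, for illustration only.
crossref = {"cr-1": ["1234-5678", "8765-4321"], "cr-2": ["0000-0000"]}
sherpa = {"s-1": ["12345678"], "s-2": ["1111-1111"]}
matches = match_by_issn(crossref, sherpa)
```

Records with no match, or with more than one, are the ones that would need manual review.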

jpolka commented 6 years ago

While the Crossref API issue gets sorted, a .csv of Crossref journals with print and electronic ISSNs can be found here: https://support.crossref.org/hc/en-us/articles/213197226-Browsable-title-list

jpolka commented 6 years ago

To summarize and clarify the question we are hoping to ask about reuse of SHERPA ontology:

We would like to collect and release information under CC0 so as to maximize the potential for its impact and reuse.

However, we would also like to have journals in our database mapped to their publisher or publisher group (i.e. Cell Press, which has a parent Elsevier) since that is the level at which many policies are expected to operate. As far as we know, SHERPA RoMEO's (Share Alike) data is the best source of these relationships. When users edit a journal entry, they could be asked whether they want the changes to propagate to the whole journal group (provided its name and perhaps list of journals in it), or whether the policies apply JUST to this particular journal, or a subset of journals in the journal group (and therefore, a new policy entity would be created as a child of its parent).

In order to accomplish this mapping, the most straightforward approach would be to include SHERPA RoMEO's field romeopub, which identifies the relevant policy, as a field in the database. A second way to do this would be for us to make our own identifiers that are mapped to romeopub via a data structure stored in a separate file. This file would be used essentially any time a user updates the database in order to propagate the changes to appropriate entries.

We should therefore ask SHERPA either A) whether they would be willing to release a resource that maps entities to one another (jtitle or ISSN to romeopub, and romeopub to parentid) under CC0, so that our data can be openly reusable even if it includes romeopub in the database entries, or B) whether they agree with Daniel's interpretation of Share Alike:

However, files that don't contain RoMEO data but that were computed using RoMEO data wouldn't trigger Share Alike.

If they do agree, then we could use a separate data structure file (released under SA, since it contains romeopub) to propagate information between journal entries, which would themselves be released under CC0.

Option C) might be to include romeopub in the database entries, but then strip it out to export a CC0 copy of the database that lacks ontology but still is useful for looking up policies. It would be less reusable, though.

A second request might be whether fields relating to preprints (prearchiving, prerestrictions, copyrighturl) could also be released CC0. Alternatively, we could include them initially and strip them out later. This would not adversely affect the reuse of the database because the information would be effectively reproduced in more detailed data fields (such as what version, if any, of a preprint is acceptable).
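The "include, then strip out" idea from Option C and the preprint fields could be sketched as below. The set of stripped fields is taken from the field names mentioned in this thread; the record layout and the extra field name are hypothetical. Whether removing these fields is enough to permit a CC0 release is exactly the question for SHERPA, so this only shows the mechanics.

```python
from typing import Dict, Iterable, List

# Fields named in this thread as coming from SHERPA/RoMEO.
ROMEO_DERIVED_FIELDS = frozenset(
    {"romeopub", "parentid", "prearchiving", "prerestrictions", "copyrighturl"}
)


def export_without_romeo_fields(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return copies of the records with the RoMEO-derived fields removed."""
    return [
        {key: value for key, value in record.items() if key not in ROMEO_DERIVED_FIELDS}
        for record in records
    ]


# Hypothetical record, for illustration only.
records = [
    {
        "romeopub": "0000",
        "parentid": "0001",
        "prearchiving": "placeholder",
        "contributed-field": "contributor-supplied value",
    }
]
exported = export_without_romeo_fields(records)
```

The originals are left untouched, so the full records and the stripped export can be published side by side under their respective licenses.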

Other information (such as journal name, ISSN) could be taken from Crossref (which is CC0), so it is really these two areas (romeopub and their parentids, and the set of fields concerning preprints) that we are interested in.

Let me know if this is a fair summary of our options!

SamanthaHindle commented 6 years ago

Hey @jpolka I added one point above "- Does the journal make it clear in the invite email that co-reviewers can contribute? (yes/no)"

I realized afterwards that I probably should have listed it here instead. Feel free to remove or edit it.

I'm glad to see the inclusion of the journal's policy on incorporating community reviews into peer review 😁 ❤️

jpolka commented 6 years ago

Great @SamanthaHindle - thank you! I also updated the text to include the following:

(NEW suggestion after conversation with B. Konforti) Peer review credit

jpolka commented 6 years ago

Following a conversation with @dhimmel, we will have one YAML file per romeopub (policy ID from SHERPA/RoMEO).

We will make it clear to contributors that their contributions are licensed CC0. However, the policy files will be licensed CC BY-NC-SA 2.5 because they contain SHERPA RoMEO information. We will follow the specified conditions for reuse.

Here is a revised list of data fields. @SamanthaHindle, @tonyR-H and @garymcdowell - let me know what you think? Specifically, if we collect the urls, do we need the "journal's policy about" fields, which may be very long/unwieldy?

Furthermore, after a conversation with Monica Gradanos, it would be helpful to also set up a Google form to lower the entry barrier to making changes for users who are not comfortable with GitHub. If we could transform the .csv output into the YAML format, cutting and pasting into the database would be easier.
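One way the .csv-to-YAML step might work, using only the Python standard library, is sketched here. It assumes one form response per row, column headers that are already valid YAML keys, and flat string values; the column names in the example are hypothetical. Values are written with `json.dumps`, relying on JSON strings being valid YAML double-quoted scalars, so no YAML library is needed.

```python
import csv
import io
import json
from typing import List


def csv_to_yaml_documents(csv_text: str) -> List[str]:
    """Convert each CSV row into a flat YAML mapping, returned as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row in reader:
        # A JSON-encoded string is also a valid YAML double-quoted scalar.
        lines = [f"{key}: {json.dumps(value)}" for key, value in row.items()]
        documents.append("\n".join(lines) + "\n")
    return documents


# Hypothetical column names, for illustration only.
csv_text = "romeopub,example-field\n123,yes\n"
documents = csv_to_yaml_documents(csv_text)
```

Each returned string could then be pasted into, or written out as, the YAML file for the corresponding romeopub.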

Data fields

Suggested labels for the YAML file are bolded. Comments follow. (Allowed responses for validation in parentheses - these should also be visible in the comments.)

romeopub Policy ID from SHERPA/RoMEO (Do not edit)

journals A list of the journals from SHERPA/RoMEO associated with this romeopub - "jtitle" (Do not edit)

Might also be helpful to list these to help users navigate through journal families - would be excellent if they could be links

parent-policy The Policy ID from SHERPA/RoMEO that is the parent of this policy - "parentid" (Do not edit)

child-policies A list of the Policy IDs from SHERPA/RoMEO that list this Policy ID as its parent (Do not edit)

These fields should probably not be editable, but we should provide a way to flag the entry if the journals listed are not all affected by the same policies
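Since child-policies is just the inverse of parent-policy, it could be generated rather than entered by hand. A minimal sketch, assuming the parent relationships are available as a mapping from policy ID to parent ID (the IDs below are placeholders):

```python
from collections import defaultdict
from typing import Dict, List, Optional


def build_child_policies(parent_of: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Invert a policy -> parent mapping into a policy -> children mapping."""
    children = defaultdict(list)
    for policy_id, parent_id in parent_of.items():
        if parent_id is not None:
            children[parent_id].append(policy_id)

    # Include parents that have no entry of their own in parent_of.
    all_ids = set(parent_of) | {p for p in parent_of.values() if p is not None}
    return {policy_id: sorted(children.get(policy_id, [])) for policy_id in sorted(all_ids)}


# Placeholder policy IDs, for illustration only.
parent_of = {"10": None, "20": "10", "30": "10", "40": "20"}
child_policies = build_child_policies(parent_of)
```

Generating the field this way keeps parent-policy and child-policies consistent, which fits with marking both as "Do not edit".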

Peer review

Open Peer Review

Co-reviewers

Peer review transfer

Peer review credit

Preprints

The following information about preprints may not be found in the standard preprint policy. Therefore I suggest creating a url field for each

dhimmel commented 6 years ago

I'm closing this issue since we now have a schema in policies/schema.yml!

It is still possible to make updates to the schema but it will be more complex. If you would like to change the schema, please open a new issue and we can discuss the best path forward.