swerik-project / the-swedish-parliament-corpus

A repository for managing public, versioned releases of the Swedish Parliament Corpus.

Add CFF files in all API repositories #1

Closed: MansMeg closed this issue 2 months ago

MansMeg commented 3 months ago

We should add a citation file to all our data repositories. This will fulfill two purposes:

  1. It will simplify citations for users (so they can quickly get one BibTeX entry per repo they use)
  2. It can work as a tom.xml file for the repository, i.e. the R package can check which repo it is by looking at the CFF file.

Here is more information: https://citation-file-format.github.io/

I'm thinking we should require (using a unit test) that every repo has an identifier for the repo, like this:

identifiers:
  - description: "The main repository URL"
    type: url
    value: "https://github.com/swerik-project/riksdag-records"

Then, we can all use this id in the CFF to check the repository. E.g. I need to use this in the R package. Even better would be something like:

identifiers:
  - description: "repository"
    type: id
    value: "riksdag-records"

but that is not supported by the CFF format.
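
A minimal sketch of what such a unit test could check, assuming the CFF file has already been parsed into a dict (for instance with a YAML parser, which is not imported here). The function names and the `ORG_PREFIX` constant are hypothetical, introduced only for illustration; the prefix is taken from the example URL above.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumed organisation prefix, copied from the example URL in this thread.
ORG_PREFIX = "https://github.com/swerik-project/"


def repository_url(cff: dict) -> Optional[str]:
    """Return the first `url` identifier that points into the organisation."""
    for identifier in cff.get("identifiers", []):
        value = str(identifier.get("value", ""))
        if identifier.get("type") == "url" and value.startswith(ORG_PREFIX):
            return value
    return None


def repository_name(cff: dict) -> Optional[str]:
    """Derive the repository basename from the URL identifier, if present."""
    url = repository_url(cff)
    if url is None:
        return None
    # Expect exactly two path segments: organisation and repository.
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[1] if len(segments) == 2 else None


# Stand-in for a parsed CITATION.cff; a real test would load the file.
example = {
    "identifiers": [
        {
            "description": "The main repository URL",
            "type": "url",
            "value": "https://github.com/swerik-project/riksdag-records",
        }
    ]
}
```

A test in each repo could then assert that `repository_name(...)` is not `None` and matches the expected repo.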

Any thoughts on this @ninpnin or @BobBorges ?

BobBorges commented 3 months ago

I think we should also look into generating a DOI for every release (perhaps only for major versions, to be discussed), whether via Zenodo or some other service.

Could you then use the DOI of the most recent version of whatever repo as the ID in your R package, without adding extra fields to the .cff specification?

MansMeg commented 3 months ago

I think I need something that just says which repo it is, i.e. is it riksdagen-records, riksdagen-interpellations, etc.? I think the DOI is a good idea for major releases anyway, but that will not solve this problem.

BobBorges commented 3 months ago

You could use:

identifiers:
  - description: "repository"
    type: other
    value: "riksdag-records"

The "other" type is a bit lame, but it would serve the purpose.
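
A rough sketch of how a client could read that identifier back, assuming the CFF file is already parsed into a dict. Matching on the description "repository" is an assumed convention based on the snippet above, not something the CFF format defines, and `repo_id` is a hypothetical name.

```python
from typing import Optional


def repo_id(cff: dict) -> Optional[str]:
    """Return the value of the `other` identifier described as "repository"."""
    for identifier in cff.get("identifiers", []):
        # Assumed convention: type "other" plus the exact description "repository".
        if (
            identifier.get("type") == "other"
            and identifier.get("description") == "repository"
        ):
            return identifier.get("value")
    return None


# Stand-in for a parsed CITATION.cff.
example = {
    "identifiers": [
        {"description": "repository", "type": "other", "value": "riksdag-records"}
    ]
}
```

Because several `other` identifiers may coexist in one file, the description is what distinguishes this one, so it would need to stay identical across repos.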

BobBorges commented 3 months ago

Who should the authors of the corpus be? Same as the LREC paper?

MansMeg commented 2 months ago

I think we should point to the LREC paper both for the records and person datasets.

BobBorges commented 2 months ago

The question came up this morning about citing the paper vs. citing the dataset or software in the repo. After looking at the CFF documentation more carefully, I think the CFF files are meant to cite the data/software, and actually there is not even a key within the schema to cite descriptions of the resource (like our LREC paper). I would make the CFF files following the example below. When the paper is published, we can archive it and either point to it in the readme or include it in a 'publications' or similar zip with dated releases on the landing page.

cff-version: 1.2.0
message: To cite this repository, please use these metadata.
title: The Swedish Parliament Corpus: Riksdagen Records
version: v1.0.0
authors:
  - given-names: Väinö 
    family-names: Yrjänäinen
    orcid: "https://orcid.org/1111-2222-3333-4444"
    alias: ninpnin
  - given-names: Fredrik 
    family-names: Mohammadi Norén
    orcid: "https://orcid.org/1111-2222-3333-4444"
    alias: fredrik1984
  - family-names: Borges
    given-names: Robert
    orcid: "https://orcid.org/0000-0002-7647-4048"
    alias: BobBorges
  - given-names: Johan 
    family-names: Jarlbrink
    orcid: "https://orcid.org/1111-2222-3333-4444"
    alias: JoJanotrealalais
  - given-names: Lotta 
    family-names: Åberg Brorsson
    orcid: "https://orcid.org/1111-2222-3333-4444"
    alias: Lottabrorsson
  - given-names: Anders P. 
    family-names: Olsson
    orcid: "https://orcid.org/1111-2222-3333-4444"
    alias: AnOlnotrealalias
  - given-names: Pelle 
    family-names: Snickars
    orcid: "https://orcid.org/1111-2222-3333-4444"
    alias: PeSnnotrealalias
  - given-names: Måns 
    family-names: Magnusson
    orcid: "https://orcid.org/1111-2222-3333-4444"
    alias: MansMeg
date-released: 2024-04-17
identifiers:
  - description: This is just the repository basename; we need it for the R package.
    type: other
    value: riksdagen-records
  - description: DOI of each release. We should archive releases with a doi as part of the workflow.
    type: doi
    value: 01.xxx/zenodo.yyyyy
license: MIT
repository-code: "https://github.com/swerik-project/"
url: "https://github.com/swerik-project/the-swedish-parliament-corpus"
type: dataset
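
A file like this could be sanity-checked in a unit test. The sketch below is not a full schema validation; it only checks the four keys that CFF 1.2.0 requires and that each identifier uses one of the four lowercase types the schema accepts. It assumes the file is already parsed into a dict, and `cff_problems` is a hypothetical helper name.

```python
from typing import List

# Keys that CFF 1.2.0 requires in every citation file.
REQUIRED_KEYS = ("cff-version", "message", "title", "authors")
# Identifier types that CFF 1.2.0 accepts (lowercase, compared case-sensitively).
IDENTIFIER_TYPES = ("doi", "url", "swh", "other")


def cff_problems(cff: dict) -> List[str]:
    """Collect readable problems; an empty list means these checks passed."""
    problems = [
        f"missing required key: {key}" for key in REQUIRED_KEYS if key not in cff
    ]
    for index, identifier in enumerate(cff.get("identifiers", [])):
        kind = identifier.get("type")
        if kind not in IDENTIFIER_TYPES:
            problems.append(f"identifier {index} has unsupported type: {kind!r}")
    return problems
```

For example, an identifier with `type: id` would be reported as unsupported, while `type: other` would pass.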

MansMeg commented 2 months ago

I think this looks great. Since it is data rather than code, maybe we should use CC0 instead of MIT. I think that is what @Lottabrorsson has said before.

ninpnin commented 2 months ago

@MansMeg We use the attribution licence (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/

In other news, the licence tag is missing in prot-ek.xml (compare with prot-fk.xml)

BobBorges commented 2 months ago

OK, so CC BY 4.0 then? I just grabbed the license off the westac page; CC is better though.

Lottabrorsson commented 2 months ago

@BobBorges @MansMeg We have CC0 for the parliamentary documents which we made available via KB. But maybe we have to review it.

BobBorges commented 2 months ago

If I understand correctly, the main difference between CC0 and CC BY is that CC BY requires attribution when people use or reuse the corpus. I think CC BY is becoming fairly standard in academia. Personally, I like CC BY-SA, which is like CC BY except that any reuse of the material falling under the license must be released under the same, or at least not more restrictive, license. @fredrik1984 could you check if anything is specified about this in the funding agreement? The discussion here might be moot if the license is already decided for us.

fredrik1984 commented 2 months ago

In the application we say that we will give the corpus a "CC BY-license". However, this can be adjusted if needed.

MansMeg commented 2 months ago

I think we should use CC0, as @Lottabrorsson says. As you say, Bob, CC BY is common in academia, but I don't think we can or should require attribution. Most academics will cite us anyway. Making it more free should not be a problem.

fredrik1984 commented 2 months ago

Sounds good to me!

Lottabrorsson commented 2 months ago

@MansMeg @BobBorges @fredrik1984 We also have the Public domain option. I would like to get back to you about this.

MansMeg commented 2 months ago

Great! I would like to go as free as possible.