Closed MansMeg closed 2 months ago
I think we should also look into generating a DOI for every release (perhaps only for major versions; to be discussed), whether via Zenodo or some other service.
Could you then use the DOI of the most recent version of whatever repo as the ID in your R package, without adding extra fields to the CFF specification?
I think I need something that just says which repo it is, i.e., is it riksdagen-records, riksdagen-interpellations, etc.? I think the DOI is a good idea for major releases anyway, but that will not solve this problem.
You could use:
identifiers:
  - description: "repository"
    type: other
    value: "riksdag-records"
The type "other" is a bit lame, but it would serve the purpose.
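To make the idea concrete, here is a minimal sketch of how a consumer (such as the R package, shown here in Python for illustration) might read the repository name back out of the identifiers list. The function name `find_repository_id` is hypothetical, and the dict is assumed to be the result of loading the CFF file with some YAML parser, which is not shown.

```python
from typing import Optional


def find_repository_id(cff: dict) -> Optional[str]:
    """Return the value of the first identifier of type 'other' whose
    description is 'repository', or None if there is no such entry.

    `cff` is assumed to be the already-parsed content of a CITATION.cff
    file (how it is parsed is outside this sketch).
    """
    for identifier in cff.get("identifiers", []):
        if (
            identifier.get("type") == "other"
            and identifier.get("description") == "repository"
        ):
            return identifier.get("value")
    return None


# Parsed form of the identifiers snippet suggested above.
example = {
    "identifiers": [
        {"description": "repository", "type": "other", "value": "riksdag-records"}
    ]
}
print(find_repository_id(example))
```

Matching on the description string is fragile, since the description is free text; that is part of why the `other` type feels like a workaround.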
Who should the authors of the corpus be? Same as the LREC paper?
I think we should point to the LREC paper both for the records and person datasets.
The question came up this morning about citing the paper versus citing the dataset or software in the repo. After looking at the CFF documentation more carefully, I think the CFF files are meant to cite the data/software itself, and there is not even a key within the schema to cite descriptions of the resource (like our LREC paper). I would make the CFF files following the example below, and when the paper is published, we can archive it and either point to it in the README or include it in a 'publications' or similar zip with dated releases on the landing page.
cff-version: 1.2.0
message: "To cite this repository, please use these metadata."
title: "The Swedish Parliament Corpus: Riksdagen Records"
version: v1.0.0
authors:
  - given-names: Väinö
    family-names: Yrjänäinen
    orcid: "https://orcid.org/1111-2222-3333-4444"
    alias: ninpnin
  - given-names: Fredrik
    family-names: Mohammadi Norén
    orcid: "https://orcid.org/1111-2222-3333-4444"
    alias: fredrik1984
  - family-names: Borges
    given-names: Robert
    orcid: "https://orcid.org/0000-0002-7647-4048"
    alias: BobBorges
  - given-names: Johan
    family-names: Jarlbrink
    orcid: "https://orcid.org/1111-2222-3333-4444"
    alias: JoJanotrealalais
  - given-names: Lotta
    family-names: Åberg Brorsson
    orcid: "https://orcid.org/1111-2222-3333-4444"
    alias: Lottabrorson
  - given-names: Anders P.
    family-names: Olsson
    orcid: "https://orcid.org/1111-2222-3333-4444"
    alias: AnOlnotrealalias
  - given-names: Pelle
    family-names: Snickars
    orcid: "https://orcid.org/1111-2222-3333-4444"
    alias: PeSnnotrealalias
  - given-names: Måns
    family-names: Magnusson
    orcid: "https://orcid.org/1111-2222-3333-4444"
    alias: MansMeg
date-released: 2024-04-17
identifiers:
  - description: This is just the repository basename; we need it for the R package.
    type: other
    value: riksdagen-records
  - description: DOI of each release. We should archive releases with a DOI as part of the workflow.
    type: doi
    value: 10.xxx/zenodo.yyyyy
license: MIT
repository-code: "https://github.com/swerik-project/"
url: "https://github.com/swerik-project/the-swedish-parliament-corpus"
type: dataset
I think this looks great. Since it is data rather than code, maybe we should use CC0 instead of MIT. I think that is what @Lottabrorsson has said before.
@MansMeg We use the Attribution licence https://creativecommons.org/licenses/by/4.0/
In other news, the licence tag is missing in prot-ek.xml (compare with prot-fk.xml)
OK, so CC BY 4.0 then? I just grabbed the license off the westac page -- CC is better though.
@BobBorges @MansMeg We have CC0 for the parliamentary documents which we made available via KB. But maybe we have to review it.
If I understand correctly, the main difference between CC0 and CC BY is that CC BY requires attribution when people use or reuse the corpus. I think CC BY is becoming fairly standard in academia. Personally, I like CC BY-SA, which is like CC BY except that any reuse of the material must be released under the same, or at least not more restrictive, license. @fredrik1984 could you check whether anything is specified about this in the funding agreement? The discussion here might be moot if the license is already decided for us.
In the application we say that we will give the corpus a "CC BY-license". However, this can be adjusted if needed.
I think we should use CC0, as @Lottabrorsson says. As you say, Bob, CC BY is common in academia, but I don't think we can or should require attribution. Most academics will cite us anyway. Making it more free should not be a problem.
Sounds good to me!
@MansMeg @BobBorges @fredrik1984 We also have the Public domain option. I would like to get back to you about this.
Great! I would like to make it as free as possible.
We should add a citation file to all our data repositories. This will fulfill two purposes:
Here is more information: https://citation-file-format.github.io/
I'm thinking we should require (using a unit test) that every repo has an identifier for the repo, as:
Then we can all use this ID in the CFF to check the repository. E.g., I need to use this in the R package. Even better would be something like:
but that is not supported by the CFF format.
Any thoughts on this @ninpnin or @BobBorges ?
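As a rough illustration of the unit-test idea, the sketch below checks that a parsed CFF file carries exactly one identifier of type `other` whose value equals the repository basename, and that every identifier uses a type allowed by CFF 1.2.0 (`doi`, `url`, `swh`, `other`). The function name `check_repo_identifier` and the sample dicts are hypothetical, and loading the file with a YAML parser is assumed to happen elsewhere.

```python
from typing import List

# Identifier types permitted by the CFF 1.2.0 schema.
ALLOWED_IDENTIFIER_TYPES = {"doi", "url", "swh", "other"}


def check_repo_identifier(cff: dict, repo_basename: str) -> List[str]:
    """Return a list of problems found; an empty list means the check passed.

    `cff` is assumed to be the already-parsed CITATION.cff content.
    """
    problems = []
    identifiers = cff.get("identifiers") or []

    # Every identifier must use a type the schema knows about.
    for entry in identifiers:
        if entry.get("type") not in ALLOWED_IDENTIFIER_TYPES:
            problems.append(f"unsupported identifier type: {entry.get('type')!r}")

    # Exactly one 'other' identifier should name this repository.
    matches = [
        e for e in identifiers
        if e.get("type") == "other" and e.get("value") == repo_basename
    ]
    if len(matches) != 1:
        problems.append(
            f"expected exactly one 'other' identifier with value "
            f"{repo_basename!r}, found {len(matches)}"
        )
    return problems


# Hypothetical parsed CFF content for illustration.
good = {"identifiers": [{"type": "other", "value": "riksdagen-records"}]}
bad = {"identifiers": [{"type": "DOI", "value": "10.xxx/zenodo.yyyyy"}]}

print(check_repo_identifier(good, "riksdagen-records"))
print(check_repo_identifier(bad, "riksdagen-records"))
```

In a test suite, each repository could assert that the returned list is empty for its own basename; matching on the value rather than the free-text description keeps the check independent of how the description is worded.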