tdwg / camtrap-dp

Camera Trap Data Package (Camtrap DP)
https://camtrap-dp.tdwg.org
MIT License

Image license #189

Closed peterdesmet closed 2 years ago

peterdesmet commented 2 years ago

I think Camtrap DP should allow publishers to indicate the license of the images. It can be different from the license of the data. In general, Camtrap DP doesn't make many assumptions about the images (size, whether they are accessible, etc.), but most of those properties can be derived (by machines) by following the path or URL. That is not the case with the license, which is why I think it would be good to have it as a term in media.csv.

This issue is not about what license(s) should be applied to media files, only that it should be possible to indicate it.

Field properties

Proposal

{
      "name": "license",
      "type": "string",
      "format": "uri",
      "description": "URL of the license under which the media file is provided. The rights holder can be indicated in `rightsHolder` in the data package metadata.",
      "example": "https://creativecommons.org/publicdomain/zero/1.0/",
      "constraints": {
        "required": false
      }
}
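As a rough illustration of how a consumer might check this proposed field, the sketch below reads a media.csv-like table and flags `license` values that are not absolute URLs. The `license` column name comes from the proposal above; the `mediaID` column, the sample rows, and the helper names `is_absolute_url` and `invalid_licenses` are assumptions made for illustration only.

```python
import csv
import io
from urllib.parse import urlparse


def is_absolute_url(value: str) -> bool:
    """Return True if value parses as an http(s) URL with a host."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def invalid_licenses(rows):
    """Yield (row_number, value) for non-empty license values that are not URLs.

    Empty values are skipped because the proposed field is not required.
    """
    for number, row in enumerate(rows, start=1):
        value = (row.get("license") or "").strip()
        if value and not is_absolute_url(value):
            yield number, value


# Hypothetical sample data; the mediaID values are made up.
SAMPLE = """mediaID,license
m1,https://creativecommons.org/publicdomain/zero/1.0/
m2,
m3,CC0
"""

problems = list(invalid_licenses(csv.DictReader(io.StringIO(SAMPLE))))
print(problems)  # [(3, 'CC0')]
```

A check like this only confirms that the value looks like a URL; it does not confirm that the URL points to a recognized license.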

Size increase

This proposal would add a license value to every record in the media.csv, which increases file size. However:

Credit

Ideally, we should also indicate how the image should be credited (if required by the license). Adding that information for every image might be overkill, so I suggest that the definition refer to the rightsHolder. That definition could be extended to ... owning or managing rights over this data package and associated media files.

https://github.com/tdwg/camtrap-dp/blob/293ec5dcbf27f7605138a1585272229c06b3bd68/camtrap-dp-profile.json#L80-L84

peterdesmet commented 2 years ago

@ben-norton @kbubnicki others, thoughts?

ben-norton commented 2 years ago

@peterdesmet @kbubnicki

  1. The images and data should always be licensed separately. A provider can use the same license for both, but separate licensing should be required.
  2. In most cases, I support providing general flexibility in areas where it doesn't compromise data integrity. This is different. We should not provide the flexibility to license every image differently in a data package.
    A. Most data package providers will not license images individually. It's time-consuming and usually unnecessary. This can be a problem for images that contain an entity (such as a person) that is subject to additional restrictions. However, that situation shouldn't be handled by a separate license; a data policy or withholding the image are better strategies.
    B. Allowing the license field to vary for each record in the media.csv file will increase the likelihood of error during data entry. Since the field will most likely not vary, the possible rewards are minimal and only the risk remains.
    C. Reuse becomes very cumbersome if images are licensed individually. A user would need to parse every single row in a media.csv file to view the license information before reusing the data.
    D. If we stipulate that all images in a data package must adhere to a single license, then there is no need to annotate them individually. Instead, the license field should be moved to the data package JSON file.

The Data Package documentation allows for an array called licenses, in which each object must contain a name and/or URL with an optional title. These are insufficient for a dual-licensing model, where one license applies to the metadata and the other applies to media. Fortunately, the licenses property is an array, which can accommodate multiple licenses. The only missing piece is a scope with a controlled vocabulary (metadata, media). If the scope field can be added to objects in the licenses array, then the problem is solved. It might look like the following:

"licenses": [{
  "scope": "media",
  "name": "ODC-PDDL-1.0",
  "path": "http://opendatacommons.org/licenses/pddl/",
  "title": "Open Data Commons Public Domain Dedication and License v1.0"
},
{
  "scope": "metadata",
  "name": "CC0 1.0",
  "path": "https://creativecommons.org/publicdomain/zero/1.0/",
  "title": "CC0 1.0 Universal Public Domain Dedication"
}
]

In terms of business rules, I suggest one of the following two options.

Option 1.

  1. The licenses array is required.
  2. The licenses array must contain at least one license object. If only one object is provided, that license applies to both media and metadata, even if a scope is included. Example: A data package contains a single license object scoped to media that looks like the following:
    "licenses": [{
      "scope": "media",
      "name": "ODC-PDDL-1.0",
      "path": "http://opendatacommons.org/licenses/pddl/",
      "title": "Open Data Commons Public Domain Dedication and License v1.0"
    }],

    Since a second license object was not provided for the metadata, the object above applies to both, even though the scope is listed as media.

  3. A scope is required if two objects are provided in the licenses array (otherwise, it's impossible to know which one applies to which content.)
  4. Rules governing the presence of path, name, and title follow the existing Frictionless documentation (https://specs.frictionlessdata.io/data-package/#metadata).

Option 2.

  1. The licenses array is optional. If it is not provided, then both metadata and media are distributed under a single default license. Otherwise, numbers 2, 3, and 4 in Option 1 apply.

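The business rules above can be sketched as a small lookup function. This is only one possible reading of the rules, not an agreed implementation: the function name `resolve_license`, the `default` parameter (used to model Option 2, whose default license is not specified in this thread), and the error messages are all assumptions.

```python
ALLOWED_SCOPES = ("media", "metadata")


def resolve_license(descriptor, scope, default=None):
    """Return the license object that applies to `scope`.

    default=None models Option 1 (licenses array required).
    Passing a license object as `default` models Option 2 (array optional).
    """
    if scope not in ALLOWED_SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")
    licenses = descriptor.get("licenses")
    if not licenses:
        if default is None:
            raise ValueError("licenses array is required and must not be empty")
        return default
    if len(licenses) == 1:
        # Rule 2: a single license covers both scopes, whatever its own scope says.
        return licenses[0]
    # Rule 3: with several licenses, each one must declare a valid scope.
    for lic in licenses:
        if lic.get("scope") not in ALLOWED_SCOPES:
            raise ValueError("every license needs a scope when several are given")
    matches = [lic for lic in licenses if lic["scope"] == scope]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one license scoped to {scope!r}")
    return matches[0]


dual = {"licenses": [
    {"scope": "media", "name": "ODC-PDDL-1.0"},
    {"scope": "metadata", "name": "CC0 1.0"},
]}
single = {"licenses": [{"scope": "media", "name": "ODC-PDDL-1.0"}]}

print(resolve_license(dual, "metadata")["name"])    # CC0 1.0
print(resolve_license(single, "metadata")["name"])  # ODC-PDDL-1.0
```

The second call shows the consequence of rule 2: a lone license scoped to media is still returned for metadata.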
PietrH commented 2 years ago

I would like to weigh in: mixed-license datasets are an annoying reality that seriously hinders reuse, and should be avoided as much as possible. However, while this is currently not the case for our camera trap images (CC-0, data is CC-BY-SA), I can certainly imagine deployments where a partner would like their images included in our dataset (and identified in our pipeline) but would not be able to give a license waiver like CC-0.

So, while I think it's unlikely to be a big problem for single deployment datasets, I think it will be when it comes to exchanging multi year/deployment datasets that had multiple collaborators, and perhaps different iterations of data management planning.

So, I would suggest a solution similar to DwC-A, having a license per record, and allowing (but discouraging) mixed licenses in a comment. Aligning with the Multimedia extension in this aspect is also user friendly, and might perhaps avoid some confusion.

peterdesmet commented 2 years ago

@ben-norton I like that suggestion! I'd make some minor changes to the business rules you list: I'll see how that can be implemented.

@PietrH I understand your concern, which is why I originally suggested a license per image. However:

> when it comes to exchanging multi year/deployment datasets that had multiple collaborators

I think that is something they could reasonably agree upon. The scope of a Camtrap DP dataset is a study, so there is generally a single person (e.g. PI) that can make final decisions.

> Aligning with the Multimedia extension in this aspect is also user friendly, and might perhaps avoid some confusion.

A translation to Darwin Core or the Multimedia extension will have to take the properties of the datapackage.json profile into account anyway (e.g. for dataset name), so it's quite easy to assign the single media license to every image when transforming to DwC.
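A minimal sketch of that transformation step, assuming a package-level licenses array with a `scope` property as discussed in this thread. The function names, the `filePath` input column, the `path` key holding the license URL, and the output keys `identifier` and `license` are illustrative assumptions, not a defined mapping.

```python
def media_license_url(descriptor):
    """Return the URL of the license scoped to media, or None if absent."""
    for lic in descriptor.get("licenses", []):
        if lic.get("scope") == "media":
            return lic.get("path")
    return None


def to_multimedia_records(descriptor, media_rows):
    """Build one output record per media row, each carrying the media license."""
    url = media_license_url(descriptor)
    return [
        {"identifier": row["filePath"], "license": url}
        for row in media_rows
    ]


# Hypothetical package and media rows; the example.org URLs are placeholders.
package = {
    "licenses": [
        {"name": "CC0-1.0", "scope": "data"},
        {
            "name": "CC-BY-4.0",
            "path": "https://creativecommons.org/licenses/by/4.0/",
            "scope": "media",
        },
    ]
}
media_rows = [
    {"filePath": "https://example.org/img/1.jpg"},
    {"filePath": "https://example.org/img/2.jpg"},
]

records = to_multimedia_records(package, media_rows)
print(records[0]["license"])  # https://creativecommons.org/licenses/by/4.0/
```

Because the license is looked up once per package, the per-record output needs no license column in media.csv.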

peterdesmet commented 2 years ago

I asked the Frictionless Community regarding this approach on their Discord. This use case hasn't been encountered before, but the approach suggested by @ben-norton sounds reasonable. Here's a copy/paste of that discussion:

1 @peterdesmet

Hi all, for our frictionless camera trap data, we want data publishers to be able to indicate the license of the CSV data and the license of the image files referenced in media.csv. A datapackage license allows multiple licenses https://specs.frictionlessdata.io/data-package/#licenses We would like to build upon that to indicate scope:

"licenses": [{
  "name": "CC0-1.0",
  "scope": "data" <- License applies to the data in the package
},
{
  "name": "CC-BY-4.0",
  "scope": "media" <- License applies to the referenced media files
}]

Is that a good approach? Has anyone else encountered a similar use case? Suggestions? Note that a resource can have its own license property, but that still applies to the CSV data itself, not the referenced images.

2 @augusto-herrmann

Hi @peterdesmet. Interesting question. If I understand correctly, the referenced media files are just included as links to another URL, and the media files themselves are not included in the data package and are hosted elsewhere, right? In that case, wouldn't it be the responsibility of the server which serves the media files to declare the license, instead of the data package that merely links to it?

3 @peterdesmet

That is correct, although the image files could be included as part of a data package, but Camtrap DP doesn't make any assumptions regarding that.

It could indeed be seen as the responsibility of the server, but a) the URLs hotlink the images themselves (easier to consume), so it would have to be embedded in the exif metadata, b) having to assess the license per image is a burden to the user, and c) many servers might not provide that functionality. It would therefore still be useful if the data producer can indicate that at package level.

4 @augusto-herrmann

Well, in that case, each link to each image could also be included as a resource in the data package, and their specific licenses could be indicated at the resource level. Would that be a good solution for your use case?

5 @peterdesmet

Thanks for the suggestion, but that would seriously bloat the package: some contain over 1 million images. It would also be difficult to consume, since every resource would need its own unique name, which a user doesn't necessarily know.

A more straightforward solution would be to indicate the license for every record in the media.csv, which was my initial suggestion in an issue discussing this https://github.com/tdwg/camtrap-dp/issues/189, but in reality, all images within a package are very likely to have the same license.

Someone suggested using the licenses array, which looks like an elegant approach, but I wanted to check here whether that solution makes sense. 🙂

6 @augusto-herrmann

Well, it does make sense, except that

  1. Apparently the "scope" attribute for licenses is not yet in the specs. https://specs.frictionlessdata.io/data-package/#licenses Are you proposing we include it?
  2. In some cases, there may be different groups of images with different licenses. Then grouping by "scope": "media" would not be enough.

7 @peterdesmet

  1. I'm not sure it is general enough to suggest in the specs. Maybe. I currently see it more as a custom attribute to differentiate multiple licenses. I always found it odd that one can indicate multiple licenses for a package, but here it comes in useful.
  2. Correct, I consider that the main downside, but for our use case likely sufficient.

8 @lwinfree

I like the license array approach, but I don't know of other use cases. I agree with you that it is better to include the license info for the images in the DP as opposed to leaving it as the server's responsibility. ("having to assess the license per image is a burden to the user" --> this is a real problem for research data, so I totally agree with you here)

9 @peterdesmet

Thanks, I think we will use that approach then, and add a property scope to the license. Although no software would currently be able to derive meaning from that, users reasonably might. And software would still be aware that two licenses are at play.

augusto-herrmann commented 2 years ago

Good discussion, @peterdesmet!

If you're going to use the non-standard "scope" attribute, I would suggest that you document what it means, what the allowed values are, how one determines which resources they apply to, etc.

peterdesmet commented 2 years ago

@augusto-herrmann exactly! 👌

peterdesmet commented 2 years ago

Fixed in https://github.com/tdwg/camtrap-dp/pull/193

augusto-herrmann commented 2 years ago

This may solve the problem for camtrap-dp, but if you are going to make the scope attribute able to generalize to any data package, you realize that this is probably going to need to change, right? And that these changes might not be compatible with the current solution. I.e., specify which media types are selected by the scope when the "media" value is used.

I might even suggest using "linked_media", because the media is not even included in the data package, just linked there.

peterdesmet commented 2 years ago

> attribute able to generalize to any data package, you realize that this is probably going to need to change, right?

Yes, I might do a PR to the frictionless specs to support that.

Regarding linked_media: that is a good suggestion, might write it as linked media to be consistent with other vocabs we have.