google-research-datasets / conceptual-captions

Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine learned image captioning systems.

Ethics of this research set #20

Open robrwo opened 4 months ago

An organisation that I work for has been having problems with robots requesting images from its website for AI training. We managed to contact one of the people operating the robots, who said they were using this dataset and claimed that, because https://github.com/google-research-datasets/conceptual-captions/blob/master/LICENSE says "The dataset may be freely used for any purpose", they had the right to use these images.

The problem is that you are publishing a dataset of non-Google URLs:

  1. Google has no control over the hosted images, and they may be changed, removed, or blocked, e.g. #17.

  2. Google is not paying the hosting costs of these images. Organisations have to pay for bandwidth, CPU time, or even the number of requests.

    So every time a user of this dataset requests an image, somebody else pays for it. (This is an incentive to block or remove the images; see no. 1.)

  3. These images were added without the consent of the organisations, which have to pay the hosting costs (see no. 2).

  4. The images were added without the consent of the copyright holders (who may be different from the server hosts).

  5. This dataset was created before 2018, before concerns about the use of images for AI training were common, and before protocols to disallow use of web-hosted media for machine learning existed.

  6. Many of the image URLs are hosted by stock photo agencies, and the images may not be licensed for machine-learning use. The agencies may also regard the captions (which require human effort to write) as part of their intellectual property.

  7. Many of the images are on news websites and were licensed from stock photo agencies, so they may not be licensed for machine-learning use.

  8. Many of the photos are hosted outside of the USA, by organisations which are not based in the USA, so US "fair use" copyright exceptions do not apply.

It would have been ethical for Google to license copies of the images, and then host them as part of the dataset (but still publish the URLs where they originally came from).