License-able or Open Datasets

martinheidegger commented 3 years ago

If we want to use the project in a commercial context we need to build the model based on licensed or open training data.

@fyr91 do you know if the Glint360K model is licensable? Do you have experience with how much it would cost to license this or a similar data-set? Or maybe a means to figure that out?

fyr91 commented 3 years ago

Just checked the Glint360K data description. Models trained with this dataset are not allowed for commercial use. Can refer to link. I don't have any knowledge of how much it could cost. For our previous projects, we used a model trained on a smaller dataset that is allowed for commercial activity, and we also collected our own training dataset from authorized patient data, as we are working with hospitals. I could check other data sources to see which we can use for commercial purposes.

martinheidegger commented 3 years ago

Can you share a link to those datasets? We may need it for comparison at some point.

spwilko commented 2 years ago

As fyr91 has said, the dataset is not licensable. I've emailed the lead contributor to ask how we can license the model.

urbien commented 2 years ago

He said it differently: he said it can't be used for commercial needs without a license. Who did you email, and when?

spwilko commented 2 years ago

I asked Jia Guo today: "We're interested in using the model commercially."

Jia responded: "Due to dataset license, all pre-trained models from public datasets such as MS1M, Glint360K, WebFace42M are not available for commercial use. But we have commercial solutions, for the whole pipeline of face recognition by using our own private datasets. We can have a further discussion if you're interested."

martinheidegger commented 2 years ago

Just had a conversation with @spwilko about this, and I have not been clear enough about what we need. We do not need a license for the pre-trained model.

We need to be able to provide a model to our customers (who will use it for commercial purposes). We can train that model ourselves, but the data-set for this model needs to allow commercial use by our customers. That gives us the following options:

Ask the Glint360K provider ...

... if our customers can get a commercial license for the data-set so that they can create the model (with our help). We'd need a fixed quote for that.
... if we can get a reseller license for the Glint360K data-set.
... if they could publish it under a license that allows commercial use (one can dream).

Or we find a different data-set equivalent to Glint360K (360k identities / 17M images) that fulfills one of the criteria above.
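
A minimal sketch of how the criteria above could be checked against a list of candidate data-sets. The `Dataset` fields, the `min_ratio` threshold for "equivalent", and the two `hypothetical-*` entries are assumptions for illustration; only the Glint360K figures (360k identities, 17M images, no commercial use) come from this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    identities: int
    images: int
    commercial_use: bool  # may our customers train a commercial model on it?


# Reference scale from this thread: Glint360K has ~360k identities / ~17M images.
GLINT360K = Dataset("Glint360K", 360_000, 17_000_000, commercial_use=False)


def is_candidate(ds, reference=GLINT360K, min_ratio=0.5):
    """True if `ds` allows commercial use and is at least `min_ratio` times
    the size of `reference`. `min_ratio` is an arbitrary, hypothetical cutoff."""
    return (
        ds.commercial_use
        and ds.identities >= min_ratio * reference.identities
        and ds.images >= min_ratio * reference.images
    )


# Hypothetical entries, for illustration only (not real data-sets).
candidates = [
    GLINT360K,
    Dataset("hypothetical-open-set", 400_000, 20_000_000, commercial_use=True),
    Dataset("hypothetical-small-set", 10_000, 70_000, commercial_use=True),
]
usable = [ds.name for ds in candidates if is_candidate(ds)]
print(usable)
```

With real entries filled in, the same check would make it easy to compare candidates as we learn their license terms.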

martinheidegger commented 2 years ago

@spwilko pointed out correctly in an off-repo conversation that Glint360K is based on CASIA, Celeb-500k and MS-Celeb-1M, and asked if this means we could use these datasets instead of Glint360K. Answering this has been a bit more insightful than expected, which is why I am elaborating here.

First, no: it is important to understand that selecting and cleaning a data-set is the valuable work. The act of combining and cleaning is very laborious. The value of the Glint360K data-set derives from how well it combines other data-sets.

Nevertheless, the thought is interesting: could we use the same sources and derive the data-set from them?

Celeb-500k - the problem with this is that the data is not easily downloadable and no license is described in the metadata. I could find the paper only behind a paywall. I am wondering what license it is.
MS-Celeb-1M - this data-set was retracted by Microsoft due to an article in the Financial Times which links the usage of the data to tracking in China. Furthermore, the MS-Celeb-1M dataset consists of copyrighted material.

The article sparked the creation of exposing.ai - an analytical site examining different data-sets. Interestingly Glint360k is not mentioned.

This brought me to a very interesting article by Creative Commons on licensing in AI. Highly recommended reading: "Should CC licensed content be used to train AI"

urbien commented 2 years ago

Thanks, good info! Note that there is an even bigger dataset, WebFace260M, which was used to train one of the best-performing algorithms in the NIST test; see the links below:
https://ai-scholar.tech/en/articles/face-recognition/webface260m
https://www.face-benchmark.org/download.html

This dataset is also for academic use only. So @spwilko, you need to inquire about licensing the dataset (not the model trained on it) with the authors of the paper.

urbien commented 2 years ago

@martinheidegger, exposing.ai lists MS-Celeb-1M with the following license: Open Data Commons Public Domain Dedication (PDDL)

martinheidegger commented 2 years ago

@urbien this is where a good part of the controversy, and probably Microsoft's decision to retract it, comes from.

Despite several erroneous reports mentioning the MS-Celeb images were derived from Creative Commons licensed media, the MS Celeb images were in fact obtained from web search engines. The authors mention they were obtained by "retriev[ing] approximately 100 images per celebrity from popular search engines"[^msceleb_orig]. Many, if not the vast majority, are copyrighted images.

On another topic: I found ffhq-dataset, which may be a good alternative source.

urbien commented 2 years ago

ffhq-dataset seems good but very small. Look at my comment above on WebFace260M and the impressive results achieved with that dataset (3rd place in NIST).

martinheidegger commented 2 years ago

Looked further into WebFace260M:

To build the WebFace260M, we first obtain a list of 4M people from MS1M (composed of Freebase) and IMDB. We then used Google image search to retrieve the images.

Looks like they didn't consider licensing in the creation of WebFace260M. IMDb's licensing of its data is very restrictive. Google image search does return CC-licensed images, but since that is not specifically mentioned, my guess is that other images were used, as CC images are rather rare.

martinheidegger commented 2 years ago

Following #11's latest comment by @fyr91, I looked into the Wider Dataset, which is used for general face detection. I downloaded it (it is much smaller: 3.6GB) and it contains 16000 photos in 61 categories, which seem to be mostly under a closed license (no commercial use). In those 16000 photos, every face is marked with a bounding box, a blur amount and the points of the eyes, nose and mouth corners. The dataset is pretty diverse across several dimensions: I've seen wedding pictures, music events, DVD covers, renaissance paintings, stock photos, etc.

We need ~~probably~~ an alternative for that as well.
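
For anyone comparing alternatives, the per-face annotations described above (bounding box, blur amount, five landmark points) could be modelled roughly like this. The field names and the flat, whitespace-separated line layout are assumptions for illustration, not the dataset's documented file format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Landmark order is an assumption for this sketch.
LANDMARK_NAMES = ("left_eye", "right_eye", "nose", "mouth_left", "mouth_right")


@dataclass
class FaceAnnotation:
    box: Tuple[float, ...]  # x, y, width, height
    blur: float
    landmarks: List[Tuple[float, float]]  # (x, y) pairs in LANDMARK_NAMES order


def parse_face(line: str) -> FaceAnnotation:
    """Parse one hypothetical line: `x y w h blur` followed by five (x, y) pairs."""
    values = [float(v) for v in line.split()]
    expected = 5 + 2 * len(LANDMARK_NAMES)
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    points = values[5:]
    landmarks = list(zip(points[0::2], points[1::2]))
    return FaceAnnotation(box=tuple(values[:4]), blur=values[4], landmarks=landmarks)


face = parse_face("10 20 100 120 0.3 40 60 80 60 60 85 45 110 75 110")
print(face.box, face.blur, dict(zip(LANDMARK_NAMES, face.landmarks))["nose"])
```

Any replacement dataset would need to provide at least these fields per face for the detection training to carry over.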

tradle / KYCDeepFace
