Remove unethically trained diffusion models from model hub

lukovnikov commented 1 year ago

@patrickvonplaten @patil-suraj @anton-l I wonder what your thoughts are on the arguments the artists are making on Artstation, Twitter and Instagram right now. The main point is that the artists didn't give permission for their work to be scraped and used to create an AI that will threaten their livelihood, without compensation. I've already seen multiple mentions pop up on social media where Huggingface checkpoints are being linked, which are even being advertised as trained on the work of well-known artists (without their consent obviously).

It would be nice if we and Huggingface take a well-informed position and set an example.

Below are a few links, but it's not hard to find if you just Google for it:

https://www.vice.com/en/article/ake9me/artists-are-revolt-against-ai-art-on-artstation https://www.reddit.com/r/midjourney/comments/zlim7e/artstation_this_evening/

patrickvonplaten commented 1 year ago

It would be nice if you could flag such model repositories. There is a button that says "Report this model" If relevant the Hugging Face Hub will then act upon the report.

lukovnikov commented 1 year ago

so here for example Lois Van Baarle (@loisvb on instagram) claims her images have been used (however she didn't specify which models or datasets but I'm assuming stable diffusion), i'm quoting:

... Because my artwork is included in the datasets used to train these image generators without my consent. I get zero compensation for the use of my art, even though these image generators cost money to use, and are a commercial product...

So should she go ahead and flag stable diffusion models? Do we know what subset of LAION stable diffusion was trained on?

So I went ahead and searched through LAION using their search function and quickly found examples of well-known artists. For example, these were from Lois in LAION-400m and LAION-5B:

lois-laion5b

lois-laion400m (here's her IG post for this one)

While searching for her works, I also found a few works of other artists (Ross Tran and Ilya Kuvshinov), and those are only the ones whose work is so distinct and well-known I could recognize it. There are many other artist's works included there and I doubt any of them was asked for permission. I don't know much about copyright but do you think this is ok?

Cyberes commented 1 year ago

@lukovnikov I find it really absurd that someone would complain about this in a issue on a CLI tool for distributed training. I briefly scrolled through your profile and saw that you were interested in the CelebA-HQ-256 dataset. Well, there's something called the Unauthorized Use of Identity which is the "right control your name, likeness, voice, signature, or other personal identifying traits." It's pretty hypocritical of you to whine about "AI stealing art" when you're doing the exact same thing with people's identities.

What's odd is that you appear have considerable experience with AI technology yet your complaint reads like something an uninformed average Joe would write. Are you going for the woke points? Did you tweet how "woke" you are?

I don't know much about copyright but do you think this is ok?

You can drop the act, we all know the intention of your vapid questions. I vote to close and lock this issue before the shit-flinging begins.

lukovnikov commented 1 year ago

@Cyberes Well, that escalated quickly and I guess I see how the slinging started. I see you are a prolific generative promptist yourself.

I think this issue very much deserves to be here and expected constructive arguments rather than personal attacks. If we can argue why LAION is fair to use, it would be good to have on this project since we're distributing the models.

CelebA is for non-commercial purposes only and I don't know how much the identity theft argument holds while stable diffusion is being re-trained and sold (just this morning @dansuiart (very mild NSFW warning) announced selling his kit for ~50$ for example). Mid-journey charges money for usage. So obviously these scraped images are used to create products for commercial purposes. However, the original artists have never agreed to this and are not being paid for generating training data. At the very very least, the license should not allow commercial uses and even then, I wonder how ok it is. But as I said, I don't know much about copyright, but I completely understand the artists.

And if people are unhappy about CelebA exploiting their faces, I think we should also refrain from using it and create another dataset. For example, FFHQ clearly states that all the included images were originally published under permissible licenses for non-commercial uses.

If you remember, a year ago, Github Copilot was in a similar situation, where there was risk of open-source code being used in commercial projects, which would go against license terms. I also don't know how happy the developers were to have their open-source contributions be monetized upon by Microsoft for commercial projects.

You can drop the act, we all know the intention of your vapid questions. I vote to close and lock this issue before the shit-flinging begins.

You must know more than me, like I said I expect a constructive discussion and change if needed.

pcuenca commented 1 year ago

Let's please try to keep the discussion civil. This is clearly a sensitive topic and we could all benefit from different points of view as long as arguments are presented respectfully.

patrickvonplaten commented 1 year ago

Thanks for the comments. Personally, I believe this question is a more related to a datasets library or directly the model repositories on the Hub where the weights reside.

I would be in favor of closing the issue here since we as maintainers in my opinion cannot and should not take decisions whether model weights should be taken down or not.

lukovnikov commented 1 year ago

@patrickvonplaten good point. Is this a better place: https://github.com/huggingface/huggingface_hub ?

averad commented 1 year ago

@lukovnikov 👋 the models you referenced are not hosted on Github. As such your suggestion of posting your concerns on another Github repo wouldn't align with this situation.

Time and time again people focus on a tool and blame it for problems. In actuality the person who used the tool is most often where the focus needs to be put.

What can I do? 🤷

It would be nice if you could flag such model repositories.
There is a button that says "Report this model"
If relevant the Hugging Face Hub will then act upon the report.

@patrickvonplaten referred you to https://huggingface.co/ where anyone can use the report this model button for each model they are concerned with. Each model on the huggingface hub has it's own discussion area where users can talk with the model creator as well as people that utilize the model.

Note: When you submit a report it creates a public discussion on the models community page and also pings the HF team.

Here is a link to Stable Diffusion v1 to get you started 👍 https://huggingface.co/CompVis/stable-diffusion

Documentation / Additional Info 📜

I would suggest reading over https://huggingface.co/terms-of-service, https://huggingface.co/code-of-conduct and the model(s) license(s) your concerned with.

If your concern is with a Laion image set you will want to read over https://laion.ai/faq/ and then contact the maintainer of the data set.

If your concern is with a specific person, such as dansuiart and not with a model that is hosted on huggingface.co you would want to take up the issue with that specific person or their webstore host Gumroad.

“Our dilemma is that we hate change and love it at the same time; what we really want is for things to remain the same but get better.” - Sydney J. Harris

Food for thought 🤔

A Painter posts a picture of their paintings online for all to see and admire. Another artist loves the paintings and studies them for weeks and creates their own paintings that looks similar in style. Does the painter attack their admirer or do they take this as flattery?

What if the Artist who copied the style of the Painter is discovered (goes viral) and rises to fame but not the original painter, is this unethical of the artist to take but not give to the painter?

Now lets say the Artist who copied the Painter paints their own paintings and then trains an AI model to make more paintings is this ok because the Artist created the images the model is trained on and not the Painter?

At what point is it ok and not ok to learn from another persons work when they share it for the world to see?

We should probably just limit access to pencils, pens, cameras, photocopiers, computers etc. they are just too dangerous to the art community in the Orwellian world we live in. /s

Closing Remarks🕺

AI can not recreate an Artist's brain, as such Artists have an edge over AI. Artists can come up with something new, original that AI might not output in the next million years.

We can't rewind, we've gone too far🎵 Pictures came and broke your heart🎵 https://www.youtube.com/watch?v=W8r-tXRLazs

We haven't lost our Radio Stars, they just have grown as Artists.

smol

julien-c commented 1 year ago

haven't seen this mentioned here BTW but "report this model" works: e.g. see this thread where a model repo was blocked because the artist didn't consent to his work being used to train a model

lukovnikov commented 1 year ago

@averad thank you for the very comprehensive and extensive answer. Maybe it should be extended into a blog post. I definitely don't want to blame the tools and think it's exciting to develop this technology. Most artists don't have an issue with progress although they realize it will become even harder, especially for the smaller ones, to make income, and will overall squeeze their labour market even more for everyone. The issue most artists are having is that their images were trained on without asking for their consent and then used in commercial applications that already are competing or will compete with the artistic community's main sources of income.

Regarding the food for thought, it's a very interesting question but fortunately the issue here doesn't lie that deep, it's about a data grab for applications that clearly don't fall under fair use anymore. Like you implied, generative models will (?) only ever be as good as the data and I doubt it would be as commercially attractive if it wasn't trained on copyrighted material. So I don't understand how stable diffusion models and others are allowed to be commercially used or distributed under such a license.

So to be consistent with the example @julien-c posted, shouldn't we also take down all stable diffusion models trained on LAION(-linked) images (but again, it's not clear what subset it was really trained on, from what I could quickly find). Among many others, Lois Van Baarle clearly stated she did not consent and her images are (linked) in both LAION-500m and LAION-5B.

averad commented 1 year ago

@lukovnikov thank you for sharing your thoughts. Just for clarification:

Have you or the people you are representing:
- Used the "report this model" function on hugging face to voice your concerns?
- Submitted a request to laion.ai to have image urls or clip embeddings removed?
The reason I ask is because if a model has been taken down (Such as SD v1 as its where things started) or if the data was removed from the training set (LAION5b) it would help your case.
What outcome(s) you are asking for?
- Code change?
- Removal of all Models and start over?
- Registration of Models using State or Country Licenses/Passports?
It's really unclear what you want and what your suggested solution is as your statement seems to be all models using LAION5b are illegally trained.

Once the above is clarified the discussion can move to is AI art derivative and transformative works or not. Which is the question of legality of using Art shared in a public space as a reference or as a learning tool (In my own opinion as a lay-person).

Some Resources (US Only):

A Fair(y) Use Tale: https://www.youtube.com/watch?v=CJn_jC4FNDo Professor Eric Faden of Bucknell University provides this humorous, yet informative, review of copyright principles delivered through the words of the very folks we can thank for nearly endless copyright terms.

lukovnikov commented 1 year ago

@averad I'm not really representing anyone but as a scientist, I find this situation wildly unfair. We pay annotators on mechanical turk more than what the artists received for generating the training examples (0$). I don't know much about copyright and current laws but it is very possible the laws aren't ready yet to do the right thing right now.

But some food for thought is this: do you think it's transformative to run a Fourier transform on a bunch of Disney movie frames, adding some noise in frequency domain and then reselling the inverse transforms of a random weighted sum of the transformed frames? How is a trained model fundamentally different from that, considering that during training, we're just learning to reproduce training data?

I reported stabilityai/stable-diffusion-2 on the hub and the only reply I got so far was "and?".

Personally, I think at least any model trained on LAION images should be restricted to research/education-only, if that's possible. But if you think of it as a Fourier transform of copyrighted material, I think it should not exist or be distributed at all. Regarding flagging images, don't you think it's fairer to have an opt-in strategy and train on public domain data (like apparently stabilityAI did for their music generator in order to avoid possible copyright issues)? Creators have copyright on their creations by default, and in that case their works should not be used by default.

averad commented 1 year ago

https://en.wikipedia.org/wiki/Fan_art https://en.wikipedia.org/wiki/Parody

lukovnikov commented 1 year ago

@averad From the fanart page: "The legal status of derivative fan made art in America may be tricky due to the vagaries of the United States Copyright Act. Generally, the right to reproduce and display pieces of artwork is controlled by the original author or artist under 17 U.S.C. § 106. Fan art using settings and characters from a previously created work could be considered a derivative work, which would place control of the copyright with the owner of that original work. Display and distribution of fan art that would be considered a derivative work would be unlawful.

However, American copyright law allows for the production, display and distribution of derivative works if they fall under a fair use exemption, 17 U.S.C. § 107. A court would look at all relevant facts and circumstances to determine whether a particular use qualifies as fair use; a multi-pronged rubric for this decision involves evaluating the amount and substantiality of the original appropriated, the transformative nature of the derivative work, whether the derivative work was done for educational or noncommercial use, and the economic effect that the derivative work imposes on the copyright holder's ability to make and exploit their own derivative works. None of these factors is alone dispositive."

So I don't see how these links add to the discussion.

Edit: the transformativeness of the fan art is a criterium, as well as commercial use.

averad commented 1 year ago

the transformativeness of the fan art is a criterium, as well as commercial use. And what's more interesting and relevant with all the AI apps emerging from stable diffusion and LAION is this: "economic effect that the derivative work imposes on the copyright holder's ability to make and exploit their own derivative works".

What did laion.ai say when you contacted them?

lukovnikov commented 1 year ago

@averad It's not just about LAION, which will argue that they only link elsewhere and don't distribute the images themselves. Of course, they will remove links if people ask them. I think this is the wrong way to go about it but ok. The bigger issue is that these data have already been used for stable diffusion and other models, and these are being distributed and used commercially because OpenRAIL doesn't prohibit that.

averad commented 1 year ago

Sounds like this is a bigger legal issue than a Github repo discussion can hash out. I would advise following the suggestions from 🤗 and submit the requested report this model submissions.

Your voice has been heard, please hear the responses you have received and follow through with what has been requested of you.

I respect your opinion lukovnikov and appreciate the discussion we have had.

lukovnikov commented 1 year ago

I see

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

huggingface / diffusers