huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

"Safety Module" will be used to *create* NSFW images #229

Closed · rolux closed this issue 2 years ago

rolux commented 2 years ago

Describe the bug

You have added a "Safety Module" that returns black images for NSFW content.

https://github.com/huggingface/diffusers/commit/65ea7d6b628d9b4c0fedf59bb59e7e57c88a44ff

It is obvious that this classifier will also be used for the opposite purpose: it is trivial to change the module so that Stable Diffusion outputs only NSFW images.

Unless you're sure this is a good idea, I would suggest removing this module.

Reproduction

No response

Logs

No response

System Info

n/a
atarashansky commented 2 years ago

I don't think that's how the safety checker works. After a cursory look, it seems like it just adds a filter at the end of the pipeline that checks whether the generated image contains any NSFW concepts; if it does, the image is blacked out. It is not a generative model. I'm sure people don't need its help to figure out whether an image is NSFW, so I don't see how adding this safety checker will make things worse.
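For what it's worth, recent diffusers releases expose this directly: the pipeline output carries the (possibly blacked-out) images together with a per-image flag. A rough sketch, assuming the current StableDiffusionPipeline output fields images and nsfw_content_detected:

    from diffusers import StableDiffusionPipeline

    # The pipeline runs the safety checker internally and returns a per-image flag
    # alongside the images; flagged images have already been replaced with black images.
    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
    out = pipe("a photo of an astronaut riding a horse")

    for image, flagged in zip(out.images, out.nsfw_content_detected):
        print("NSFW flag:", flagged)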

rolux commented 2 years ago

@atarashansky: Yes, I am aware of how the safety checker works. And I agree, people are able to make their own judgements about what is NSFW and what is not. This safety module, however, can trivially be altered to take in large numbers of SD-generated images and output only the NSFW ones, without human intervention.

mallorbc commented 2 years ago

How does the filter even work?

We have the CLIP vision model and a linear projection layer that projects its output into the shared CLIP embedding space:

        self.vision_model = CLIPVisionModel(config.vision_config)
        self.visual_projection = nn.Linear(config.vision_config.hidden_size, config.projection_dim, bias=False)

We have two parameter matrices defined here, filled with ones:

        self.concept_embeds = nn.Parameter(torch.ones(17, config.projection_dim), requires_grad=False)
        self.special_care_embeds = nn.Parameter(torch.ones(3, config.projection_dim), requires_grad=False)

We then take the CLIP features of the generated image, put them through the CLIPVisionModel, and linearly project the pooled output:

        pooled_output = self.vision_model(clip_input)[1]  # pooled_output
        image_embeds = self.visual_projection(pooled_output)
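As far as I can tell, clip_input is just the CLIP feature extractor's preprocessed pixel values for the generated image. A rough sketch (the checkpoint name here is an assumption, not necessarily the exact one the pipeline ships with):

    from PIL import Image
    from transformers import CLIPFeatureExtractor

    # Preprocess an image the same way the pipeline does before handing it to the
    # safety checker. The checkpoint name is an assumption for illustration.
    feature_extractor = CLIPFeatureExtractor.from_pretrained("openai/clip-vit-large-patch14")
    image = Image.new("RGB", (512, 512))  # stand-in for a generated image
    clip_input = feature_extractor(images=image, return_tensors="pt").pixel_values  # shape (1, 3, 224, 224)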

The rest of the algorithm relies on the cosine similarity between these image embeddings and the matrices of ones, so there should not be any useful information there, as far as I can see:

        special_cos_dist = cosine_distance(image_embeds, self.special_care_embeds).cpu().numpy()
        cos_dist = cosine_distance(image_embeds, self.concept_embeds).cpu().numpy()
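If I had to guess, cosine_distance is just a normalized dot product, and the matrices of ones above are only initializers that get overwritten when the pretrained checkpoint is loaded. A minimal sketch of such a helper, under that assumption:

    import torch
    import torch.nn as nn

    def cosine_distance(image_embeds, text_embeds):
        # Normalize both sets of vectors, then take all pairwise dot products,
        # i.e. the cosine similarity of every image embedding with every concept embedding.
        normalized_image_embeds = nn.functional.normalize(image_embeds)
        normalized_text_embeds = nn.functional.normalize(text_embeds)
        return torch.mm(normalized_image_embeds, normalized_text_embeds.t())

    # Shapes: (batch, projection_dim) x (num_concepts, projection_dim) -> (batch, num_concepts)
    scores = cosine_distance(torch.randn(2, 768), torch.randn(17, 768))

Presumably the resulting scores are then compared against per-concept thresholds to decide whether an image gets blacked out.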

I would appreciate any insight on this, as it has obvious positive applications for content moderation outside of this model.

As for keeping or removing it, I don't believe it matters, due to the model's open-source nature: anyone moderately proficient in Python could remove the feature. One only needed to look briefly around the internet, less than 24 hours after the model was released, to see that.

One benefit of keeping the filter is that it perhaps limits the bad side of this model to those who are less likely to abuse it. There are also probably legal reasons to keep it. It's a very powerful tool and we are in uncharted territory, so perhaps it is best to keep the filter in place if it works well.

rolux commented 2 years ago

Given that Stable Diffusion struggles with human anatomy, and the "safety module" flags mostly false positives anyway, it's probably not such a big concern.