aleksusklim opened 1 year ago
All right, I tried to do this on my own, and came up with this dirty script:
```python
# clipcolors.py
import modules.scripts as scripts
from modules import shared
from modules.processing import process_images

class Script(scripts.Script):
    def title(self):
        return "clipcolors"

    def ui(self, is_img2img):
        return []

    def run(self, p):
        clip = shared.sd_model.cond_stage_model
        encoder = clip.wrapped.transformer.text_model.encoder
        pos = True
        h = encoder.forward  # original text-encoder forward

        def H(*ar, **kw):
            # Alternates: first call passes through unchanged,
            # second call does the multi-pass trick.
            nonlocal pos
            if pos:
                pos = False
                return h(*ar, **kw)
            pos = True
            inputs_embeds = kw['inputs_embeds']
            E = inputs_embeds[0]  # token embeddings, roughly (77, 768)
            a = 0
            b = 0
            c = None  # None = reference pass, True = per-group pass, False = final pass

            def G(f):
                # Wraps one encoder layer; y/z hold per-layer snapshots.
                y = None
                z = None

                def F(X, *ar, **kw):
                    nonlocal a, b, c, y, z
                    R = f(X, *ar, **kw)
                    r = R[0][0]  # this layer's hidden states
                    if c is None:
                        # Reference pass (all targets zeroed): snapshot it.
                        y = r.clone()
                        z = r.clone()
                    elif c:
                        # Per-group pass: keep the reference outside [a, b),
                        # collect this group's own slice into z.
                        r[:a, :] = y[:a, :]
                        r[b:, :] = y[b:, :]
                        z[a:b, :] = r[a:b, :]
                    else:
                        # Final pass: replay the combined states.
                        r[:, :] = z[:, :]
                    return R
                return F

            # (token_index, span_start, span_end) per target term,
            # hardcoded for the example prompt below.
            arr = [
                (14, 14, 16),
                (17, 17, 19),
                (20, 20, 22),
                (23, 23, 25),
                (26, 26, 28),
                (29, 29, 31),
                (32, 32, 34),
                (35, 35, 37),
            ]
            e = E.clone()
            for P in arr:
                E[P[0], :] = 0.0  # zero out every target token
            layers = encoder.layers
            for i in range(len(layers)):
                f = layers[i].forward
                F = G(f)
                F._f_ = f  # stash the original so it can be restored
                layers[i].forward = F
            try:
                h(*ar, **kw)  # reference pass (c is None)
                c = True
                for P in arr:
                    q = P[0]
                    E[q, :] = e[q, :]  # restore only this target token
                    a = P[1]
                    b = P[2]
                    h(*ar, **kw)       # per-group pass
                    E[q, :] = 0.0
                c = False
                r = h(*ar, **kw)       # final pass with combined states
            finally:
                for i in range(len(layers)):
                    layers[i].forward = layers[i].forward._f_
            return r

        encoder.forward = H
        try:
            proc = process_images(p)
        finally:
            encoder.forward = h
        return proc
#EOF
```
(I didn't test it well; it might leak memory or leave the model broken; it is safer to restart the WebUI afterwards just to be sure nothing is left over from previous runs.)
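On that note, the manual save/restore of `forward` in the `try`/`finally` blocks above can be wrapped in a small context manager, so the original method is restored even if generation throws. This is only a generic sketch (the `Dummy` class is a stand-in, not WebUI code):

```python
from contextlib import contextmanager

@contextmanager
def patched_forward(module, make_hook):
    """Temporarily replace module.forward, restoring it on exit."""
    original = module.forward
    module.forward = make_hook(original)
    try:
        yield original
    finally:
        module.forward = original

# Usage with a stand-in module (in the script above, `module` would be
# the CLIP text encoder and `make_hook` would build H from h):
class Dummy:
    def forward(self, x):
        return x + 1

m = Dummy()
with patched_forward(m, lambda f: (lambda x: f(x) * 10)):
    assert m.forward(2) == 30  # hooked: (2 + 1) * 10
assert m.forward(2) == 3       # original restored
```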
The actual token positions are not currently exposed in the UI; I set them as a constant array in the code, tuned for this exact prompt:
full-body photo, beautiful girl is sitting on the floor, red eyes, green shirt, yellow skirt, blue shoes, white hair, black background, orange gloves, purple light, best quality, masterpiece
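For illustration, finding those (token_index, span_start, span_end) tuples automatically might look like the helper below. Note this uses naive whitespace tokens as a stand-in for CLIP's BPE tokenizer, so the indices it produces will differ from the real ones in the hardcoded array:

```python
def find_term_positions(tokens, terms):
    """Map each target term to an (index, span_start, span_end) tuple.
    The span covers the term plus the word that follows it (e.g. "red
    eyes"), mirroring the shape of the tuples in the script above.
    NOTE: plain word tokens stand in for CLIP's BPE tokenizer here,
    so real positions will differ."""
    out = []
    for i, tok in enumerate(tokens):
        if tok in terms:
            out.append((i, i, min(i + 2, len(tokens))))
    return out

tokens = ("<bos> full - body photo beautiful girl is sitting on the "
          "floor red eyes green shirt yellow skirt").split()
print(find_term_positions(tokens, {"red", "green", "yellow"}))
# → [(12, 12, 14), (14, 14, 16), (16, 16, 18)]
```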
The algorithm is:
I am not happy with its effect! It is actually about as good, and as bad, as your own Cutoff with weight=1 and "Cutoff strongly". No clear additional benefits…
For example, this is my test (model: suzumehachi, seed: 2277229613, negative: cropped, out-of-frame, bad anatomy, worst quality):
Original:
Without restoring target tokens:
My main result with restored tokens:
Your Cutoff with default settings gives this when targeted at
red, green, yellow, blue, white, black, orange, purple
With Cutoff strong and Weight 1, it gives:
And this one is for weight = 2:
For me, it is more or less the same thing. My method doesn't add anything valuable for preventing color shifts.
But now I have another idea!
Will this help the U-Net not to shift colors? This way, not only would CLIP process "red" without knowing anything about "green" or the other colors, but the U-Net would also attend to "red" in the same regions where it attends to "eyes", but not "shirt" or anything else.
Let me explain. As I understand it, you: 1) split the prompt at each comma, if there is a target term in between; 2) for each group, create two versions of it: the original, and one with the target token replaced; 3) transform everything several times with CLIP; 4) somehow combine the resulting arrays so that each group mainly receives its original part while it "sees" the other groups as replaced.
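If I read those four steps right, the recombination can be sketched like this, with a toy `encode` standing in for CLIP and string "embeddings" just to make leakage between groups visible (this is only my reading, not your actual implementation):

```python
def combine(groups, encode, replace_token):
    """Steps 2-4 as I understand them: for each group, encode a prompt
    where every OTHER group is replaced, then keep only that group's
    own span from the result; concatenate the kept spans."""
    result = []
    for i, _ in enumerate(groups):
        variant = [g if j == i else replace_token for j, g in enumerate(groups)]
        encoded = encode(variant)   # one CLIP pass per group
        result.append(encoded[i])   # keep only this group's span
    return result

# Toy encoder: each group's "embedding" is its text plus a suffix that
# depends on the whole prompt, so any cross-group influence shows up.
def toy_encode(groups):
    ctx = str(len("".join(groups)))
    return [g + "|" + ctx for g in groups]

print(combine(["red eyes", "green shirt"], toy_encode, "_"))
# → ['red eyes|9', 'green shirt|12']
```

Each group's output only carries context from a prompt in which the other groups were already replaced, which is the whole point of the scheme.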
What if we did this not only for the last CLIP layer (known as ClipSkip 1), but for every available layer? (Yes, it would only work for SD1 in this case, but it's worth trying!)
I propose something like this: 1) replace every term, send the prompt to CLIP, and extract the value at each layer (10x77x768, or so); 2) for each group, freeze all of CLIP's intermediate values except those of the current group, across all layers; 3) when CLIP transforms each part, it should not rewrite any frozen number that belongs to the other groups (this can be imagined as "inpainting" all the way from the first layer to the last, but only for a small subset of the embedding vectors).
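A minimal sketch of that per-layer "inpainting", with plain Python lists standing in for the 77x768 hidden states and a toy layer function (a real version would hook each `encoder.layers[i].forward`, as the script above already does):

```python
def run_frozen(x, layers, frozen_states, span):
    """Run x through the layers, but after every layer overwrite all
    positions OUTSIDE [span_start, span_end) with that layer's
    precomputed reference state, so only the current group's slots
    evolve from the first layer to the last."""
    start, end = span
    for layer, ref in zip(layers, frozen_states):
        x = layer(x)
        x = ref[:start] + x[start:end] + ref[end:]
    return x

# Toy setup: a "hidden state" is a list of ints; each layer adds 1.
layer = lambda xs: [v + 1 for v in xs]
layers = [layer, layer]
# Reference states from a separate all-replaced pass (here: zeros).
frozen = [[0, 0, 0, 0], [0, 0, 0, 0]]
out = run_frozen([5, 5, 5, 5], layers, frozen, span=(1, 3))
print(out)  # → [0, 7, 7, 0]: only positions 1..2 evolve
```

One caveat the toy hides: in a real transformer, the frozen positions would still feed attention before being overwritten, so the isolation is not quite absolute within a single layer.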
(Personally, I don't know how to technically hook/replace CLIP's behavior, but theoretically it should be possible.)
In this scenario, not a single bit of color information would leave its group! The composition might change severely (closely resembling the one with already-replaced terms), and the colors might not play nicely with each other (or come out washed out), but we'd need to see for ourselves.
What do you think?