Is there any update plan for Adaface?

whalefa1I commented 2 years ago

FYI https://paperswithcode.com/paper/adaface-quality-adaptive-margin-for-face

leondgarse commented 2 years ago

I'll take a look, thanks for the reminder. I'm currently occupied with something else. For this project, I've recently been trying partialFC on Glint360K, and the training takes a long time...

whalefa1I commented 2 years ago

I've seen some people use Gradient Accumulation when training CLIP multimodal pre-trained models. Take a look and see whether it could help you.

leondgarse commented 2 years ago

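That's a good option too. It may be worth waiting, since it should be added to tensorflow-addons in Gradient accumulate optimizer #2260.
The current partialFC implementation results have been updated in Training on large datasets with a lot of identities #90.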
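
For reference, the idea behind gradient accumulation can be sketched without any framework. This is a minimal NumPy illustration on a toy linear-regression problem, not the tensorflow-addons optimizer mentioned above; `mse_grad` and `accumulated_sgd_step` are names made up for the example. It averages the gradients of several equal-sized micro-batches before applying a single SGD update, which for a mean-reduced loss gives the same step as the full batch.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of mean((x @ w - y) ** 2) with respect to w."""
    return 2.0 * x.T @ (x @ w - y) / len(x)

def accumulated_sgd_step(w, x, y, lr, accum_steps):
    """One SGD update built from `accum_steps` equal-sized micro-batches."""
    grad_sum = np.zeros_like(w)
    for xb, yb in zip(np.split(x, accum_steps), np.split(y, accum_steps)):
        grad_sum += mse_grad(w, xb, yb)  # accumulate only, no weight update yet
    return w - lr * grad_sum / accum_steps  # apply the averaged gradient once

rng = np.random.default_rng(0)
x, y = rng.normal(size=(64, 8)), rng.normal(size=64)
w0 = np.zeros(8)

w_full = w0 - 0.1 * mse_grad(w0, x, y)            # one step on the full batch of 64
w_accum = accumulated_sgd_step(w0, x, y, 0.1, 4)  # four micro-batches of 16
print(np.allclose(w_full, w_accum))
```

The equivalence only holds for the gradient itself. Batch-dependent layers such as batch normalization still compute their statistics on each micro-batch.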

leondgarse commented 2 years ago

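Working on it, working on it. I've already converted the r100 models for webface4m and webface12m. Once I've run validation on them I'll start writing the loss function.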

leondgarse commented 2 years ago

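AdaFaceLoss has been updated:

Two converted models, r100 webface4m and r100 webface12m
AdaFaceLoss has only been run for a few batches to confirm that the loss converges during training; it has not been through a full training run yet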

whalefa1I commented 2 years ago

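Both AdaFace and MagFace have a way of reflecting face image quality, basically just computing a norm or something similar. Have you looked into it?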

leondgarse commented 2 years ago

Right. Converted MagFace / AdaFace r50 / r100 model and face quality testing #57 is exactly that: a test on cfp_fp / agedb_30 using the norm value as the face quality score.
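
As a rough illustration of what "norm as quality" means here: the L2 norm of the un-normalized embedding is read as a quality score, while the normalized embedding is what gets used for matching. A minimal NumPy sketch with random stand-in embeddings follows; the function name and the data are hypothetical, not code from #57.

```python
import numpy as np

def split_norm_and_direction(embeddings, eps=1e-12):
    """Return (quality, unit_embeddings) for a [batch, dim] array of raw embeddings."""
    norms = np.linalg.norm(embeddings, axis=-1, keepdims=True)
    unit = embeddings / np.maximum(norms, eps)  # direction, used for cosine matching
    return norms[:, 0], unit                    # norm, used as the quality proxy

rng = np.random.default_rng(0)
# Stand-in embeddings whose rows are scaled so that their norms clearly differ.
raw = rng.normal(size=(4, 512)) * np.array([[0.5], [1.0], [2.0], [4.0]])
quality, unit = split_norm_and_direction(raw)
ranking = np.argsort(-quality)  # indices from highest to lowest norm
print(quality.round(2), ranking)
```

Whether the norm really tracks image quality depends on the loss the model was trained with, which is what the test in #57 is meant to check.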

whalefa1I commented 2 years ago

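You're really fast. Is there a model available? The one I trained myself doesn't seem to work well. Why do blurry images score higher than good ones?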

whalefa1I commented 2 years ago

Oh, I refreshed and can see the models now. I'll give them a try.

leondgarse commented 2 years ago

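The face quality test seems worse than MagFace; I still need to check how the paper uses it. The EffV2S,MagFace entry in the Readme is one I trained myself, and its quality test results look decent.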

leondgarse commented 2 years ago

There is also QMagFace: Simple and Accurate Quality-Aware Face Recognition, which takes MagFace results and does further face quality training on top of them.

leondgarse commented 2 years ago

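Judging from Appendix B.1. Correlation between Norm and BRISQUE during Training in the paper, it seems the AdaFace norm value cannot be used to judge face quality
Regarding the implementation safe_norms = safe_norms.clone().detach() at AdaFace head.py#L72, I think the whole margin computation should be placed inside tf.stop_gradient, which also matches the paper's statement Gradient doesn't flow to ∥zi∥
The previously uploaded models were ported from the official ones and take BGR input. I have re-uploaded two adaface_ir101_webface*m_rgb.h5 models that take RGB input, which corrects the accuracy on the validation datasets
Please don't use the current AdaFace implementation for now; I need to run a training to verify it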
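
On the BGR / RGB point: if an image was loaded in BGR order (as OpenCV does by default) and the model expects RGB, or the other way round, reversing the last axis swaps the channel order. A small NumPy sketch, where the array is only a stand-in for a real 112x112 face crop and `swap_rb` is a hypothetical helper name:

```python
import numpy as np

def swap_rb(image):
    """Swap BGR <-> RGB for a channel-last image array of shape [..., H, W, 3]."""
    return image[..., ::-1]

bgr = np.zeros((112, 112, 3), dtype=np.uint8)
bgr[..., 0] = 255   # pure blue when interpreted in BGR order
rgb = swap_rb(bgr)
print(rgb[0, 0])    # the blue value now sits in the last channel
```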

whalefa1I commented 2 years ago

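1. How do you port a torch model to the tf format? Do you need to reimplement the code in the other framework and retrain, or is it enough to extract the weights? 2. Is there a way to fix the random initialization values across the two frameworks, to check whether the reproduced results are consistent?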

leondgarse commented 2 years ago

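Convert the weights directly. The detailed process is in Atom_notebook/adaface-model, which uses another project of mine, keras_cv_attention_models.
For random initialization you can usually fix the random seed, or initialize the weights in PyTorch and then convert them to a Keras model using the approach from 1. But there are many other issues: for example, SGD applies weight_decay differently, and AdaFace training adds augmentations such as random cropping / random quality augmentation. Fixing the initial values alone cannot guarantee an exact reproduction of the training process.
My earlier training runs all used AdamW (EfficientNetV2S + adamw / r100 + adamw). A few days ago I found a problem with adamw training: it makes the batch_norm moving_variance grow very large, which may be what caused loss=nan. I'm re-running with sgd / sgdw.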

whalefa1I commented 2 years ago

You are my god

leondgarse commented 2 years ago

That's really not necessary

whalefa1I commented 2 years ago

I put download_and_load and test_images from keras_cv_attention_models into the project, and also added the net and head from adaface. The ckpt I downloaded is "adaface_ir101_webface4m.ckpt", but I got an error during conversion

====================
stack1_block1_shortcut_conv
Traceback (most recent call last):
  File "/data/xixi/project/Github/Keras_insightface/torch_model_conversion.py", line 21, in <module>
    download_and_load.keras_reload_from_torch_model(
  File "/data/xixi/project/Github/Keras_insightface/download_and_load.py", line 311, in keras_reload_from_torch_model
    keras_reload_stacked_state_dict(keras_model, stacked_state_dict, aligned_names, additional_transfer, save_name=save_name)
  File "/data/xixi/project/Github/Keras_insightface/download_and_load.py", line 166, in keras_reload_stacked_state_dict
    torch_weight[0] = np.transpose(torch_weight[0], (2, 3, 1, 0))
  File "<__array_function__ internals>", line 180, in transpose
  File "/home/nlp/.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 660, in transpose
    return _wrapfunc(a, 'transpose', axes)
  File "/home/nlp/.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return bound(*args, **kwds)
ValueError: axes don't match array

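Specifically, in the 'stack1_block1_shortcut_conv' layer the weight's dimensions are not those of an ordinary convolution, so the transpose fails.

Is there anything else I need to change?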
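
The error can be reproduced in isolation. A PyTorch Conv2d kernel is stored as [out, in, kh, kw], while a Keras Conv2D kernel is [kh, kw, in, out], hence the (2, 3, 1, 0) transpose in the traceback. One way the failure can arise is when the name alignment pairs a Keras conv layer with a torch tensor that is not 4-D (a batch-norm vector, say). The sketch below uses made-up shapes and a hypothetical helper, `torch_conv_to_keras`, that checks the rank first.

```python
import numpy as np

def torch_conv_to_keras(kernel):
    """[out, in, kh, kw] (PyTorch Conv2d) -> [kh, kw, in, out] (Keras Conv2D)."""
    if kernel.ndim != 4:
        raise ValueError(f"expected a 4-D conv kernel, got shape {kernel.shape}")
    return np.transpose(kernel, (2, 3, 1, 0))

conv_kernel = np.zeros((64, 32, 3, 3))  # hypothetical 3x3 conv, 32 -> 64 channels
print(torch_conv_to_keras(conv_kernel).shape)

bn_gamma = np.zeros((64,))              # a 1-D weight paired with a conv by mistake
try:
    np.transpose(bn_gamma, (2, 3, 1, 0))
except ValueError as err:
    print("ValueError:", err)           # raised because the array is not 4-D
```

Printing the layer name and weight shapes right before the transpose is a quick way to see which pair is misaligned.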

whalefa1I commented 2 years ago

The code I used is:

mm = models.buildin_models('r100', output_layer='E', activation="PReLU", bn_momentum=0.9, bn_epsilon=1e-5, use_bias=True, scale=False, use_max_pool=True)

tail_align_dict = {"shortcut_conv": -4, "shortcut_bn": -5}
full_name_align_dict = {"E_batchnorm": 3, "E_dense": 4, "pre_embedding": 5}
# [25088, 512] -> CHW + out [512, 7, 7, 512] -> HWC + out [7, 7, 512, 512] -> [25088, 512]
additional_transfer={
    "E_dense": lambda ww: [ww[0].reshape(512, 7, 7, 512).transpose([1, 2, 0, 3]).reshape([-1, 512]), ww[1]],
    "pre_embedding": lambda ww: [np.zeros(512), *ww],
}
download_and_load.keras_reload_from_torch_model(
    'adaface_ir101_webface4m.ckpt',
    keras_model=mm,
    tail_align_dict=tail_align_dict,
    full_name_align_dict=full_name_align_dict,
    additional_transfer=additional_transfer,
    input_shape=(112, 112),
    do_convert=True,
    save_name="adaface_ir101_webface4m.h5",
)
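
The `E_dense` lambda reorders the rows of the dense kernel because PyTorch flattens the final feature map in channel-first (C, H, W) order, while Keras flattens it channel-last (H, W, C). Below is a small NumPy check of that idea with toy sizes instead of 512 x 7 x 7. It assumes, as the comment in the snippet suggests, that the kernel arrives already in [in, out] layout.

```python
import numpy as np

C, H, W, OUT = 4, 3, 3, 5  # toy sizes; the real model uses 512, 7, 7, 512
rng = np.random.default_rng(0)

feat_chw = rng.normal(size=(C, H, W))           # PyTorch-style feature map
kernel_chw = rng.normal(size=(C * H * W, OUT))  # dense kernel, rows in CHW-flatten order

# Same reordering as the E_dense lambda: rows in CHW order -> rows in HWC order.
kernel_hwc = kernel_chw.reshape(C, H, W, OUT).transpose(1, 2, 0, 3).reshape(-1, OUT)

feat_hwc = feat_chw.transpose(1, 2, 0)          # Keras-style channel-last feature map
out_torch = feat_chw.reshape(-1) @ kernel_chw
out_keras = feat_hwc.reshape(-1) @ kernel_hwc
print(np.allclose(out_torch, out_keras))
```

Without the reordering, the Keras dense layer would multiply each weight row with the wrong feature element and the converted model's embeddings would not match the original.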

leondgarse commented 2 years ago

Your Keras_insightface/backbones/resnet.py probably hasn't been updated. With use_max_pool=True, stack_1_block_1 has no shortcut_conv, which is the resnet structure AdaFace uses. Update your Keras_insightface/backbones/resnet.py.

whalefa1I commented 2 years ago

import numpy as np
import tensorflow as tf

y_true = tf.one_hot(tf.random.uniform([32], 1, 10, dtype='int32'), 10)
y_pred = tf.random.uniform([32, 10])
y_pred_norm = tf.concat([y_pred, tf.norm(y_pred, axis=-1, keepdims=True)], axis=-1)
import losses
aa = losses.AdaFaceLoss()
print(aa(y_true, y_pred_norm))

import torch
import head
from torch.nn import CrossEntropyLoss
bb = head.AdaFace(embedding_size=10, classnum=32)
cc = bb(torch.from_numpy(y_pred_norm[:, :-1].numpy()), torch.from_numpy(y_pred_norm[:, -1:].numpy()), torch.from_numpy(np.argmax(y_true, axis=-1)))
cross_entropy_loss = CrossEntropyLoss()
loss = cross_entropy_loss(cc, torch.from_numpy(np.argmax(y_true, axis=-1)))
print(loss)

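Is it because the original head first applies a randomly initialized fully connected layer to compute the cosine for the margin loss, whereas on the keras side the norm has already been applied before the loss, so the two are not numerically comparable? When checking how well backprop is reproduced, is it generally enough that the results are roughly similar, match the paper's description, and converge, or should one find a way to control the values strictly?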

leondgarse commented 2 years ago

Ah, you mean this. This comparison test needs a few code changes:

In PyTorch's head.py, use the input embbedings directly as the cosine values

65     def forward(self, embbedings, norms, label):
66
67         # kernel_norm = l2_norm(self.kernel,axis=0)
68         # cosine = torch.mm(embbedings,kernel_norm)
69         # cosine = cosine.clamp(-1+self.eps, 1-self.eps) # for stability
70         cosine = embbedings

In Keras losses.py, in AdaFaceLoss, uncomment return arcface_logits on line 408 so that arcface_logits is returned directly

408        return arcface_logits
409        # return tf.keras.losses.categorical_crossentropy(y_true, arcface_logits, from_logits=self.from_logits, label_smoothing=self.label_smoothing)

Test

import sys
import numpy as np
import tensorflow as tf

y_true = tf.one_hot(tf.random.uniform([32], 1, 10, dtype='int32'), 10)
y_pred = tf.random.uniform([32, 10])
y_pred_norm = tf.concat([y_pred, tf.norm(y_pred, axis=-1, keepdims=True)], axis=-1)
import losses
aa = losses.AdaFaceLoss()
aa(y_true, y_pred_norm)

sys.path.append('../AdaFace-master/')
import torch
import head
bb = head.AdaFace(t_alpha=0.01)
cc = bb(torch.from_numpy(y_pred_norm[:, :-1].numpy()), torch.from_numpy(y_pred_norm[:, -1:].numpy()), torch.from_numpy(np.argmax(y_true, axis=-1)))

print(f"{aa(y_true, y_pred_norm).numpy() = }, {cc.mean() = }")
# aa(y_true, y_pred_norm).numpy() = 30.912012, cc.mean() = tensor(30.9092)

With the 64x scale factor removed, the two values are essentially the same

print(f"{aa(y_true, y_pred_norm).numpy() / 64 = }, {cc.mean() / 64 = }")
# aa(y_true, y_pred_norm).numpy() / 64 = 0.4830001890659332, cc.mean() / 64 = tensor(0.4830)

leondgarse commented 2 years ago

Did the model conversion above succeed?

whalefa1I commented 2 years ago

It worked! It must have been the shortcut issue.

whalefa1I commented 2 years ago

Regarding the implementation safe_norms = safe_norms.clone().detach() at AdaFace head.py#L72, I think the whole margin computation should be placed inside tf.stop_gradient, which also matches the paper's statement Gradient doesn't flow to ∥zi∥

norm_mean = tf.stop_gradient(tf.math.reduce_mean(feature_norm))
samples = tf.cast(tf.maximum(1, feature_norm.shape[0] - 1), feature_norm.dtype)
norm_std = tf.stop_gradient(tf.sqrt(tf.math.reduce_sum((feature_norm - norm_mean) ** 2) / samples))  # Torch std
self.batch_mean.assign(self.mean_std_alpha * norm_mean + (1.0 - self.mean_std_alpha) * self.batch_mean)
self.batch_std.assign(self.mean_std_alpha * norm_std + (1.0 - self.mean_std_alpha) * self.batch_std)

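Is there anything specific that needs changing? It doesn't look any different to me. Is the stop-gradient logic different between the two frameworks?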

leondgarse commented 2 years ago

Updated. That part hadn't been updated earlier because the training hadn't finished yet.

whalefa1I commented 2 years ago

In that case, would the equivalent PyTorch be

with torch.no_grad():
        mean = safe_norms.mean().detach()
        std = safe_norms.std().detach()
        self.batch_mean = mean * self.t_alpha + (1 - self.t_alpha) * self.batch_mean
        self.batch_std =  std * self.t_alpha + (1 - self.t_alpha) * self.batch_std

        margin_scaler = (safe_norms - self.batch_mean) / (self.batch_std+self.eps) # 66% between -1, 1
        margin_scaler = margin_scaler * self.h # 68% between -0.333 ,0.333 when h:0.333
        margin_scaler = torch.clip(margin_scaler, -1, 1)

Or is it fine in torch to leave it outside?
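
For readers following along without either framework installed, the margin-scaler part of the snippet above can be mirrored in NumPy. This is only a sketch of the forward computation (NumPy has no autograd, so the gradient question does not arise). Parameter names follow the PyTorch snippet; the default values for `t_alpha`, `h` and `eps` and the initial running statistics are assumptions chosen for illustration.

```python
import numpy as np

def margin_scaler(norms, batch_mean, batch_std, t_alpha=0.01, h=0.333, eps=1e-3):
    """Return (scaler, new_batch_mean, new_batch_std) for a [batch, 1] array of feature norms."""
    mean = norms.mean()
    std = norms.std(ddof=1)  # unbiased estimate, matching the default of torch.Tensor.std
    # Exponential moving average of the batch statistics.
    batch_mean = t_alpha * mean + (1.0 - t_alpha) * batch_mean
    batch_std = t_alpha * std + (1.0 - t_alpha) * batch_std
    scaler = (norms - batch_mean) / (batch_std + eps)  # standardize the norms
    scaler = np.clip(scaler * h, -1.0, 1.0)            # shrink by h, then bound to [-1, 1]
    return scaler, batch_mean, batch_std

rng = np.random.default_rng(0)
norms = rng.normal(loc=20.0, scale=5.0, size=(32, 1))  # synthetic feature norms
scaler, new_mean, new_std = margin_scaler(norms, batch_mean=20.0, batch_std=100.0)
print(scaler.min(), scaler.max(), new_mean, new_std)
```

The `ddof=1` argument plays the same role as the division by `samples` (batch size minus one) in the TensorFlow lines quoted earlier in the thread.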

leondgarse commented 2 years ago

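I'm not that familiar with pytorch. Based on some articles, e.g. Difference between detach().clone() and clone().detach(), I believe safe_norms = safe_norms.clone().detach() should be equivalent to putting all the safe_norms-related computation inside torch.no_grad. Using clone().detach() may simply be a way to make more certain that the gradient is cut off.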

leondgarse commented 2 years ago

This one explains it better: Detach, no_grad and requires_grad

whalefa1I commented 2 years ago

I think so. It's probably meant as a double safeguard, or at most torch.no_grad does some memory optimization and is a bit faster.

leondgarse commented 2 years ago

The results so far look pretty good, r50 + SGD + AdaFace 53 epochs:

import os
import losses, train, models
import tensorflow_addons as tfa
from tensorflow import keras
keras.mixed_precision.set_global_policy("mixed_float16")

data_basic_path = '/datasets/ms1m-retinaface-t1'
data_path = data_basic_path + '_112x112_folders'
eval_paths = [os.path.join(data_basic_path, ii) for ii in ['lfw.bin', 'cfp_fp.bin', 'agedb_30.bin']]

basic_model = models.buildin_models('r50', dropout=0.4, emb_shape=512, output_layer='E', bn_momentum=0.9, bn_epsilon=1e-5, scale=True, use_bias=False, activation='prelu', use_max_pool=True)
basic_model = models.add_l2_regularizer_2_model(basic_model, weight_decay=5e-4, apply_to_batch_normal=False)

tt = train.Train(data_path, eval_paths=eval_paths,
    save_path='TT_r50_max_pool_E_prelu_dr04_lr_01_l2_5e4_adaface_emb512_sgd_m09_bs512_ms1m_64_only_margin_SG_scale_true_bias_false_random_100.h5',
    basic_model=basic_model, model=None, lr_base=0.1, lr_decay=0.5, lr_decay_steps=16, lr_min=1e-6, lr_warmup_steps=3,
    batch_size=512, random_status=100, eval_freq=4000, output_weight_decay=1)

# optimizer = tfa.optimizers.AdamW(learning_rate=1e-2, weight_decay=5e-4, exclude_from_weight_decay=["/gamma", "/beta"])
# optimizer = tfa.optimizers.SGDW(learning_rate=1e-2, weight_decay=5e-6, momentum=0.9, exclude_from_weight_decay=["/gamma", "/beta"])
optimizer = keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
sch = [
    {"loss": losses.AdaFaceLoss(scale=64), "epoch": 53, "optimizer": optimizer},
]
tt.train(sch, 0)

TAR@FAR	1e-06	1e-05	0.0001	0.001	0.01	0.1	AUC
r50 IJBB	0.393379	0.91334	0.955501	0.970204	0.978773	0.986465	0.993366
r50 IJBC	0.888633	0.952702	0.969269	0.979496	0.985734	0.991052	0.995485

Results of PyTorch training for 26 epochs:

Arch	Dataset	Method	IJBB TAR@FAR=0.01%	IJBC TAR@FAR=0.01%
R50	WebFace4M	AdaFace	95.44	96.98
R50	MS1MV2	AdaFace	94.82	96.27
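
For reading the tables: each column is the true accept rate (TAR) at a fixed false accept rate (FAR), so the 0.0001 column corresponds to TAR@FAR=0.01%. A minimal NumPy sketch of how such a number can be computed from similarity scores follows. It uses synthetic scores and a simple quantile threshold, and is not the IJB evaluation script that produced the results above; `tar_at_far` is a hypothetical helper.

```python
import numpy as np

def tar_at_far(genuine, impostor, far):
    """TAR at the threshold where roughly a fraction `far` of impostor scores is accepted."""
    threshold = np.quantile(impostor, 1.0 - far)  # accept pairs scoring above this value
    return float(np.mean(genuine > threshold))

rng = np.random.default_rng(0)
genuine = rng.normal(loc=0.6, scale=0.1, size=10_000)    # synthetic same-identity scores
impostor = rng.normal(loc=0.0, scale=0.1, size=100_000)  # synthetic different-identity scores

for far in (1e-4, 1e-3, 1e-2):
    print(f"TAR@FAR={far:g}: {tar_at_far(genuine, impostor, far):.4f}")
```

Very small FAR values such as 1e-06 need a large number of impostor pairs before the threshold estimate becomes stable.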

whalefa1I commented 2 years ago

This one is pretty fun. It seems you could just attach it after the model.

leondgarse commented 2 years ago

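I took a quick look and it doesn't seem fast: for a single image, get_scaled_quality runs the forward pass 100 times and get_gradients then runs the backward pass, so it doesn't look easy to integrate with the current implementation.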

leondgarse commented 2 years ago

The AdaFace + r100 training results should be uploaded within the next few days. The 53-epoch results are IJBB 0.961636, IJBC 0.972849, compared with IJBB 95.84, IJBC 97.09 for PyTorch at 26 epochs.

whalefa1I commented 2 years ago

Congratulations!!! You're amazing!!! Then ghostnet alone is enough for me! Take care of yourself.

leondgarse commented 2 years ago

The r100 training results have been uploaded and can serve as a reference for training ghostnet.

leondgarse / Keras_insightface

Is there any update plan for Adaface? #91