Insightface Keras implementation
Is there any update plan for Adaface? #91

whalefa1I commented 2 years ago

FYI https://paperswithcode.com/paper/adaface-quality-adaptive-margin-for-face

leondgarse commented 2 years ago

I'll take a look, thanks for the reminder. I'm currently occupied with something else. For this project, I've recently been trying partialFC on Glint360K, and the training takes a long time...

whalefa1I commented 2 years ago

I've seen some people use Gradient Accumulation when training CLIP multimodal pretraining models. Take a look and see whether it could help you.
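
For reference, here is a minimal NumPy sketch of what gradient accumulation does, assuming a toy mean-squared-error linear model (the model, data, and function names are illustrative and not part of Keras_insightface). Averaging the gradients of equal-sized micro-batches reproduces the full-batch gradient, so a large effective batch fits in less memory:

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of mean((x @ w - y) ** 2) with respect to w."""
    residual = x @ w - y
    return 2.0 * x.T @ residual / len(x)

def accumulated_grad(w, x, y, accum_steps):
    """Average the gradients of `accum_steps` equal-sized micro-batches."""
    total = np.zeros_like(w)
    for xb, yb in zip(np.split(x, accum_steps), np.split(y, accum_steps)):
        total += mse_grad(w, xb, yb)
    return total / accum_steps

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
y = rng.normal(size=64)
w = rng.normal(size=8)

full = mse_grad(w, x, y)                          # one pass over all 64 samples
accum = accumulated_grad(w, x, y, accum_steps=4)  # four micro-batches of 16
print(np.allclose(full, accum))
```

Layers that depend on batch statistics, such as batch normalization, still only see one micro-batch at a time, so accumulation is not a perfect substitute for a larger batch.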

leondgarse commented 2 years ago

Working on it. I've already converted the r100 models for webface4m and webface12m. Once the validation runs are done I'll start writing the loss function.

leondgarse commented 2 years ago

AdaFaceLoss has been updated:

leondgarse commented 2 years ago

Right. Converted MagFace / AdaFace r50 / r100 model and face quality testing #57 is exactly that: a test on cfp_fp / agedb_30 that uses the norm value as the face quality score.
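
As a rough illustration of using the norm as a quality score, here is a minimal NumPy sketch. It assumes the embeddings are the un-normalized model outputs; the array values are made up for the example:

```python
import numpy as np

# Hypothetical un-normalized embeddings for three face images.
embeddings = np.array([
    [3.0, 4.0],
    [0.6, 0.8],
    [6.0, 8.0],
])

# The L2 norm of each embedding acts as the quality score.
quality = np.linalg.norm(embeddings, axis=-1)

# Rank images from highest to lowest quality.
order = np.argsort(quality)[::-1]
print(quality, order)
```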

leondgarse commented 2 years ago

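The face quality test results don't seem as good as magface's. I still need to check how the paper actually uses it. The EffV2S, MagFace entry in the Readme is a model I trained myself, and its quality test results look decent.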

leondgarse commented 2 years ago

There is also QMagFace: Simple and Accurate Quality-Aware Face Recognition, which trains face quality on top of the MagFace results.

leondgarse commented 2 years ago
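1. How do you port a torch model to the tf format? Do I need to reimplement the code in the other framework and retrain, or is it enough to take the weights out?
2. Is there a way to fix the random initialization values in both frameworks, to check whether the reproduced results match?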

leondgarse commented 2 years ago
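  1. Convert the weights directly. The detailed process is in Atom_notebook/adaface-model, which uses another project of mine, keras_cv_attention_models.
  2. For random initialization you can usually fix the random seed, or initialize the weights in pytorch and then convert them to a keras model as in 1. But there are many other differences. For example, SGD applies weight_decay differently, and Adaface adds augmentations such as random cropping / random quality degradation during training, so fixing the initial values alone cannot guarantee an exact reproduction of the training process.
  3. My earlier training runs all used Adamw (EfficientNetV2S + adamw / r100 + adamw). A few days ago I found a problem with adamw training: it makes the batch_norm moving_variance grow very large, which may be what caused loss=nan. I'm re-running with sgd / sgdw.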
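
A minimal sketch of what converting the weights directly involves for the two most common layer types, assuming plain (non-grouped) convolutions. The helper names are hypothetical and this is not the code from Atom_notebook:

```python
import numpy as np

def torch_conv_to_keras(kernel):
    """[out, in, kh, kw] (PyTorch Conv2d) -> [kh, kw, in, out] (Keras Conv2D)."""
    return np.transpose(kernel, (2, 3, 1, 0))

def torch_dense_to_keras(weight):
    """[out, in] (PyTorch Linear) -> [in, out] (Keras Dense)."""
    return weight.T

conv_kernel = np.zeros((64, 16, 3, 3), dtype="float32")  # out=64, in=16, 3x3 kernel
dense_weight = np.zeros((512, 128), dtype="float32")     # out=512, in=128

keras_conv = torch_conv_to_keras(conv_kernel)
keras_dense = torch_dense_to_keras(dense_weight)
print(keras_conv.shape, keras_dense.shape)
```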
You are my hero.

Traceback (most recent call last):
  File "/data/xixi/project/Github/Keras_insightface/torch_model_conversion.py", line 21, in <module>
  File "/data/xixi/project/Github/Keras_insightface/download_and_load.py", line 311, in keras_reload_from_torch_model
    keras_reload_stacked_state_dict(keras_model, stacked_state_dict, aligned_names, additional_transfer, save_name=save_name)
  File "/data/xixi/project/Github/Keras_insightface/download_and_load.py", line 166, in keras_reload_stacked_state_dict
    torch_weight[0] = np.transpose(torch_weight[0], (2, 3, 1, 0))
  File "<__array_function__ internals>", line 180, in transpose
  File "/home/nlp/.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 660, in transpose
    return _wrapfunc(a, 'transpose', axes)
  File "/home/nlp/.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return bound(*args, **kwds)
ValueError: axes don't match array

Specifically, the error is in the 'stack1_block1_shortcut_conv' layer: the weight there does not have the dimensions of an ordinary convolution, so it cannot be transposed.
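
np.transpose raises "axes don't match array" whenever the number of axes passed in differs from the array's number of dimensions, which typically happens when the torch and keras weight lists are out of alignment. A small guard like the sketch below (the helper name is hypothetical, not part of download_and_load.py) reports which layer received the wrong weight:

```python
import numpy as np

def safe_conv_transpose(name, weight):
    """Transpose a 4-D conv kernel, or raise an error that names the offending layer."""
    if weight.ndim != 4:
        raise ValueError(f"{name}: expected a 4-D conv kernel, got shape {weight.shape}")
    return np.transpose(weight, (2, 3, 1, 0))

# A 1-D batch-norm style weight that was aligned to a conv layer by mistake.
misaligned = np.ones(64, dtype="float32")
try:
    safe_conv_transpose("stack1_block1_shortcut_conv", misaligned)
except ValueError as err:
    message = str(err)
    print(message)
```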


mm = models.buildin_models('r100', output_layer='E', activation="PReLU", bn_momentum=0.9, bn_epsilon=1e-5, use_bias=True, scale=False, use_max_pool=True)

tail_align_dict = {"shortcut_conv": -4, "shortcut_bn": -5}
full_name_align_dict = {"E_batchnorm": 3, "E_dense": 4, "pre_embedding": 5}
additional_transfer = {
    # [25088, 512] -> CHW + out [512, 7, 7, 512] -> HWC + out [7, 7, 512, 512] -> [25088, 512]
    "E_dense": lambda ww: [ww[0].reshape(512, 7, 7, 512).transpose([1, 2, 0, 3]).reshape([-1, 512]), ww[1]],
    "pre_embedding": lambda ww: [np.zeros(512), *ww],
}
    input_shape=(112, 112),
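
The "E_dense" reshape above exists because PyTorch flattens the feature map in C, H, W order while Keras flattens it in H, W, C order. A small NumPy check of the same steps, using toy sizes instead of 512 / 7 / 7 (all sizes and variable names here are chosen for the example):

```python
import numpy as np

channels, height, width, units = 3, 2, 2, 4
rng = np.random.default_rng(0)

# Dense kernel after the generic [out, in] -> [in, out] transpose,
# with the input axis still flattened in PyTorch's C, H, W order.
w_chw = rng.normal(size=(channels * height * width, units))

# Same steps as the "E_dense" lambda: split the input axis, move C last, flatten again.
w_hwc = w_chw.reshape(channels, height, width, units).transpose([1, 2, 0, 3]).reshape([-1, units])

feature = rng.normal(size=(height, width, channels))          # one Keras-style H, W, C feature map
torch_out = feature.transpose(2, 0, 1).reshape(-1) @ w_chw    # flattened as C, H, W
keras_out = feature.reshape(-1) @ w_hwc                       # flattened as H, W, C
print(np.allclose(torch_out, keras_out))
```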
leondgarse commented 2 years ago

Your Keras_insightface/backbones/resnet.py probably hasn't been updated. With use_max_pool=True, stack_1_block_1 has no shortcut_conv, which is the resnet structure adaface uses. Please update your Keras_insightface/backbones/resnet.py.

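import numpy as np
import tensorflow as tf
import losses  # losses.py from the Keras_insightface repo

# Random one-hot labels and random predictions, with the feature norm appended as the last column
y_true = tf.one_hot(tf.random.uniform([32], 1, 10, dtype='int32'), 10)
y_pred = tf.random.uniform([32, 10])
y_pred_norm = tf.concat([y_pred, tf.norm(y_pred, axis=-1, keepdims=True)], axis=-1)
aa = losses.AdaFaceLoss()
print(aa(y_true, y_pred_norm))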

import numpy as np
import torch
import head  # head.py from the AdaFace PyTorch repo
from torch.nn import CrossEntropyLoss
bb = head.AdaFace(embedding_size=10, classnum=32)
cc = bb(torch.from_numpy(y_pred_norm[:, :-1].numpy()), torch.from_numpy(y_pred_norm[:, -1:].numpy()), torch.from_numpy(np.argmax(y_true, axis=-1)))
cross_entropy_loss = CrossEntropyLoss()
loss = cross_entropy_loss(cc, torch.from_numpy(np.argmax(y_true, axis=-1)))

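Is it because the original head starts with a randomly initialized fully connected layer that computes the cosine for the margin loss, whereas the keras loss receives inputs that have already been normalized upstream, so the two are not numerically comparable? When checking how well backprop is reproduced, is it usually enough that the results are roughly similar, agree with the paper, and converge, or should I find a way to control the values strictly?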

  • AdaFace head.py#L72 does safe_norms = safe_norms.clone().detach(). I think the entire margin computation should be placed inside tf.stop_gradient, which also matches the paper's statement that Gradient doesn't flow to ∥zi∥
norm_mean = tf.stop_gradient(tf.math.reduce_mean(feature_norm))
samples = tf.cast(tf.maximum(1, feature_norm.shape[0] - 1), feature_norm.dtype)
norm_std = tf.stop_gradient(tf.sqrt(tf.math.reduce_sum((feature_norm - norm_mean) ** 2) / samples))  # Torch std
self.batch_mean.assign(self.mean_std_alpha * norm_mean + (1.0 - self.mean_std_alpha) * self.batch_mean)
self.batch_std.assign(self.mean_std_alpha * norm_std + (1.0 - self.mean_std_alpha) * self.batch_std)
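
The "# Torch std" comment refers to torch.std using Bessel's correction (dividing by n - 1) by default. A NumPy sketch of the same formula and of the running-average update; the starting value and alpha are placeholders chosen for the example:

```python
import numpy as np

def torch_style_std(values):
    """Sample standard deviation with Bessel's correction, guarding against batch size 1."""
    samples = max(1, values.shape[0] - 1)
    return np.sqrt(np.sum((values - values.mean()) ** 2) / samples)

def ema_update(previous, current, alpha):
    """Blend a new batch statistic into the running value."""
    return alpha * current + (1.0 - alpha) * previous

feature_norm = np.array([18.0, 21.5, 25.0, 19.5, 30.0])
norm_std = torch_style_std(feature_norm)
print(np.isclose(norm_std, np.std(feature_norm, ddof=1)))

# Running std after one update, starting from an arbitrary previous value.
batch_std = ema_update(previous=100.0, current=norm_std, alpha=0.01)
print(batch_std)
```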

Is there anything specific that needs to change? They look about the same to me. Do the two frameworks handle stop gradient differently?

leondgarse commented 2 years ago


with torch.no_grad():
    mean = safe_norms.mean().detach()
    std = safe_norms.std().detach()
    self.batch_mean = mean * self.t_alpha + (1 - self.t_alpha) * self.batch_mean
    self.batch_std = std * self.t_alpha + (1 - self.t_alpha) * self.batch_std

margin_scaler = (safe_norms - self.batch_mean) / (self.batch_std+self.eps) # 66% between -1, 1
margin_scaler = margin_scaler * self.h # 68% between -0.333 ,0.333 when h:0.333
margin_scaler = torch.clip(margin_scaler, -1, 1)
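
To see what the last three lines do numerically, here is a NumPy sketch of the same standardize, scale, and clip steps. h=0.333 comes from the comment above; the eps value and the sample norms are assumptions made for the example:

```python
import numpy as np

def margin_scaler(norms, batch_mean, batch_std, h=0.333, eps=1e-3):
    """Standardize feature norms, scale by h, and clip to [-1, 1]."""
    scaled = (norms - batch_mean) / (batch_std + eps) * h
    return np.clip(scaled, -1.0, 1.0)

# Low, average, high, and extreme feature norms against a running mean of 20 and std of 10.
norms = np.array([5.0, 20.0, 35.0, 200.0])
result = margin_scaler(norms, batch_mean=20.0, batch_std=10.0)
print(result)
```

A norm equal to the running mean maps to 0, norms below the mean give a negative value, and extreme norms saturate at the clip bounds.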


leondgarse commented 2 years ago

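I'm not that familiar with pytorch. Based on some articles, such as Difference between detach().clone() and clone().detach(), I believe safe_norms = safe_norms.clone().detach() should be equivalent to putting every computation that involves safe_norm inside torch.no_grad. Using clone().detach() may simply be a more certain way of cutting off the gradient.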

leondgarse commented 2 years ago

This one explains it better: Detach, no_grad and requires_grad

leondgarse commented 2 years ago

The results so far look pretty good. r50 + SGD + AdaFace, 53 epochs:

import os
import losses, train, models
import tensorflow_addons as tfa
from tensorflow import keras

data_basic_path = '/datasets/ms1m-retinaface-t1'
data_path = data_basic_path + '_112x112_folders'
eval_paths = [os.path.join(data_basic_path, ii) for ii in ['lfw.bin', 'cfp_fp.bin', 'agedb_30.bin']]

basic_model = models.buildin_models('r50', dropout=0.4, emb_shape=512, output_layer='E', bn_momentum=0.9, bn_epsilon=1e-5, scale=True, use_bias=False, activation='prelu', use_max_pool=True)
basic_model = models.add_l2_regularizer_2_model(basic_model, weight_decay=5e-4, apply_to_batch_normal=False)

tt = train.Train(data_path, eval_paths=eval_paths,
    basic_model=basic_model, model=None, lr_base=0.1, lr_decay=0.5, lr_decay_steps=16, lr_min=1e-6, lr_warmup_steps=3,
    batch_size=512, random_status=100, eval_freq=4000, output_weight_decay=1)

# optimizer = tfa.optimizers.AdamW(learning_rate=1e-2, weight_decay=5e-4, exclude_from_weight_decay=["/gamma", "/beta"])
# optimizer = tfa.optimizers.SGDW(learning_rate=1e-2, weight_decay=5e-6, momentum=0.9, exclude_from_weight_decay=["/gamma", "/beta"])
optimizer = keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
sch = [
    {"loss": losses.AdaFaceLoss(scale=64), "epoch": 53, "optimizer": optimizer},
]
tt.train(sch, 0)
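
| r50_sgd_adaface | 1e-06 | 1e-05 | 0.0001 | 0.001 | 0.01 | 0.1 | AUC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| r50 IJBB | 0.393379 | 0.91334 | 0.955501 | 0.970204 | 0.978773 | 0.986465 | 0.993366 |
| r50 IJBC | 0.888633 | 0.952702 | 0.969269 | 0.979496 | 0.985734 | 0.991052 | 0.995485 |
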
Results from PyTorch training for 26 epochs:

| Arch | Dataset | Method | IJBB TAR@FAR=0.01% | IJBC TAR@FAR=0.01% |
| --- | --- | --- | --- | --- |
| R50 | WebFace4M | AdaFace | 95.44 | 96.98 |
| R50 | MS1MV2 | AdaFace | 94.82 | 96.27 |
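
For readers unfamiliar with the metric, here is a minimal sketch of how a TAR@FAR value can be computed from pair similarity scores. It assumes the threshold is set so that the given fraction of impostor pairs is accepted; the function name and the scores are illustrative, and this is not the IJB evaluation script:

```python
import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far):
    """True accept rate at the threshold that lets through a `far` fraction of impostor pairs."""
    impostor_sorted = np.sort(impostor_scores)[::-1]           # highest scores first
    num_accepted = int(round(far * len(impostor_sorted)))      # impostor pairs allowed through
    num_accepted = min(num_accepted, len(impostor_sorted) - 1)
    threshold = impostor_sorted[num_accepted]                  # accept scores strictly above this
    return float(np.mean(genuine_scores > threshold))

impostor = np.arange(100) / 100.0               # impostor pair scores 0.00 ... 0.99
genuine = np.array([0.95, 0.92, 0.60, 0.30])    # same-identity pair scores

tar = tar_at_far(genuine, impostor, far=0.1)
print(tar)
```

Note that FAR=0.01% in the PyTorch table corresponds to the 0.0001 column of the Keras results.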
leondgarse commented 2 years ago

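I took a quick look and it doesn't seem fast. For a single image, get_scaled_quality runs 100 forward passes and get_gradients then runs a backward pass, so it doesn't look easy to integrate with the current implementation.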

leondgarse commented 2 years ago

The Adaface + r100 training results should be uploaded in the next few days. The 53-epoch results are IJBB 0.961636, IJBC 0.972849, compared with IJBB 95.84, IJBC 97.09 for PyTorch at 26 epochs.

leondgarse commented 2 years ago

The r100 training results have been uploaded and can serve as a reference for training ghostnet.