Yes, and that's the process I posted comparing with the baseline using `arcface` only. For `distiller_loss_euclidean` / `distiller_loss_cosine` losses.py#L374, the `alpha` value is not certain: as I changed `distill_loss`, a value of `128` now equals the previous `64`...
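For reference, here is a minimal sketch of what euclidean- and cosine-style embedding distillation losses typically compute, and how an `alpha` weight could combine one with the `arcface` term. The actual implementations in losses.py may differ in signature and details:

```py
import tensorflow as tf

# Sketch only: two common embedding-distillation terms. Not necessarily the
# exact code in losses.py.
def distill_euclidean(teacher_emb, student_emb):
    # Mean squared L2 distance between teacher and student embeddings.
    return tf.reduce_mean(tf.reduce_sum(tf.square(teacher_emb - student_emb), axis=-1))

def distill_cosine(teacher_emb, student_emb):
    # 1 - cosine similarity, after L2-normalizing both embeddings.
    tt = tf.nn.l2_normalize(teacher_emb, axis=-1)
    ss = tf.nn.l2_normalize(student_emb, axis=-1)
    return tf.reduce_mean(1.0 - tf.reduce_sum(tt * ss, axis=-1))

# How alpha would weight the distillation term against the classification loss:
# total_loss = arcface_loss + alpha * distill_cosine(teacher_emb, student_emb)
```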
...I'm now testing some other strategies: using `distiller_loss_euclidean` / `distiller_loss_cosine` only, which is already implemented in the code, but I'm not sure about the results; and distilling a teacher with `embedding shape == 512` into a student model with `embedding shape == 256`.
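One way to handle the 512 vs. 256 shape mismatch, purely as an assumption about how it could be wired rather than how this repo actually does it, is a trainable projection from the student space into the teacher space before applying the distillation loss:

```py
import tensorflow as tf

# Hypothetical projection head: lifts the 256-d student embedding to the
# teacher's 512-d space, so the embedding distillation losses still apply.
project = tf.keras.layers.Dense(512, use_bias=False, name="distill_projection")

def projected_cosine_distill(teacher_emb, student_emb):
    student_512 = project(student_emb)  # (batch, 512)
    tt = tf.nn.l2_normalize(teacher_emb, axis=-1)
    ss = tf.nn.l2_normalize(student_512, axis=-1)
    return tf.reduce_mean(1.0 - tf.reduce_sum(tt * ss, axis=-1))
```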
About the hard-sample mining: I tested `arcface` + `distiller_loss_cosine`, using a dataset mined by the teacher model, in a way that drops the images that don't fit well within their own class, but the result on CASIA is not improving:
```py
import numpy as np
from tqdm import tqdm
from sklearn.preprocessing import normalize

def pick_by_emb_dists(min_dist, image_classes, embeddings):
    picks = np.zeros_like(image_classes).astype('bool')
    for pick_class in tqdm(np.unique(image_classes)):
        # L2-normalize this class's embeddings, then take pairwise cosine similarities.
        class_emb = normalize(embeddings[image_classes == pick_class])
        dists = np.dot(class_emb, class_emb.T)
        # Anchor on the image with the most within-class neighbors above the threshold.
        base_idx = np.sum(dists > min_dist, axis=-1).argmax()
        base_dist = dists[base_idx]
        # Keep only the images similar enough to that anchor.
        picks[image_classes == pick_class] = base_dist > min_dist
        # print(pick_class, base_idx, base_dist.min(), base_dist.max(), np.sum(base_dist > min_dist), "/", class_emb.shape[0])
    print("Picks left:", picks.sum(), "/", picks.shape[0])
    return picks

aa = np.load("faces_casia_112x112_folders_shuffle_label_embs_512.npz")
image_names, image_classes, embeddings = aa["image_names"], aa["image_classes"], aa["embeddings"]
picks = pick_by_emb_dists(0.3, image_classes, embeddings)
np.savez("a_new_dataset_name", image_names=image_names[picks], image_classes=image_classes[picks], embeddings=embeddings[picks])
```
Another strategy is `distill_loss` with `triplet` loss, if `teacher_model_interf` is provided train.py#L40. It will extract the teacher embedding data online data.py#L120, but the efficiency is not tested...

Thanks so much for your information. For dropout, you mean you are using it for the student model, right? For distillation, have you tried a logits consistency loss? I mean we also constrain the predictions of the teacher and student to be the same.
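The logits consistency loss being asked about here is usually the classic Hinton-style knowledge-distillation term: a KL divergence between temperature-softened teacher and student class distributions. A minimal sketch, with an illustrative temperature value:

```py
import tensorflow as tf

def logits_consistency_loss(teacher_logits, student_logits, temperature=4.0):
    # Soften both distributions with the temperature, then take KL(teacher || student).
    t_prob = tf.nn.softmax(teacher_logits / temperature, axis=-1)
    s_logprob = tf.nn.log_softmax(student_logits / temperature, axis=-1)
    kl = tf.reduce_sum(t_prob * (tf.math.log(t_prob + 1e-8) - s_logprob), axis=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return tf.reduce_mean(kl) * temperature ** 2
```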
The `MXNet` / `pytorch` pretrained models don't contain the output `fc7` layer, so the teacher logits needed for that are not available.

This figure shows some of my results, using the `MXNet r100` as teacher model, training `mobilenet` on the CASIA dataset, with different optimizers `SGDW` / `AdamW` + different losses, detailed in the labels.
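`SGDW` / `AdamW` are the decoupled weight-decay optimizer variants from Loshchilov & Hutter. Assuming the TensorFlow Addons implementations (the repo may define its own), instantiation looks roughly like this, with illustrative hyperparameters:

```py
import tensorflow_addons as tfa

# Decoupled weight-decay optimizers; the values here are illustrative only.
opt_sgdw = tfa.optimizers.SGDW(weight_decay=5e-4, learning_rate=0.1, momentum=0.9)
opt_adamw = tfa.optimizers.AdamW(weight_decay=5e-4, learning_rate=1e-3)
```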
Nice work on the distillation. It improved by a big margin compared with the baseline. To be clear about what you are doing for distillation, could I ask some questions about the pipeline?

The pipeline that I understand is like:

Am I missing any step in your distillation? Have you used hard-sample mining for the distillation loss?