Yes, and that's the process I posted comparing with the baseline using `arcface` only. For `distiller_loss_euclidean` / `distiller_loss_cosine` losses.py#L374, the `alpha` value is not certain: as I changed `distill_loss`, a value of `128` now equals the previous `64`...
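For reference, here is a minimal sketch of what euclidean- and cosine-style embedding distillation losses typically compute, and how an `alpha` weight could combine one with the `arcface` term. The actual implementations in losses.py may differ in signature and details:

```py
import tensorflow as tf

# Sketch only: two common embedding-distillation terms. Not necessarily the
# exact code in losses.py.
def distill_euclidean(teacher_emb, student_emb):
    # Mean squared L2 distance between teacher and student embeddings.
    return tf.reduce_mean(tf.reduce_sum(tf.square(teacher_emb - student_emb), axis=-1))

def distill_cosine(teacher_emb, student_emb):
    # 1 - cosine similarity, after L2-normalizing both embeddings.
    tt = tf.nn.l2_normalize(teacher_emb, axis=-1)
    ss = tf.nn.l2_normalize(student_emb, axis=-1)
    return tf.reduce_mean(1.0 - tf.reduce_sum(tt * ss, axis=-1))

# How alpha would weight the distillation term against the classification loss:
# total_loss = arcface_loss + alpha * distill_cosine(teacher_emb, student_emb)
```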
...I'm now testing some other strategies: using `distiller_loss_euclidean` / `distiller_loss_cosine` only, which is already implemented in the code, but I'm not sure about the results; and distilling a teacher with `embedding shape == 512` into a student model with `embedding shape == 256`.
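One way to handle the 512 vs. 256 shape mismatch, purely as an assumption about how it could be wired rather than how this repo actually does it, is a trainable projection from the student space into the teacher space before applying the distillation loss:

```py
import tensorflow as tf

# Hypothetical projection head: lifts the 256-d student embedding to the
# teacher's 512-d space, so the embedding distillation losses still apply.
project = tf.keras.layers.Dense(512, use_bias=False, name="distill_projection")

def projected_cosine_distill(teacher_emb, student_emb):
    student_512 = project(student_emb)  # (batch, 512)
    tt = tf.nn.l2_normalize(teacher_emb, axis=-1)
    ss = tf.nn.l2_normalize(student_512, axis=-1)
    return tf.reduce_mean(1.0 - tf.reduce_sum(tt * ss, axis=-1))
```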
About the hard-sample mining: I tested `arcface` + `distiller_loss_cosine`, using a dataset mined by the teacher model, in a way that drops the images that don't fit well within their own class, but the result on CASIA is not improving:
```py
import numpy as np
from tqdm import tqdm
from sklearn.preprocessing import normalize

def pick_by_emb_dists(min_dist, image_classes, embeddings):
    picks = np.zeros_like(image_classes).astype('bool')
    for pick_class in tqdm(np.unique(image_classes)):
        # L2-normalize this class's embeddings, then take pairwise cosine similarities.
        class_emb = normalize(embeddings[image_classes == pick_class])
        dists = np.dot(class_emb, class_emb.T)
        # Anchor on the image with the most within-class neighbors above the threshold.
        base_idx = np.sum(dists > min_dist, axis=-1).argmax()
        base_dist = dists[base_idx]
        # Keep only the images similar enough to that anchor.
        picks[image_classes == pick_class] = base_dist > min_dist
        # print(pick_class, base_idx, base_dist.min(), base_dist.max(), np.sum(base_dist > min_dist), "/", class_emb.shape[0])
    print("Picks left:", picks.sum(), "/", picks.shape[0])
    return picks

aa = np.load("faces_casia_112x112_folders_shuffle_label_embs_512.npz")
image_names, image_classes, embeddings = aa["image_names"], aa["image_classes"], aa["embeddings"]
picks = pick_by_emb_dists(0.3, image_classes, embeddings)
np.savez("a_new_dataset_name", image_names=image_names[picks], image_classes=image_classes[picks], embeddings=embeddings[picks])
```
Another strategy is `distill_loss` with `triplet` loss, if `teacher_model_interf` is provided train.py#L40. It will extract the teacher embedding data online data.py#L120, but the efficiency is not tested...

Thanks so much for your information. For dropout, you mean you are using it for the student model, right? For distillation, have you tried a logits consistency loss? I mean we also constrain the predictions of the teacher and student to be the same.
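The logits consistency loss being asked about here is usually the classic Hinton-style knowledge-distillation term: a KL divergence between temperature-softened teacher and student class distributions. A minimal sketch, with an illustrative temperature value:

```py
import tensorflow as tf

def logits_consistency_loss(teacher_logits, student_logits, temperature=4.0):
    # Soften both distributions with the temperature, then take KL(teacher || student).
    t_prob = tf.nn.softmax(teacher_logits / temperature, axis=-1)
    s_logprob = tf.nn.log_softmax(student_logits / temperature, axis=-1)
    kl = tf.reduce_sum(t_prob * (tf.math.log(t_prob + 1e-8) - s_logprob), axis=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return tf.reduce_mean(kl) * temperature ** 2
```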
The `MXNet` / `pytorch` pretrained models don't contain the output `fc7` layer, so the teacher logits needed for that are not available.

This figure shows some of my results, using the `MXNet r100` as teacher model, training `mobilenet` on the CASIA dataset, with different optimizers `SGDW` / `AdamW` + different losses, detailed in the labels.
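`SGDW` / `AdamW` are the decoupled weight-decay optimizer variants from Loshchilov & Hutter. Assuming the TensorFlow Addons implementations (the repo may define its own), instantiation looks roughly like this, with illustrative hyperparameters:

```py
import tensorflow_addons as tfa

# Decoupled weight-decay optimizers; the values here are illustrative only.
opt_sgdw = tfa.optimizers.SGDW(weight_decay=5e-4, learning_rate=0.1, momentum=0.9)
opt_adamw = tfa.optimizers.AdamW(weight_decay=5e-4, learning_rate=1e-3)
```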
Nice work on the distillation. It improved by a big margin compared with the baseline. To be clear about what you are doing for distillation, could I ask some questions about the pipeline?

The pipeline that I understand is like:

Am I missing any step in your distillation? Have you used hard-sample mining for the distillation loss?