How was faced trained on the WIDER FACE dataset and how was the dataset prepared?
Open arkeys opened 6 years ago
Hi @arkeys
I will try to upload all the training scripts soon. However, let me explain briefly how I trained faced:
1°) The WIDER FACE dataset was transformed to tfrecord format using this repo.
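For reading those records back at training time, something like the following works. Note that the feature keys and filename here are assumptions about how that conversion repo serializes examples, not taken from the source:

```python
import tensorflow as tf

# Hypothetical feature spec -- the actual keys depend on the conversion repo.
feature_spec = {
    'image/encoded': tf.io.FixedLenFeature([], tf.string),
    'image/object/bbox/xmin': tf.io.VarLenFeature(tf.float32),
    'image/object/bbox/xmax': tf.io.VarLenFeature(tf.float32),
    'image/object/bbox/ymin': tf.io.VarLenFeature(tf.float32),
    'image/object/bbox/ymax': tf.io.VarLenFeature(tf.float32),
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_jpeg(parsed['image/encoded'], channels=3)
    # One row per face: [x0, x1, y0, y1], normalized to the image size.
    boxes = tf.stack([tf.sparse.to_dense(parsed['image/object/bbox/%s' % k])
                      for k in ('xmin', 'xmax', 'ymin', 'ymax')], axis=-1)
    return image, boxes

dataset = tf.data.TFRecordDataset('widerface_train.tfrecord').map(parse_example)
```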
2°) For each example (which consisted of an image with several [x0, x1, y0, y1] bounding boxes), I prepared the following:
- `object_matrix`: a 9x9 matrix containing 1 if the cell contains the center point of a face, 0 otherwise.
- `center_x_matrix`: a 9x9 matrix containing the cell-relative value of the x center of a face in those cells containing the center of a face, 0 otherwise. For example, if the center of a face is at {x: 0.5, y: 0.5} (normalized to the image size), then `center_x_matrix[4][4]` would contain 0.5. Recall that all values in this matrix are relative to the cell.
- `center_y_matrix`: a 9x9 matrix containing the cell-relative value of the y center of a face in those cells containing the center of a face, 0 otherwise.
- `w_matrix`: a 9x9 matrix containing the normalized width of a face in those cells containing the center of a face, 0 otherwise.
- `h_matrix`: a 9x9 matrix containing the normalized height of a face in those cells containing the center of a face, 0 otherwise.

These preparation steps are similar to the ones YOLO performs. You can read more about these steps here.
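To make step 2°) concrete, here is a minimal NumPy sketch of building those five target matrices; the helper name and the `min` clamp are my own choices, assuming boxes are normalized [x0, x1, y0, y1]:

```python
import numpy as np

GRID = 9

def make_targets(boxes):
    """boxes: list of normalized [x0, x1, y0, y1] face boxes for one image."""
    object_matrix = np.zeros((GRID, GRID), dtype=np.float32)
    center_x_matrix = np.zeros((GRID, GRID), dtype=np.float32)
    center_y_matrix = np.zeros((GRID, GRID), dtype=np.float32)
    w_matrix = np.zeros((GRID, GRID), dtype=np.float32)
    h_matrix = np.zeros((GRID, GRID), dtype=np.float32)
    for x0, x1, y0, y1 in boxes:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2       # center, image-relative
        col = min(int(cx * GRID), GRID - 1)          # cell holding the center
        row = min(int(cy * GRID), GRID - 1)
        object_matrix[row][col] = 1.0
        center_x_matrix[row][col] = cx * GRID - col  # relative to the cell
        center_y_matrix[row][col] = cy * GRID - row
        w_matrix[row][col] = x1 - x0                 # normalized to the image
        h_matrix[row][col] = y1 - y0
    return object_matrix, center_x_matrix, center_y_matrix, w_matrix, h_matrix
```

For the {x: 0.5, y: 0.5} example above, this puts 0.5 in `center_x_matrix[4][4]`, as described.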
3°) Horizontal flips of all examples were done to augment the dataset.
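Since the boxes are normalized, the flip only remaps the x coordinates; a small sketch (my own helper, same box convention as above):

```python
def horizontal_flip(image, boxes):
    """Mirror an image and its normalized [x0, x1, y0, y1] boxes."""
    flipped = image[:, ::-1, :]  # reverse the width axis of an (H, W, C) array
    # After mirroring, x0 and x1 swap roles: new_x0 = 1 - old_x1, new_x1 = 1 - old_x0.
    flipped_boxes = [[1 - x1, 1 - x0, y0, y1] for x0, x1, y0, y1 in boxes]
    return flipped, flipped_boxes
```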
4°) For each forward pass, a 9x9x5 tensor was obtained (after a series of convolutions/pooling/dropout/BN layers):
- The first channel is compared against the `object_matrix`. This is `loss1`.
- The next two channels are passed through a `sigmoid` (to normalize them) and an L2 loss is computed against the `center_x_matrix` and the `center_y_matrix`. All the cell values that are 0 because they do not contain a face are ignored in the L2 loss. This is `loss2`.
- The last two channels are passed through a `sigmoid` (to normalize them) and an L2 loss is computed against the `w_matrix` and the `h_matrix`. All the cell values that are 0 because they do not contain a face are ignored in the L2 loss. This is `loss3`.
- The total loss is `loss1 + loss2 + loss3`.
5°) Training was done using the Adam optimizer with a fixed learning rate of 5e-5.
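A minimal sketch of steps 4°) and 5°), assuming the network output `pred` is a 9x9x5 tensor ordered as [objectness, cx, cy, w, h]; the ordering and names are assumptions, and since the comparison used for `loss1` is not spelled out above, sigmoid cross-entropy is used here as a natural choice for a 0/1 target:

```python
import tensorflow as tf

def detection_loss(pred, object_matrix, center_x, center_y, w_mat, h_mat):
    # loss1: objectness vs. the 0/1 object_matrix (sigmoid cross-entropy).
    loss1 = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(
        labels=object_matrix, logits=pred[..., 0]))
    mask = object_matrix  # zero cells (no face) are ignored in the L2 losses
    # loss2: L2 on the sigmoid-normalized, cell-relative centers.
    centers = tf.sigmoid(pred[..., 1:3])
    loss2 = tf.reduce_sum(mask * (tf.square(centers[..., 0] - center_x) +
                                  tf.square(centers[..., 1] - center_y)))
    # loss3: L2 on the sigmoid-normalized width/height.
    sizes = tf.sigmoid(pred[..., 3:5])
    loss3 = tf.reduce_sum(mask * (tf.square(sizes[..., 0] - w_mat) +
                                  tf.square(sizes[..., 1] - h_mat)))
    return loss1 + loss2 + loss3

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)  # fixed LR, as in 5°)
```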
For more information about the architecture, you can read my recent post here.
You should be able to repeat this training process with any object class you wish to perform object detection with.
Hope this helps. Feel free to continue this discussion if you have any doubts.
Thanks! I need the training scripts for learning too. +1
Hi @iitzco Thank you for sharing the code. I just want to ask you about the last fully connected layer in the Auxiliary network: it has five cells corresponding to (P, x, y, w, h), right?
Can you explain more about this network? I want to do fine-tuning starting from your model.
Hi,
Thanks for writing this up. Can this model also be used to return features, or does it only detect faces and return bounding boxes?
Thanks!
I too am interested in training this; it is just what we need for custom object detection. Generic multi-class object detectors like YOLO are great for object detection competitions, but in the real world I care about detecting one or a few custom objects and doing it fast on a CPU. Stuff like this is needed to make YOLO-type algorithms more than a toy and into a real tool.
Hi @iitzco, if there are any additional scripts you might have to set up training, I would appreciate them. Thanks for the example - looking forward to training on other "few" object sets for fun (for me I'm looking for hands and heads...)
@iitzco A little help here? The training process "freezes" when using dropout. What should I do?
I also use an arbitrary `cls` channel to store one class:
import numpy as np
import tensorflow as tf

# The original snippet did not define these; the values below are inferred:
# five 2x2 max-poolings reduce 288 -> 9 (the 9x9 grid), and six output
# channels imply nb_boxes * 5 + 1 with nb_boxes = 1.
img_shape = 288
grid_shape = 9
nb_boxes = 1

# Per-cell (x, y) offsets added to the predicted centers in custom_loss;
# shape (grid*grid, 1, 2) broadcasts against pred_boxes[..., 0:2].
grid = np.array([[[float(c % grid_shape), float(c // grid_shape)]]
                 for c in range(grid_shape * grid_shape)], dtype=np.float32)

# Backbone: 288x288x3 input; five conv/pool blocks reduce it to 9x9x128.
i = tf.keras.layers.Input(shape=(img_shape, img_shape, 3))
x = tf.keras.layers.Conv2D(8, (9, 9), padding='same')(i)
x = tf.keras.layers.LeakyReLU(alpha=0.3)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Conv2D(8, (1, 1), padding='same')(x)
x = tf.keras.layers.LeakyReLU(alpha=0.3)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), padding='valid')(x)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Conv2D(16, (9, 9), padding='same')(x)
x = tf.keras.layers.LeakyReLU(alpha=0.3)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Conv2D(16, (1, 1), padding='same')(x)
x = tf.keras.layers.LeakyReLU(alpha=0.3)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), padding='valid')(x)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Conv2D(32, (9, 9), padding='same')(x)
x = tf.keras.layers.LeakyReLU(alpha=0.3)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Conv2D(32, (1, 1), padding='same')(x)
x = tf.keras.layers.LeakyReLU(alpha=0.3)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), padding='valid')(x)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Conv2D(64, (9, 9), padding='same')(x)
x = tf.keras.layers.LeakyReLU(alpha=0.3)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Conv2D(64, (1, 1), padding='same')(x)
x = tf.keras.layers.LeakyReLU(alpha=0.3)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), padding='valid')(x)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Conv2D(128, (9, 9), padding='same')(x)
x = tf.keras.layers.LeakyReLU(alpha=0.3)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Conv2D(128, (1, 1), padding='same')(x)
x = tf.keras.layers.LeakyReLU(alpha=0.3)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), padding='valid')(x)
x = tf.keras.layers.Dropout(0.5)(x)
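# Four parallel branches (cx1..cx4) feed the four prediction heads below.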
cx1 = tf.keras.layers.Conv2D(192, (9, 9), padding='same')(x)
cx1 = tf.keras.layers.LeakyReLU(alpha=0.3)(cx1)
cx1 = tf.keras.layers.BatchNormalization()(cx1)
cx1 = tf.keras.layers.Dropout(0.5)(cx1)
cx1 = tf.keras.layers.Conv2D(192, (1, 1), padding='same')(cx1)
cx1 = tf.keras.layers.LeakyReLU(alpha=0.3)(cx1)
cx1 = tf.keras.layers.BatchNormalization()(cx1)
cx1 = tf.keras.layers.Dropout(0.5)(cx1)
cx2 = tf.keras.layers.Conv2D(192, (9, 9), padding='same')(x)
cx2 = tf.keras.layers.LeakyReLU(alpha=0.3)(cx2)
cx2 = tf.keras.layers.BatchNormalization()(cx2)
cx2 = tf.keras.layers.Dropout(0.5)(cx2)
cx2 = tf.keras.layers.Conv2D(192, (1, 1), padding='same')(cx2)
cx2 = tf.keras.layers.LeakyReLU(alpha=0.3)(cx2)
cx2 = tf.keras.layers.BatchNormalization()(cx2)
cx2 = tf.keras.layers.Dropout(0.5)(cx2)
cx3 = tf.keras.layers.Conv2D(192, (9, 9), padding='same')(x)
cx3 = tf.keras.layers.LeakyReLU(alpha=0.3)(cx3)
cx3 = tf.keras.layers.BatchNormalization()(cx3)
cx3 = tf.keras.layers.Dropout(0.5)(cx3)
cx3 = tf.keras.layers.Conv2D(192, (1, 1), padding='same')(cx3)
cx3 = tf.keras.layers.LeakyReLU(alpha=0.3)(cx3)
cx3 = tf.keras.layers.BatchNormalization()(cx3)
cx3 = tf.keras.layers.Dropout(0.5)(cx3)
cx4 = tf.keras.layers.Conv2D(192, (9, 9), padding='same')(x)
cx4 = tf.keras.layers.LeakyReLU(alpha=0.3)(cx4)
cx4 = tf.keras.layers.BatchNormalization()(cx4)
cx4 = tf.keras.layers.Dropout(0.5)(cx4)
cx4 = tf.keras.layers.Conv2D(192, (1, 1), padding='same')(cx4)
cx4 = tf.keras.layers.LeakyReLU(alpha=0.3)(cx4)
cx4 = tf.keras.layers.BatchNormalization()(cx4)
cx4 = tf.keras.layers.Dropout(0.5)(cx4)
# Output heads: 1x1 convolutions with sigmoid activations. The channel
# dimension is kept so that, after the Reshape, each grid cell's six values
# stay contiguous in the order custom_loss expects: [cls, x, y, w, h, conf].
cls = tf.keras.layers.Conv2D(1, (1, 1), padding='same', activation='sigmoid')(cx1)
prob = tf.keras.layers.Conv2D(1, (1, 1), padding='same', activation='sigmoid')(cx2)
xy_center = tf.keras.layers.Conv2D(2, (1, 1), padding='same', activation='sigmoid')(cx3)
wh = tf.keras.layers.Conv2D(2, (1, 1), padding='same', activation='sigmoid')(cx4)
x = tf.keras.layers.concatenate([cls, xy_center, wh, prob])
x = tf.keras.layers.Reshape((grid_shape * grid_shape, (nb_boxes * 5 + 1)))(x)
@tf.function
def custom_loss(y_true, y_pred):
    K = tf.keras.backend
    # Constant cell offsets, created at trace time so they live in the right graph.
    grid_tensor = K.constant(grid)
    # # make sure all values are positive?
    # y_pred = K.abs(y_pred)
    # y_true = K.abs(y_true)
    y_true_class = y_true[..., 0:1]
    y_pred_class = y_pred[..., 0:1]
    pred_boxes = K.reshape(y_pred[..., 1:], (-1, grid_shape * grid_shape, nb_boxes, 5))
    true_boxes = K.reshape(y_true[..., 1:], (-1, grid_shape * grid_shape, nb_boxes, 5))
    # Predicted centers are cell-relative; adding the grid offsets makes them
    # absolute grid coordinates (targets must be encoded the same way).
    y_pred_xy = pred_boxes[..., 0:2] + grid_tensor
    y_pred_wh = pred_boxes[..., 2:4]
    y_pred_conf = pred_boxes[..., 4]
    y_true_xy = true_boxes[..., 0:2]
    y_true_wh = true_boxes[..., 2:4]
    y_true_conf = true_boxes[..., 4]
    clss_loss = K.sum(K.square(y_true_class - y_pred_class), axis=-1)
    xy_loss = K.sum(K.sum(K.square(y_true_xy - y_pred_xy), axis=-1) * y_true_conf, axis=-1)
    wh_loss = K.sum(K.sum(K.square(K.sqrt(y_true_wh) - K.sqrt(y_pred_wh)), axis=-1) * y_true_conf, axis=-1)
    # Adding the confidence term lowers box quality slightly but yields an
    # estimate of how good each predicted box is; training is a bit unstable.
    # Per-axis overlap from the center distance: (w_pred + w_true) / 2 - |dx|, clipped at 0.
    intersect_wh = K.maximum(K.zeros_like(y_pred_wh), (y_pred_wh + y_true_wh) / 2 - K.abs(y_pred_xy - y_true_xy))
    intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
    true_area = y_true_wh[..., 0] * y_true_wh[..., 1]
    pred_area = y_pred_wh[..., 0] * y_pred_wh[..., 1]
    union_area = pred_area + true_area - intersect_area
    # Epsilon avoids 0/0 -> NaN in cells with no face (NaN * 0 is still NaN).
    iou = intersect_area / (union_area + K.epsilon())
    conf_loss = K.sum(K.square(y_true_conf * iou - y_pred_conf) * y_true_conf, axis=-1)
    return clss_loss + xy_loss + wh_loss + conf_loss

model = tf.keras.Model(inputs=i, outputs=x)
adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
model.compile(loss=custom_loss, optimizer=adam, metrics=['accuracy'])
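In case it helps anyone reproducing this, here is a hedged sketch of turning the model's (81, 6) output back into pixel boxes; the threshold, the `c * conf` score, and the function name are my own choices, not from the snippet above:

```python
import numpy as np

def decode_predictions(pred, img_size=288, threshold=0.5):
    """pred: one (grid_shape * grid_shape, 6) prediction, row-major over the grid."""
    boxes = []
    for idx, (c, cx, cy, w, h, conf) in enumerate(pred):
        if c * conf < threshold:
            continue
        row, col = divmod(idx, grid_shape)
        px = (col + cx) / grid_shape * img_size  # cell-relative center -> pixels
        py = (row + cy) / grid_shape * img_size
        pw, ph = w * img_size, h * img_size
        boxes.append((px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2, c * conf))
    return boxes

# Hypothetical training call -- X: (N, 288, 288, 3) images, Y: (N, 81, 6) targets:
# model.fit(X, Y, batch_size=16, epochs=50)
```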