google / aiyprojects-raspbian

API libraries, samples, and system images for AIY Projects (Voice Kit and Vision Kit)
https://aiyprojects.withgoogle.com/
Apache License 2.0

AIY Vision Kit compilation error (bonnet_model_compiler) #674

Open ricardodeazambuja opened 4 years ago

ricardodeazambuja commented 4 years ago

Hi,

I've trained the model you can see at the end of this comment and it works (I tested the frozen model with an image and everything was fine), but when I try to compile for the AIY I get this error message:

...
2020-02-27 19:59:27.446585: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: AddV2
2020-02-27 19:59:27.446601: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: Unpack
2020-02-27 19:59:27.446633: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: AddV2
2020-02-27 19:59:27.446644: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: AddV2
2020-02-27 19:59:27.446665: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: AddV2
2020-02-27 19:59:27.446678: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: AddV2
2020-02-27 19:59:27.461698: F external/org_tensorflow/tensorflow/contrib/lite/toco/tooling_util.cc:822] Check failed: d >= 1 (0 vs. 1)

I'm using these input arguments for the compiler:

  --input_tensor_name=image_tensor \
  --output_tensor_names=raw_detection_boxes \
  --input_tensor_size=256 \

I went through bonnet_model_compiler.par (unzipped it, modified the Python 2.7 code, etc.), but the piece of code that actually generates the error is the binary tool_a.bin (b'\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00...). Which part of my graph is generating the error? Is it the op that comes just after that AddV2? Is it too big? Is it too ugly? :)
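
One way to narrow this down without the compiler source is to tally which op types the log flags before the fatal check. The following is only a minimal sketch that assumes the log format shown above; `count_unsupported_ops` is a hypothetical helper name and is not part of bonnet_model_compiler. Lines marked `I` are informational and the line marked `F` is the fatal one, so the tally lists candidates rather than proving which op triggers the crash.

```python
import re
from collections import Counter

# Matches toco lines of the form "... Converting unsupported operation: AddV2".
_UNSUPPORTED = re.compile(r"Converting unsupported operation: (\w+)")


def count_unsupported_ops(log_text):
    """Return a Counter mapping op type -> number of times the log flags it."""
    return Counter(_UNSUPPORTED.findall(log_text))


# Shortened stand-in for the compiler output quoted in this issue.
sample_log = """\
... import_tensorflow.cc:1268] Converting unsupported operation: AddV2
... import_tensorflow.cc:1268] Converting unsupported operation: Unpack
... import_tensorflow.cc:1268] Converting unsupported operation: AddV2
... tooling_util.cc:822] Check failed: d >= 1 (0 vs. 1)
"""

for op, n in count_unsupported_ops(sample_log).most_common():
    print(op, n)
```

Feeding the full compiler output into the same function gives a short list of op types to look for in the frozen graph.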

Any help will be very much appreciated ;)

Cheers,

Ricardo

P.S. the config file for the model:

# SSDLite with Mobilenet v3 small feature extractor.
# Trained on VOC2012, initialized from scratch.

# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    num_classes: 20 # number of classes available on VOC2012
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 256 # I will be using picamera and this size is fast.
        width: 256
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 3
        use_depthwise: true
        box_code_size: 4
        apply_sigmoid_to_scores: false
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.97,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v3_small'
      min_depth: 16
      depth_multiplier: 0.125
      use_depthwise: true
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.97,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.75,
          gamma: 2.0
        }
      }
      localization_loss {
        weighted_smooth_l1 {
          delta: 1.0
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
        use_static_shapes: true
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 256 # if you are running out of memory, reduce this value.
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 32
  num_steps: 800000 # limits the training process to 800K steps. 
                    # Not sure which number would be the best here...
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
  data_augmentation_options {
    random_black_patches {
    }
  }
  data_augmentation_options {
    random_distort_color{
    }
  }
  data_augmentation_options {
    random_jitter_boxes{
    }
  } 
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: 0.4
          total_steps: 800000
          warmup_learning_rate: 0.13333
          warmup_steps: 2000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  max_number_of_boxes: 10 # it was 100, but I really don't want to detect 100 things at once
  unpad_groundtruth_tensors: false
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "pascal_train.record"
  }
  label_map_path: "object_detection/data/pascal_label_map.pbtxt"
  queue_capacity: 50 
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "pascal_val.record"
  }
  label_map_path: "object_detection/data/pascal_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}

ricardodeazambuja commented 4 years ago

Ok, if nobody from Google wants to help or simply doesn't have time to help, why don't you release the full source code for the bonnet_model_compiler?

0russ commented 4 years ago

Hello, I have the same problem when trying to generate a binaryproto file. Have you solved it yet? I would appreciate it if you could share your solution. Thank you very much! @ricardodeazambuja

ricardodeazambuja commented 4 years ago

Hi @0russ, I wouldn't say I really solved the problem, because my aim was to automatically replace the operations that make the compiler crash, but after many hours of testing I found a way to make it work. Basically, the problem seems to be connected to new operations added in TF > 1.14. I was able to train the model using TF 1.15.X, but I had to freeze it using TF <= 1.14.X (I can't remember the Xs now...). I'm creating a series of notebooks that go step by step from data collection and labelling through training to deployment on the AIY Vision Bonnet, but I haven't pushed them to GitHub yet. I will polish them and push them as a tutorial to a repo in this project https://github.com/thecognifly, probably in one or two weeks ;)
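
The version constraint described above can be turned into a small guard that runs before freezing. This is a sketch under the assumption that 1.14 really is the last working minor version, as reported in this comment; `MAX_FREEZE_VERSION`, `major_minor` and `ok_for_freezing` are hypothetical names used only for illustration.

```python
import re

# Assumption taken from this thread: freezing works with TF <= 1.14.X.
MAX_FREEZE_VERSION = (1, 14)


def major_minor(version):
    """Extract (major, minor) as ints from a string such as '1.14.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return int(match.group(1)), int(match.group(2))


def ok_for_freezing(version):
    """True if the given TF version is at or below the assumed cut-off."""
    return major_minor(version) <= MAX_FREEZE_VERSION


for v in ("1.13.1", "1.14.0", "1.15.2"):
    print(v, ok_for_freezing(v))
```

In the freezing environment the string to check would be `tensorflow.__version__`.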

0russ commented 4 years ago

I thought it might be because of hard swish, a new activation function proposed in mobilenet_v3. The Bonnet model compiler may not support this new activation function. I'll try to lower my TF version and train the model. Thanks again, and I'm looking forward to your tutorial! @ricardodeazambuja
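
For reference, hard swish as defined in the MobileNetV3 paper is x * ReLU6(x + 3) / 6. The scalar sketch below only illustrates that formula; whether the Bonnet compiler supports the ops this activation lowers to is not confirmed anywhere in this thread.

```python
def relu6(x):
    """ReLU capped at 6: clamp x to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def hard_swish(x):
    """Hard swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


for value in (-4.0, -3.0, 0.0, 1.0, 3.0, 5.0):
    print(value, hard_swish(value))
```

The function is zero for inputs at or below -3 and equals the identity for inputs at or above 3, with a smooth transition in between.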

ricardodeazambuja commented 4 years ago

Sorry @0russ, I meant the original MobileNet embed config file. V3 will not work because of the new ops it uses and also because of the final size (I think I compared the sizes and V3 was bigger even with depth_multiplier: 0.125, because the SSD anchor generator has more layers, etc. I can't remember exactly).

0russ commented 4 years ago

OK, I see. That's all right, thank you for the clarification. @ricardodeazambuja

martinsipka commented 4 years ago

@ricardodeazambuja Hello, I am facing similar problems. I am using the current version of the object detection code from GitHub and I tried TensorFlow 1.14.0 and 1.15.0. Version 1.15.0 produces even more errors. The output from 1.14.0 is below. Can you please tell me exactly what you used to make it work?

The series of notebook tutorials you mentioned might be very useful.

Thank you.

2020-10-18 11:58:52.493807: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: TensorArrayV3
2020-10-18 11:58:52.493918: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: TensorArrayScatterV3
2020-10-18 11:58:52.493942: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: TensorArrayV3
2020-10-18 11:58:52.493968: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: Enter
2020-10-18 11:58:52.493983: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: Enter
2020-10-18 11:58:52.493995: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: Enter
2020-10-18 11:58:52.494016: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: Enter
2020-10-18 11:58:52.494031: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: LogicalAnd
2020-10-18 11:58:52.494040: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: LoopCond
2020-10-18 11:58:52.494072: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: TensorArrayReadV3
2020-10-18 11:58:52.494085: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: Enter
2020-10-18 11:58:52.494096: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: Enter
2020-10-18 11:58:52.494124: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: TensorArrayWriteV3
2020-10-18 11:58:52.494138: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: Enter
2020-10-18 11:58:52.494164: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: Exit
2020-10-18 11:58:52.494176: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: TensorArraySizeV3
2020-10-18 11:58:52.494198: I external/org_tensorflow/tensorflow/contrib/lite/toco/import_tensorflow.cc:1268] Converting unsupported operation: TensorArrayGatherV3
2020-10-18 11:58:52.499753: F external/org_tensorflow/tensorflow/contrib/lite/toco/tooling_util.cc:822] Check failed: d >= 1 (0 vs. 1)
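
Most of the ops flagged in this log belong either to the TF1 while-loop machinery (`Enter`, `Exit`, `LoopCond`) or to `TensorArray`, which suggests the frozen graph still contains a loop. Where that loop comes from is not established in this thread. The sketch below simply groups the flagged op names; `group_ops` is a hypothetical helper and the grouping is an illustration, not a list of what the compiler supports.

```python
# Op types that TF1 emits for while_loop / cond control flow.
CONTROL_FLOW_OPS = {"Enter", "Exit", "Merge", "Switch", "NextIteration", "LoopCond"}


def group_ops(op_names):
    """Split op type names into control-flow, TensorArray and other groups."""
    groups = {"control_flow": [], "tensor_array": [], "other": []}
    for name in op_names:
        if name in CONTROL_FLOW_OPS:
            groups["control_flow"].append(name)
        elif name.startswith("TensorArray"):
            groups["tensor_array"].append(name)
        else:
            groups["other"].append(name)
    return groups


# Distinct op types flagged in the log above.
flagged = [
    "TensorArrayV3", "TensorArrayScatterV3", "Enter", "LogicalAnd",
    "LoopCond", "TensorArrayReadV3", "TensorArrayWriteV3", "Exit",
    "TensorArraySizeV3", "TensorArrayGatherV3",
]

for group, names in group_ops(flagged).items():
    print(group, names)
```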

ricardodeazambuja commented 3 years ago

I finally pushed all the notebooks and instructions here: https://github.com/thecognifly/AIYVisionKit_Utils