sonos / tract

Tiny, no-nonsense, self-contained, Tensorflow and ONNX inference

MobileNet ops not supported #83

Closed: ehsanmok closed this issue 5 years ago

ehsanmok commented 5 years ago

Hi

I wanted to run the pretrained frozen .pb models from mobilenetv1 and mobilenetv2 with

let tfd = ::tract_tensorflow::tensorflow().model_for_path(mobilenetv1_frozen).unwrap();
let plan = ::tract::SimplePlan::new(&tfd).unwrap();
let input = load_image(img);
let outputs = plan.run(tvec![input]).unwrap();

But for MobilenetV1 I get

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: TractError(Msg("Evaluating #13 \"MobilenetV1/MobilenetV1/Conv2d_0/Relu6\" Unimplemented(Relu6): unimplemented operation: Relu6"), State { next_error: None, backtrace: InternalBacktrace })', src/libcore/result.rs:997:5

and for MobilenetV2

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: TractError(Msg("Node named MobilenetV2/Conv/BatchNorm/FusedBatchNorm not found"), State { next_error: None, backtrace: InternalBacktrace })', src/libcore/result.rs:997:5

Any plan to support Relu6 or FusedBatchNorm? Would you be willing to point me to where I can add those?

kali commented 5 years ago

Thanks for your interest in tract!

No big surprise here: tensorflow support is on a "per-application" basis, as there is no way tract will support tensorflow entirely.

ehsanmok commented 5 years ago

Thanks for the tips!

For MobilenetV1, I added Relu6, but another error popped up:

Evaluating #16 \"MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise\" Unimplemented(DepthwiseConv2dNative): unimplemented operation: DepthwiseConv2dNative")
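(Relu6 itself is trivial: it just clamps activations to the range [0, 6]. Something along these lines, as a plain-Rust sketch rather than tract's actual op interface:

// relu6(x) = min(max(x, 0), 6), applied element-wise
fn relu6(xs: &mut [f32]) {
    for x in xs.iter_mut() {
        *x = x.max(0.0).min(6.0);
    }
}
)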

Depthwise conv is the main op in both MobilenetV1/2.

For MobilenetV2, it seems there's no FusedBatchNorm defined in core/src/ops/nn/. There are BatchNorm and FixedBatchNorm. Basically, it needs to be handled separately as conv2d + bn, I think.
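For what it's worth, at inference time a (Fused)BatchNorm node reduces to a per-channel affine transform, so whichever op ends up handling it only needs a precomputed scale and bias. A rough, hypothetical sketch of the folding in plain Rust (not tract's op API):

// y = gamma * (x - mean) / sqrt(var + eps) + beta
//   = scale * x + bias, with scale = gamma / sqrt(var + eps)
//                         and bias = beta - mean * scale
fn fold_batch_norm(
    gamma: &[f32], beta: &[f32], mean: &[f32], var: &[f32], eps: f32,
) -> (Vec<f32>, Vec<f32>) {
    let scale: Vec<f32> = gamma.iter().zip(var).map(|(g, v)| g / (v + eps).sqrt()).collect();
    let bias: Vec<f32> = beta
        .iter()
        .zip(mean)
        .zip(&scale)
        .map(|((b, m), s)| b - m * s)
        .collect();
    (scale, bias)
}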

kali commented 5 years ago

Aha :) That one will be a bit more challenging than Relu6... We need a separate implementation for dw conv2d, and I agree it would make sense to have it. I would love to add mobilenet 1 and 2 to the supported networks and get one more opportunity to compare to TFLite and whatever.

I may be able to work on this in a few weeks. I'm a bit deep in rnn right now... But I will help if you feel like giving it a shot in the meantime.

The BatchNorm thing seems easier to deal with (probably a matter of reorganising some operators to the one in core) but I don't think it will bring us much as long as we don't have the DW conv.

kali commented 5 years ago

Hey @ehsanmok, you may want to give the mobilenet branch a shot. See #89. I think the network works. The performance is sub-par for now; I need to plug the new depthwise operators into the optimized convolution backends, and I will try to do that soon-ish.

kali commented 5 years ago

(that was for v1, there is still the weird FusedBatchNorm issue with mobilenet v2)

kali commented 5 years ago

Also fixed the FusedBatchNorm issue. As I expected, it was not a missing-op problem: MobileNet strangely declares its nodes out of order. First network I've seen like that. So now v1 and v2 are both working correctly, without full optimisation for now.

ehsanmok commented 5 years ago

Hi @kali, great, thanks! Tried the patch and it works :)

Even knowing it's not optimized yet, V2 is slightly slower than V1 (which shouldn't be the case AFAIK).

Looking forward to seeing on par performance with TFLite.

kali commented 5 years ago

@ehsanmok just wanted to let you know that I merged #92. It plugs the dwconv into the regular convolution backend. Performance is better, but still not at the level I'd like: the backend will need a bit of work to handle the specific kernel sizes and channel counts induced by depthwise convolutions more efficiently. Anyway, if you're using it, it may be worth bumping to the top of the tree.

kali commented 5 years ago

don't rush it, there is a bug.

kali commented 5 years ago

nailed it.

ehsanmok commented 5 years ago

Thanks for letting me know! I tried the optimized one based on the tf example (+ release mode), and it was slower than the unoptimized one. I'm afraid it wasn't a complete benchmark on the aarch64-linux-gnu toolchain though. I assume, based on the signature, that the optimization pre-allocates stuff and maybe more?

kali commented 5 years ago

You tried master, right? Not gemm-for-1-m? That one is not ready for prime time.

You're saying that with tract compiled in release, running the unoptimized network is faster than running the optimized one?

ehsanmok commented 5 years ago

Yes, I got it from master. I didn't notice any optimization benefit from using tfd.into_optimized() in the tf MobilenetV2 example. Is there anything else I should be doing?

Here's the snippet I ran, without loading the labels:

// load the frozen graph
let mut tfd = ::tract_tensorflow::tensorflow().model_for_path(mobilenet_v2()).unwrap();
// declare the input shape so the optimizer can specialize the graph
tfd.set_input_fact(0, TensorFact::dt_shape(f32::datum_type(), &[1, 224, 224, 3])).unwrap();
// run the optimization passes
let tfd = tfd.into_optimized().unwrap();
// build an execution plan and run it on a preprocessed image
let plan = SimplePlan::new(&tfd).unwrap();
let input = load_image(input_image());
let outputs = plan.run(tvec![input]).unwrap();

ehsanmok commented 5 years ago

Btw may I ask how the assemblies in linalg were made and used?

kali commented 5 years ago

Thanks for the clarification. I'll have a look.

kali commented 5 years ago

They were made with a lot of love, and they are used, when possible, for direct convolution or for matrix multiplication. The general idea is that a convolution is usually translated to an im2col + matmul (and so will use the smm kernels), but for some valid 1D and 2D cases, it is possible to use the direct convolution ("sconv") kernel instead.
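To make the im2col idea concrete, here is a rough sketch of the simplest case (one input channel, stride 1, no padding); the function name and layout are just for illustration, not what linalg actually does. Each kh*kw patch of the input becomes one column, so the convolution turns into a (1 x kh*kw) by (kh*kw x out_h*out_w) matrix product:

// im2col for one channel, stride 1, "valid" padding.
// Returns a (kh*kw) x (out_h*out_w) matrix, row-major:
// multiplying the flattened kernel (1 x kh*kw) by it gives the convolution output.
fn im2col(input: &[f32], h: usize, w: usize, kh: usize, kw: usize) -> Vec<f32> {
    let (out_h, out_w) = (h - kh + 1, w - kw + 1);
    let mut cols = vec![0.0f32; kh * kw * out_h * out_w];
    for ky in 0..kh {
        for kx in 0..kw {
            for oy in 0..out_h {
                for ox in 0..out_w {
                    let row = ky * kw + kx;      // which kernel tap
                    let col = oy * out_w + ox;   // which output pixel
                    cols[row * out_h * out_w + col] = input[(oy + ky) * w + (ox + kx)];
                }
            }
        }
    }
    cols
}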

On the gemm-1-m branch, I'm focusing on the two main convolution cases used by mobilenet: the depthwise one and the pointwise one. It is straightforward to translate the pointwise one to a simple matmul; the depthwise one needs a bit more work...
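The pointwise (1x1) case really is just a reshape plus a matmul: an NHWC activation of shape [1, h, w, c_in], viewed as an (h*w) x c_in matrix and multiplied by a c_in x c_out kernel, gives the [1, h, w, c_out] output directly. A naive sketch of that mapping (illustrative only, not the linalg kernel):

// 1x1 ("pointwise") convolution as a plain matrix multiplication.
// input:  hw rows of c_in values (the NHWC activation flattened over space)
// kernel: c_in rows of c_out values
// output: hw rows of c_out values
fn pointwise_conv(input: &[f32], kernel: &[f32], hw: usize, c_in: usize, c_out: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; hw * c_out];
    for p in 0..hw {
        for o in 0..c_out {
            let mut acc = 0.0f32;
            for i in 0..c_in {
                acc += input[p * c_in + i] * kernel[i * c_out + o];
            }
            out[p * c_out + o] = acc;
        }
    }
    out
}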

kali commented 5 years ago

I observe the same thing on Intel on master: the optimized network is 10% slower than the plain one :) Thanks for noticing this :)

kali commented 5 years ago

All right, I know what happened on master: the "naive" specific implementation I did for the depthwise TF conv is actually relatively good, better than the generic convolution backend. Hopefully, what I'm doing in optim-dw-conv should tip the scale back in the right direction.
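For context, a "naive" depthwise conv just applies one small kernel per channel with no cross-channel accumulation, so the inner loop stays very short. Roughly like this (stride 1, valid padding, NHWC, channel multiplier 1; a sketch, not the actual tract code):

// Naive depthwise 2D convolution, NHWC layout, stride 1, no padding, multiplier 1.
// Each channel ch is convolved with its own kh x kw kernel; channels never mix.
fn depthwise_conv2d(
    input: &[f32], kernel: &[f32],
    h: usize, w: usize, c: usize, kh: usize, kw: usize,
) -> Vec<f32> {
    let (out_h, out_w) = (h - kh + 1, w - kw + 1);
    let mut out = vec![0.0f32; out_h * out_w * c];
    for oy in 0..out_h {
        for ox in 0..out_w {
            for ch in 0..c {
                let mut acc = 0.0f32;
                for ky in 0..kh {
                    for kx in 0..kw {
                        acc += input[((oy + ky) * w + (ox + kx)) * c + ch]
                            * kernel[(ky * kw + kx) * c + ch];
                    }
                }
                out[(oy * out_w + ox) * c + ch] = acc;
            }
        }
    }
    out
}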

kali commented 5 years ago

Hey, wanted to share some progress on mobilenet optimisation. These benches were run on a Raspberry Pi 3 / Raspbian.

(attached image: benchmark chart)

kali commented 5 years ago

Hey, I'm going to close this issue. Some more optimisations may come, but they will be part of non-mobilenet-specific things that I have in mind.