unrealwill / tensorflow-csharp-c-api

Port of the TensorFlow C API to C#
Apache License 2.0
22 stars 3 forks

Updated version? #1

Open jspauld opened 8 years ago

jspauld commented 8 years ago

Any chance you have an updated version that works with the latest version of TensorFlow? This is the only C# wrapper I can find, and I'm having quite a difficult time getting it working.

unrealwill commented 8 years ago

No sorry, this was a one-weekend project. It should help you get started if you want to implement your own wrapper. TensorFlow works best with Python. The quickest way to get useful results in C# is with some RPC: have a look at official TensorFlow Serving, or create your own Python server as an interface to TensorFlow, using ZeroMQ for example. Depending on what you want to do, TensorFlow may not be the most appropriate choice.
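The "Python server as an interface" idea can be sketched with the standard library alone (in practice ZeroMQ would replace the raw socket handling). Everything below is a hypothetical stand-in: the `run_model` function plays the role of a real TensorFlow session, and the line-delimited JSON protocol is illustrative, not anything TensorFlow Serving uses.

```python
import json
import socket
import socketserver
import threading

def run_model(a, b):
    # Hypothetical stand-in for sess.run(...) on a real TensorFlow graph.
    return a + b

class InferenceHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Protocol: one JSON request per line, one JSON reply per line.
        for line in self.rfile:
            req = json.loads(line)
            result = run_model(req["a"], req["b"])
            self.wfile.write((json.dumps({"y": result}) + "\n").encode())

def start_server():
    # Port 0 asks the OS for a free ephemeral port.
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

if __name__ == "__main__":
    server = start_server()
    with socket.create_connection(server.server_address) as conn:
        conn.sendall(b'{"a": 2, "b": 3}\n')
        print(conn.makefile().readline().strip())  # prints {"y": 5}
    server.shutdown()
```

A C# client would then only need a plain TCP socket and a JSON serializer, with no native interop at all.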

jspauld commented 8 years ago

Hey, thanks for your reply. I was able to get things working with official TensorFlow Serving through a Docker container. However, this seems to have very high performance overhead. My goal is to run inference as fast as possible from C#; this is for a stock trading application where performance is critical. I'm running relatively simple models, and the overhead of going through TensorFlow Serving was greater than the time spent running the model itself.

jspauld commented 8 years ago

For anyone else reading, I think one of the problems I was having getting this wrapper working is related to this: https://github.com/tensorflow/tensorflow/issues/3814

unrealwill commented 8 years ago

Hello, from a theoretical performance standpoint, TensorFlow Serving shouldn't be too bad.

My guess is that you have high latency because the batching scheduler is waiting for more requests so it can serve them as a batch and increase throughput. To optimize your latency, have a look at: https://github.com/tensorflow/serving/blob/master/tensorflow_serving/batching/README.md
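For reference, the batching scheduler described in that README is configured through a `BatchingParameters` text proto. A sketch along these lines (the values are illustrative, and field names should be checked against the linked README for the version in use) tells the scheduler not to wait for a batch to fill, trading throughput for latency:

```
max_batch_size { value: 1 }
batch_timeout_micros { value: 0 }
num_batch_threads { value: 1 }
```

With `batch_timeout_micros` at 0, each request is processed as soon as it arrives rather than being held back for batch-mates.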

If you have a server polling on the CPU, and both the server and your app run at a high enough priority to avoid the potential ~10 ms latency associated with context switches, the added latency should be on the order of 0.1 ms (for localhost, or inside the same datacenter). Unless you are doing high-frequency trading (which, by the way, you shouldn't be doing from C# using random code from the internet anyway), this shouldn't be a problem. There shouldn't be much data to transfer per request.

jspauld commented 8 years ago

I measured the overhead of using TensorFlow Serving through Docker on my MacBook Pro at 0.6 ms. I was able to do this by generating a dummy model that returns a+b. According to the documentation, batching is turned off by default, so I don't think that's the problem. Perhaps it's that I'm using Docker, or a Mac.

In any case, I'd definitely like to avoid a 0.1 ms overhead. I AM doing high-frequency trading in the sense that it matters here.

From what I've read, C# is generally not much slower than C++. While I understand a skilled C++ developer will be able to squeeze better performance out of their code, at this point I think my time is better spent trying to avoid the 0.1–0.6 ms TensorFlow Serving overhead... I think that dwarfs any potential gains from not using C#.

Not sure why you're saying not to use "random code on the internet" -- pretty sure everybody does that!

unrealwill commented 8 years ago

0.6 ms is fine for most applications. To get to 0.1 ms with RPC, you need to open the TCP socket beforehand and reuse it for multiple requests (making sure Nagle's algorithm is disabled). ZeroMQ does that (I don't know whether TensorFlow Serving does or not).
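That advice can be sketched in Python's standard library: open the connection once, disable Nagle's algorithm with `TCP_NODELAY`, and time repeated round trips over the same socket. The echo server here is a hypothetical stand-in for a serving endpoint; the timing loop just shows how one would measure per-request overhead.

```python
import socket
import statistics
import threading
import time

def start_echo_server():
    # Hypothetical stand-in for a model server: echoes each line back.
    srv = socket.create_server(("127.0.0.1", 0))
    def serve():
        conn, _ = srv.accept()
        with conn:
            for line in conn.makefile("rb"):
                conn.sendall(line)
    threading.Thread(target=serve, daemon=True).start()
    return srv.getsockname()

if __name__ == "__main__":
    addr = start_echo_server()
    with socket.create_connection(addr) as conn:
        # Disable Nagle's algorithm so small requests go out immediately
        # instead of being buffered while waiting for an ACK.
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        reader = conn.makefile("rb")
        samples = []
        for _ in range(100):  # reuse the same connection for every request
            t0 = time.perf_counter()
            conn.sendall(b"ping\n")
            reader.readline()
            samples.append(time.perf_counter() - t0)
        print(f"median round trip: {statistics.median(samples) * 1e6:.0f} us")
```

The key point is that connection setup (TCP handshake, allocation) happens once, outside the timed loop, so each request pays only the per-message cost.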

To get below 0.1 ms you would indeed have to use a wrapper. TensorFlow is a pain to compile, changes often (as you noticed), and its C++ functionality is minimal. For C++, you may use Torch/Caffe/MXNet, which are more user-friendly and can then be wrapped more easily.

The main issue with C# for real-time applications is not speed but the lack of control over latency due to the garbage collector, which can incur random delays. This can be mitigated by not allocating memory too often and by manually triggering GC at specific times, but it's still a sword of Damocles that hits you at the worst possible moment: a market crash occurs -> more network messages -> C# suddenly needs to GC -> you get hit by huge slippage.

Once you are going for sub-ms latency, you'll be competing with the big boys, which means you have to be physically near the exchange, because overall latency will be dominated by network latency (which costs big $).

You won't get an edge against funds and banks who have spent millions ensuring a clean path to the exchange and use dedicated circuits (FPGAs) to achieve ultra-low latency. These guys don't use random code from the internet.

You may have an edge at outsmarting them using advanced ML, but then it's not the 1 ms of latency that makes the difference. Though the competition is hard, they are moving so much money that there are crumbs to pick up for smaller players.

jspauld commented 8 years ago

Thanks for the tips! I've done this stuff before -- co-located at an exchange, which didn't necessarily cost that much money. It's safe to say I have no intention of competing with the big boys on speed. And yeah, in a sense I will be "picking up crumbs"... assuming it works at all.

Sounds like you have some experience in this industry?

I realized I don't necessarily have to make a C# wrapper. I can just write C++ code that loads a graph and starts a session; then I only need to use P/Invoke for one call -- the inference call. Of course, there seems to be very little documentation on even loading a graph in C++.
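That single-native-call pattern (all graph/session setup lives on the native side, and only one entry point crosses the interop boundary) is what P/Invoke does in C#; Python's `ctypes` illustrates the same idea. Since the `run_inference` export described above is hypothetical, libm's `cos` stands in for it here:

```python
import ctypes
import ctypes.util

# Load a shared library. libm's cos() stands in for a hypothetical
# run_inference() exported from C++ code that owns the TF graph and session.
libname = ctypes.util.find_library("m") or "libm.so.6"  # fallback for slim Linux images
libm = ctypes.CDLL(libname)

# Declaring the signature once is the interop-boundary analogue of a
# [DllImport] declaration in C#.
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

print(libm.cos(0.0))  # prints 1.0
```

The cost paid per call is just the foreign-function dispatch plus argument marshalling, which for a single flat call with primitive arguments is far below the RPC overheads discussed above.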

If you happen to know of a more appropriate library than TensorFlow, please let me know. I believe TF is designed for distributed computing and may have overhead even when I'm running it locally and directly.

jspauld commented 8 years ago

There is http://accord-framework.net/ for C#, but the fact that it is written entirely in C# makes me think it will be slow. I'm planning to test it against TensorFlow's performance.